Exploration Of Student Performance Notebook

The Data Set

The dataset for this exploration was taken from the popular website kaggle see source.
This data contains 5 features and 3 exam scores.

Features

Name	Description	Type
gender	male or female	object
race/ethnicity	Groups A-E	object
parental level of education	how educated were the students parents	object
lunch	whether the student got a reduced or standard lunch	object
test preparation course	completed or none done at all	object

Exam Scores

math score (0-100)
reading score (0-100)
writing score (0-100)

The Goal

The goal of this exploration is to determine relationship of the features on the students performance. We will attempt to do this using some groupby aggregation in pandas and some visualizations using seaborn all in a python jupyter notebook. Finally we will construct a random forest based model using sklearn to try to predict a given students performance.

EDA

import pandas as pd

df = pd.read_csv('https://github.com/Aidzillafont/Student-Performace-/blob/37be2cfff7c6c02ecb002231fb88e0b8647cc0b3/StudentsPerformance.csv?raw=true')

df.head()

	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
0	female	group B	bachelor's degree	standard	none	72	72	74
1	female	group C	some college	standard	completed	69	90	88
2	female	group B	master's degree	standard	none	90	95	93
3	male	group A	associate's degree	free/reduced	none	47	57	44
4	male	group C	some college	standard	none	76	78	75

df.dtypes

gender                         object
race/ethnicity                 object
parental level of education    object
lunch                          object
test preparation course        object
math score                      int64
reading score                   int64
writing score                   int64
dtype: object

df.describe()

	math score	reading score	writing score
count	1000.00000	1000.000000	1000.000000
mean	66.08900	69.169000	68.054000
std	15.16308	14.600192	15.195657
min	0.00000	17.000000	10.000000
25%	57.00000	59.000000	57.750000
50%	66.00000	70.000000	69.000000
75%	77.00000	79.000000	79.000000
max	100.00000	100.000000	100.000000

As can be seen there is a number of scores on maths, reading and writing. These are for varying types of people. Lets aggregate some of the groups to see if there is an things standing out.

features = []
for col in df.columns:
  if df[col].dtype==object:
    features.append(col)
    print(df.groupby(col).agg(['mean']).round(2),'\n')

       math score reading score writing score
             mean          mean          mean
gender                                       
female      63.63         72.61         72.47
male        68.73         65.47         63.31 

               math score reading score writing score
                     mean          mean          mean
race/ethnicity                                       
group A             61.63         64.67         62.67
group B             63.45         67.35         65.60
group C             64.46         69.10         67.83
group D             67.36         70.03         70.15
group E             73.82         73.03         71.41 

                            math score reading score writing score
                                  mean          mean          mean
parental level of education                                       
associate's degree               67.88         70.93         69.90
bachelor's degree                69.39         73.00         73.38
high school                      62.14         64.70         62.45
master's degree                  69.75         75.37         75.68
some college                     67.13         69.46         68.84
some high school                 63.50         66.94         64.89 

             math score reading score writing score
                   mean          mean          mean
lunch                                              
free/reduced      58.92         64.65         63.02
standard          70.03         71.65         70.82 

                        math score reading score writing score
                              mean          mean          mean
test preparation course                                       
completed                    69.70         73.89         74.42
none                         64.08         66.53         64.50 

A quick review of the above we can make some of the following observations, in this data set we can see that:

Women score on average higher than men in every subject excluding maths
Group A perform poorly where Group E are on average score the highest
The higher your parents education is the higher yours should be.
Students who received a free/reduced lunch performed worse than those who received a standard lunch.
Students who did a test preparation course performed better in exams.

Lets try an visualize some of these patterns in a few visualizations

Visualizations

import seaborn as sns
import matplotlib.pyplot as plt
#targets = ['math score', 'reading score', 'writing score']

#first we need to melt the data set to pass to out seaborn box plot
df_melt = pd.melt(df, id_vars=features, var_name='exam type', value_name='score')

for feat in features:
  plt.figure(figsize=(12,6))
  plot = sns.boxplot(x=feat,hue='exam type', y='score', data=df_melt)

png

The above plots visualize the spread of scores for each subject across each variable. As we can see here even though we observed different means there is a lot of overlap. This suggests that given you may have features that should reduce your score you can over come it.

However if you have every thing against you or everything for you this does seem to have a clear indication of performance. below we can see results of the students with the worst set of features vs those with the best. Here you can see a clear separation

df_worst = df[(df['gender']=='male')&(df['race/ethnicity']=='group A')&(df['parental level of education']=='high school')&(df['lunch']=='free/reduced')&(df['test preparation course']=='none')]
df_best = df[(df['gender']=='female')&(df['race/ethnicity']=='group E')&(df['parental level of education']=='master\'s degree')&(df['lunch']=='standard')&(df['test preparation course']=='completed')]

df_worst

	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
395	male	group A	high school	free/reduced	none	48	45	41
688	male	group A	high school	free/reduced	none	53	58	44
811	male	group A	high school	free/reduced	none	45	47	49

df_best

	gender	race/ethnicity	parental level of education	lunch	test preparation course	math score	reading score	writing score
685	female	group E	master's degree	standard	completed	94	99	100
995	female	group E	master's degree	standard	completed	88	99	95

Its worth noting that a lot of these features are out of the students control. In fact the only two that a prospective student can control is their lunch and test course preparation. So lets look at the combined effect of these in some plots and groupings and see how the separate a students performance

df.groupby(['lunch', 'test preparation course']).agg(['mean'])

		math score	reading score	writing score
		mean	mean	mean
lunch	test preparation course
free/reduced	completed	63.045802	69.870229	70.351145
free/reduced	none	56.508929	61.602679	58.736607
standard	completed	73.533040	76.215859	76.766520
standard	none	68.133971	69.177033	67.595694

pd.options.mode.chained_assignment = None 

df_melt_fn = df_melt[(df_melt['lunch']=='free/reduced')&(df_melt['test preparation course']=='none')]
df_melt_sc = df_melt[(df_melt['lunch']=='standard')&(df_melt['test preparation course']=='completed')]

df_melt_fn.loc[:,'prep and lunch']='neither'
df_melt_sc.loc[:,'prep and lunch']='both'

df_melt2 = df_melt_fn.append(df_melt_sc, ignore_index=True)
plot = sns.boxplot(x='prep and lunch',hue='exam type', y='score', data=df_melt2)

png

As can be seen from the above plot people who have improved there lunch and taken an test preparation course have seen better results according to the data set.

Build a Pipeline and Model

Now we have done our EDA showing how all of the feature variables separate performance of students lets build a model to predict performance of a student in the maths subject.

First we need to make a test train split

from sklearn.model_selection import train_test_split
target = ['math score']
X, y = df[features], df[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=69)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(700, 5) (300, 5) (700, 1) (300, 1)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

pipe = Pipeline([('hot', OneHotEncoder()), ('rf', RandomForestRegressor())])

param_grid = {
    "rf__n_estimators": [300, 500, 1000],
    "rf__criterion": ['absolute_error'],
    "rf__max_depth": [None,2,3,4],
}


cv = GridSearchCV(pipe, param_grid, n_jobs=-1, cv=3, verbose=10, scoring='neg_mean_absolute_error')
cv.fit(X_train, y_train.values.ravel())

print('The best score was: ', cv.best_score_, ' with the following parameters.\n')
print(cv.best_params_)

Fitting 3 folds for each of 12 candidates, totalling 36 fits
The best score was:  -11.100714033886424  with the following parameters.

{'rf__criterion': 'absolute_error', 'rf__max_depth': 4, 'rf__n_estimators': 300}

import numpy as np
from sklearn.metrics import mean_absolute_error
preds = cv.best_estimator_.predict(X_test)
#print(np.sqrt(np.mean(np.power(preds - y_test.values,2))))
mean_absolute_error(preds,y_test)

10.68906111111111

Here we can see our model gets a absolute error of 11.1 and 10.69 on train and test respectively. This means our model was on average predicting a score around 11 percentage points away from our true score.

Lets change our regression problem into a classification problem by changing scores to grades following the below schema

Grade	Score
A	90-100
B	70-89
C	50-69
D	40-49
F	0-39

In python we can very easily build a function to do this.

def grade_conv(x):
  if x>=90:
    return 'A'
  elif x>=70:
    return 'B'
  elif x>=50:
    return 'C'
  elif x>=40:
    return 'D'
  else:
    return 'F'

apply the function to our y_test and y_train

y_train_grade = y_train[target[0]].apply(lambda x: grade_conv(x))
y_test_grade = y_test[target[0]].apply(lambda x: grade_conv(x))

Below we build a random forest classifier and run a grid search cross validation to train our hyper parameters.

from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([('hot', OneHotEncoder()), ('rf', RandomForestClassifier())])

param_grid = {
    "rf__n_estimators": [300, 500, 700],
    "rf__criterion": ['gini', 'entropy'],
    "rf__max_depth": [None,2,3,4],
    "rf__max_features": ['auto','sqrt','log2',None]
}


cv = GridSearchCV(pipe, param_grid, n_jobs=-1, cv=3, verbose=10)
cv.fit(X_train, y_train_grade)

print('The best score was: ', cv.best_score_, ' with the following parameters.\n')
print(cv.best_params_)

Fitting 3 folds for each of 96 candidates, totalling 288 fits
The best score was:  0.49283469669735763  with the following parameters.

{'rf__criterion': 'gini', 'rf__max_depth': 2, 'rf__max_features': 'sqrt', 'rf__n_estimators': 700}

from sklearn.metrics import accuracy_score
preds = cv.best_estimator_.predict(X_test)
accuracy_score(preds,y_test_grade)

0.53

here we get an accuracy score of 53% which is only slightly better than guessing the most common class the C grade

Conclusions

Where the random forest model did not perform as well as hoped we can certainly look to investigate more predictive models in the future.

As for insights to the data they are summarised by:

Women score on average higher than men in every subject excluding maths
Group A perform poorly where Group E are on average score the highest
The higher your parents education is the higher yours should be.
Students who received a free/reduced lunch performed worse than those who received a standard lunch.
Students who did a test preparation course performed better in exams.

We saw these by looking at some bar-plots using seaborn along with some groupby in pandas

If you would like to check out this notebook yourself please take a look at my github