Dealing with Data

A collection of posts about different kinds of data and ways to analyze and visualize them.

Visualizing the PISA dataset Part 1 - Preparing the data


The Programme for International Student Assessment (PISA) is a study of the Organisation for Economic Co-operation and Development (OECD) designed to assess and compare the scholastic performance of 15-year-olds around the world. In addition to results from assessments of the students' math, science, and reading skills, the PISA dataset also contains plenty of background information about the students and their education.

Recently, a PISA visualization contest was held at the useR! 2014 conference with the aim of showing the potential of R for the analysis and visualization of large and complex data sets. The contest had two tracks, each built around a question: Track 1, "Schools matter" - the importance of school factors in explaining academic performance - and Track 2, inequalities in academic achievement.

In this post I'd like to present an approach to answering the first question by taking a look at the dataset of the 2012 PISA study, which contains data on 485 490 students from 68 different countries.


Data

The dataset used here is freely available on the PISA homepage and contains the students' assessment results as well as background information derived from a questionnaire answered by the students. Since the data is already provided as an .RData file, I am going to use R for the analysis.

library("ggplot2")
library("lattice")
library("gridExtra")
library("Hmisc")


As a first step we load the dataset and take a look at the dimensions.

load("student2012.rda")
dim(student2012)
## [1] 485490    635

There are 635 variables recorded for each of the 485 490 students.


Variable selection

With the large number of columns in the dataset we can use any help we can get to select meaningful variables. Luckily, there is a codebook describing each of the variables. Before doing the analysis we first have to develop a clear understanding of the question we are trying to answer. Let's formulate a question that we can directly address with the available data, to get a clear idea of what we are looking for: “How does the performance of a student in a given subject depend on the support they get at school?” There are plenty of indicators for school support. A very simple one is the time the students spend in class.

For each student we can find the number of class periods per week for a given subject and the average minutes per class period. According to the codebook they are stored in the columns ST69Q01–ST69Q03 for the length of the class periods and ST70Q01–ST70Q03 for the number of class periods. These values differ from country to country and even from school to school. I am therefore going to split the dataset into a list of datasets, one for each country. We can then look at the first country (Albania) and see if there is a trend of performance with learning time. For this purpose I am fitting a linear model with R's lm function while restricting the analysis to one subject (math). There are five plausible values for each student's math performance, PV1MATH–PV5MATH. They are all highly correlated, so it is fine to just pick one of them for the following analysis.
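Before relying on PV1MATH alone, we can quickly check this claim (a small sanity check I am adding here; it assumes student2012 is loaded as above):

pv.math <- student2012[, paste0("PV", 1:5, "MATH")]
# Correlation matrix of the five plausible values for math
round(cor(pv.math, use="pairwise.complete.obs"), 3)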

# Split the dataset into a list of data frames, one per country
cnt.list <- split(student2012, student2012$CNT)

# First country (Albania): weekly math learning time in school =
# minutes per class period (ST69Q02) * class periods per week (ST70Q02)
cnt.data <- cnt.list[[1]]
model.data <- data.frame(t.school=cnt.data$ST69Q02 * cnt.data$ST70Q02,
                         score=cnt.data$PV1MATH)
mod.lm <- lm(score ~ t.school, data=model.data)
summary(mod.lm)
## 
## Call:
## lm(formula = score ~ t.school, data = model.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -321.50  -56.71    2.72   61.18  262.68 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.924e+02  6.584e+00  59.594   <2e-16 ***
## t.school    1.252e-03  3.701e-02   0.034    0.973    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 91.5 on 2607 degrees of freedom
##   (2134 observations deleted due to missingness)
## Multiple R-squared:  4.387e-07,  Adjusted R-squared:  -0.0003831 
## F-statistic: 0.001144 on 1 and 2607 DF,  p-value: 0.973

The summary shows that the slope of the model is 1.252e-03, i.e. practically zero, so there is no recognizable trend. Furthermore, the p-value of the slope, i.e. the indicator of its statistical significance, is very high (0.973), suggesting that changes in performance are not related to changes in learning time.

It seems that at least for Albania there is no trend that would suggest that studying more results in better performance. But what about the other countries? We can use the same systematic approach to check for trends. I would also like to get an estimate for the time the students spend studying at home, or generally outside of school, by including the variable ST57Q01. According to the codebook, this is the number of hours spent on homework. This way we can compare the effect of the two times and get a better feeling for the impact of the work done at school.

slopes.home <- numeric()
slopes.school <- numeric()
pvals.home <- numeric()
pvals.school <- numeric()
# Fit a linear model per country for each of the two learning times and
# collect the slopes and the p-values of the slopes
for (i in seq_along(cnt.list)) {
  cnt.data <- cnt.list[[i]]
  model.data <- data.frame(t.school=cnt.data$ST69Q02 * cnt.data$ST70Q02,
                           t.home=cnt.data$ST57Q01 * 60.,  # hours -> minutes
                           score=cnt.data$PV1MATH)
  mod.lm.school <- lm(score ~ t.school, data=model.data)
  mod.lm.home <- lm(score ~ t.home, data=model.data)
  slopes.school <- c(slopes.school, mod.lm.school$coefficients[2])
  slopes.home <- c(slopes.home, mod.lm.home$coefficients[2])
  pvals.school <- c(pvals.school, summary(mod.lm.school)$coefficients[2, 4])
  pvals.home <- c(pvals.home, summary(mod.lm.home)$coefficients[2, 4])
}


We have now determined the trends for each country and can compare them in a plot.

plot.data.school <- data.frame(country=names(cnt.list), 
                               slope=slopes.school, 
                               pval=cut2(pvals.school, g=4))
g <- ggplot(data=plot.data.school, aes(x=country, y=slope, fill=pval)) 
g <- g + geom_bar(stat="identity", position="identity", width=0.8, alpha=0.5)
g <- g + coord_flip() 
g <- g + geom_hline(yintercept=0, color="red")
g <- g + theme_bw()
g

[Figure: bar chart of the fitted slopes per country, colored by p-value quartile]

The plot shows that there are in fact a few countries with no, or sometimes even negative, correlations between learning times and performance. The majority of the countries, however, are in the positive regime and show statistically significant trends.
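To put a number on this impression, we can tally the slopes and p-values collected in the loop above (a quick sketch reusing slopes.school and pvals.school):

# Countries with a positive school-time slope ...
sum(slopes.school > 0, na.rm=TRUE)
# ... of which statistically significant at the 5% level
sum(slopes.school > 0 & pvals.school < 0.05, na.rm=TRUE)
# Total number of countries
length(slopes.school)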


We can pick a specific country to further investigate the relationship between studying time and performance. Let's take the USA. I would have chosen Germany, but it is one of the countries that shows a negative correlation and would destroy the uplifting spirit of this post.

data.usa <- cnt.list[["United States of America"]]
model.data <- data.frame(t.school=(data.usa$ST69Q02 * data.usa$ST70Q02),
                         t.home=data.usa$ST57Q01 * 60.,
                         score=data.usa$PV1MATH)
model.data <- model.data[complete.cases(model.data), ]


LOESS smoothing

To better represent the trends in the data we are now using a local regression (LOESS) approach. In this method, simple models are fitted to local subsets of the data around each point and then combined, so that the resulting curve follows the overall distribution of the data points while still accounting for variations on small scales.

g1 <- ggplot(aes(x=t.school, y=score), data=model.data)
g1 <- g1 + geom_point(color="blue", alpha=0.5, size=3)
g1 <- g1 + stat_smooth(method="loess", span=1., size=2, color="red")
g1 <- g1 + labs(x="learning time school [min]")
g1 <- g1 + theme_bw()

g2 <- ggplot(aes(x=t.home, y=score), data=model.data)
g2 <- g2 + geom_point(color="blue", alpha=0.5, size=3)
g2 <- g2 + stat_smooth(method="loess", span=1., size=2, color="red")
g2 <- g2 + labs(x="learning time home [min]")
g2 <- g2 + theme_bw()

grid.arrange(g1, g2, nrow=2)

[Figure: math score vs. weekly learning time at school (top) and at home (bottom), with LOESS fits and 95% confidence bands]

As can be seen in the top panel, the performance of the students increases up to a learning time of approximately 500 min per week. Beyond that, the uncertainty of the fit increases due to the low number of data points available for those times. The gray shaded area shows the 95% confidence interval; because it is so wide, it is not clear whether the performance keeps increasing or decreases again. The trend for homework time in the bottom panel is more homogeneous and has a smaller uncertainty.

It is interesting that both panels show such a wide range of math study times for students from the same country, from 0 to over 1200 min a week spent in school and even more at home. We have no way to check whether a study time of more than 20 hours of math a week in the classroom is unreasonable, since the PISA dataset simply records the students' answers to the questionnaire. So for now we assume that the questionnaires were filled in correctly and give a realistic representation of the actual time spent studying math.
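To make the range explicit, here is a quick look at the distributions of the reported times in the US subset (a small check based on the model.data frame defined above; times are in minutes per week):

# Five-number summaries of the weekly study times reported by the US students
summary(model.data$t.school)
summary(model.data$t.home)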

It is reasonable to assume that math performance depends on both the time spent studying at school and the time dedicated to homework. We are therefore going to construct a LOESS model that takes both times into account simultaneously.

mod.loess <- loess(score ~ t.school + t.home, data=model.data, span=1.)
summary(mod.loess)
## Call:
## loess(formula = score ~ t.school + t.home, data = model.data, 
##     span = 1)
## 
## Number of Observations: 3020 
## Equivalent Number of Parameters: 6.59 
## Residual Standard Error: 80.26 
## Trace of smoother matrix: 7.16 
## 
## Control settings:
##   normalize:  TRUE 
##   span       :  1 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2


Results

We can now plot the result to get an idea of the trend that the LOESS model describes.

fit.data <- data.frame(x=model.data$t.school, 
                       y=model.data$t.home,
                       z=mod.loess$fitted)
with(fit.data, cloud(z ~ x + y, xlab="time school", 
                                 ylab="time home",
                                 zlab="score",
                     col="blue", lwd=2))

[Figure: 3D cloud plot of the fitted math scores over school and home learning time]

This is a nice representation of the dependence of the math score on the time studied in school and at home. We can see that for low times the score is low, but it increases when more time is spent studying in school or at home. This example only describes the trend found for US schools but can easily be extended to other countries.
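As a rough sketch of such an extension, the fit can be wrapped in a small helper function; the function name and the example country are my own choice and assume the country name appears in cnt.list exactly as spelled in the dataset:

# Fit the two-predictor LOESS model for an arbitrary country from cnt.list
fit.loess.cnt <- function(cnt.name, span=1.) {
  cnt.data <- cnt.list[[cnt.name]]
  model.data <- data.frame(t.school=cnt.data$ST69Q02 * cnt.data$ST70Q02,
                           t.home=cnt.data$ST57Q01 * 60.,
                           score=cnt.data$PV1MATH)
  model.data <- model.data[complete.cases(model.data), ]
  loess(score ~ t.school + t.home, data=model.data, span=span)
}

mod.loess.can <- fit.loess.cnt("Canada")  # hypothetical example country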

As a last step, I will write out the data of the model fit so that I can produce a nicer version of the plot with the Persistence of Vision Raytracer (POVray).

# Normalize each column to the range [0, 1]
normalize <- function(x) {
  (x - min(x, na.rm=TRUE)) / (max(x, na.rm=TRUE) - min(x, na.rm=TRUE))
}
fit.data <- as.data.frame(apply(fit.data, 2, normalize))
model.data <- as.data.frame(apply(model.data, 2, normalize))

# Write out the loess fit 
write.table(fit.data, "plot_fit.csv", sep=",", quote=FALSE,
            row.names=FALSE, col.names=FALSE, eol=",")

# Write also out the data
write.table(model.data, "plot_data.csv", sep=",", quote=FALSE,
            row.names=FALSE, col.names=FALSE, eol=",")

These files can now be used to produce a three-dimensional graphic of the plot.