Tuesday, April 11, 2017

Testing Rasch model in R with eRm

Motivation & Preparation

My first actual post is dedicated to the Rasch model. It is one of the most important test models in psychometrics because it features invariant comparisons of persons and items. This allows you, for example, to compare test scores of individuals fairly, even when they answered different questions (items). Drawbacks of the Rasch model are

  1. it’s very strict and
  2. it’s not implemented in standard statistical software (e.g., SPSS)

The first problem is indeed a little tricky, and a lot of domain knowledge and psychometric experience is necessary to create items that fit the Rasch model. The second, on the other hand, can be solved with various software packages. Today I want to show you how it can be done in R with the eRm package.

But first we have to make some preparations.

## load packages
library(eRm)
library(readr)
library(tidyverse)

## get data to test
filename <- "rasch.csv"

if (!file.exists(filename)) {
    res <- tryCatch(download.file("http://bit.ly/rasch_sim", destfile = filename, 
        method = "auto"), error = function(e) 1)
}

test_data <- read_csv(filename, col_names = TRUE)

Let’s take a quick look at the first few lines of our data:

test_data %>%
  head() %>%
  knitr::kable(booktabs = TRUE)
 u1  u2  u3  u4  u5  u6  u7  u8  u9  u10  group
  1   1   1   1   1   1   1   1   1    0      1
  1   1   1   1   0   0   1   0   0    0      1
  1   1   1   0   0   0   0   0   0    0      1
  1   1   1   1   0   1   0   0   0    0      1
  1   1   1   1   1   1   1   0   0    0      1
  1   1   0   0   1   1   0   0   0    0      1

The data import seems to have worked fine and we are ready for the analysis.
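
If you want an extra sanity check before modelling, a quick look at the proportion of correct responses per item can help; a minimal sketch using dplyr (the column names u1:u10 are the ones shown above):

## optional sanity check: proportion of correct responses per item
test_data %>%
  select(u1:u10) %>%
  summarise(across(everything(), mean))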

Test the Rasch model

We start by estimating the Rasch model

model <- test_data %>%
  select(u1:u10) %>%
  RM()

Then we compare the model for subgroups within our data - the more, the better. In our case we have the grouping variable group and low vs. high scorers (split at the median). This checks whether the items measure the same construct and not some other group-specific features.

LR_group <- LRtest(model, splitcr = test_data[["group"]])
LR_median <- LRtest(model, splitcr = "median")
#> Warning in LRtest.Rm(model, splitcr = "median"): 
#> The following items were excluded due to inappropriate response patterns
#> within subgroups:
#> u1 u2
#> 
#> Full and subgroup models are estimated without these items!
print("LR Test for group:")
#> [1] "LR Test for group:"
LR_group
#> 
#> Andersen LR-test: 
#> LR-value: 9.451 
#> Chi-square df: 9 
#> p-value:  0.397
print("LR Test for median:")
#> [1] "LR Test for median:"
LR_median
#> 
#> Andersen LR-test: 
#> LR-value: 6.469 
#> Chi-square df: 7 
#> p-value:  0.486

Both Andersen LR-tests are not significant, i.e. the p-values are greater than .05. Hence, the Rasch model seems to hold in this case. Don’t feel bad if that’s not the case for your data. I have cheated and simulated mine, but real data hardly ever fits this well on the first run. Even in good tests I would expect a few items that just don’t fit. To find them, use the item-specific Wald tests.

Waldtest(model, splitcr = test_data[["group"]])
#> 
#> Wald test on item level (z-values):
#> 
#>          z-statistic p-value
#> beta u1       -0.039   0.969
#> beta u2       -1.704   0.088
#> beta u3        0.650   0.516
#> beta u4        0.627   0.531
#> beta u5       -0.669   0.503
#> beta u6        0.429   0.668
#> beta u7        2.271   0.023
#> beta u8        0.541   0.588
#> beta u9        0.070   0.945
#> beta u10      -0.488   0.625
Waldtest(model, splitcr = "median")
#> Warning in Waldtest.Rm(model, splitcr = "median"): 
#> The following items were excluded due to inappropriate response patterns
#> within subgroups:
#> u1 u2
#> 
#> Subgroup models are estimated without these items!
#> 
#> Wald test on item level (z-values):
#> 
#>          z-statistic p-value
#> beta u3       -0.191   0.849
#> beta u4       -0.095   0.924
#> beta u5        0.148   0.883
#> beta u6       -0.844   0.399
#> beta u7        1.407   0.159
#> beta u8        1.596   0.110
#> beta u9        1.094   0.274
#> beta u10      -0.802   0.423

In this case neither split criterion revealed any trouble. The z-statistic for u7 is a little high, but given that multiple tests were carried out, there is nothing to worry about (a Bonferroni-corrected threshold would be .05 divided by the number of tests, i.e. .005 for ten items).
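
If you want to make that correction explicit, here is a minimal sketch that Bonferroni-adjusts the item-level p-values. It assumes the Waldtest result stores its z/p matrix in an element called coef.table; inspect the object with str() if your eRm version names it differently.

## Bonferroni-adjust the item-level Wald p-values
## (assumption: the z-statistic / p-value matrix lives in coef.table)
wald_group <- Waldtest(model, splitcr = test_data[["group"]])
p_adjusted <- p.adjust(wald_group$coef.table[, "p-value"], method = "bonferroni")
round(p_adjusted, 3)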

However, if your LR test was significant, then some items have to be removed from the test, and those with large z-values in the Wald test are a good place to start. Do those items have special features that set them apart from the others in favoring one split group over the other? Don’t just kick out the item with the largest z-value, but reflect on your construct theory before deciding.
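
If you do decide to remove an item, a minimal sketch of the refit could look like this (u7 serves only as an example here, since no item actually misfits in our data):

## illustrative only: drop a candidate item and re-check the Andersen LR-test
model_reduced <- test_data %>%
  select(u1:u10) %>%
  select(-u7) %>%
  RM()

LRtest(model_reduced, splitcr = test_data[["group"]])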

Visualize the Rasch model

Another option to identify problematic items is to look at the Goodness-of-fit plot.

LR_group %>% plotGOF(conf = list(ia = FALSE, col = "blue", lty = "dotted"))

The further an item is from the line, the more problematic it is. At the very least, its confidence ellipse should touch the line.

The Goodness-of-fit plot also gives you a hint of the item difficulties (their location parameters): items in the upper right corner are more difficult than those in the lower left. However, there are more informative plots for this job.

One way is to plot the item characteristic curves (ICCs).

model %>% plotICC(item.subset = 1:4, empCI = TRUE, mplot = TRUE, legpos = FALSE, ask = FALSE)

For brevity I have requested the ICCs only for the first four items, but I hope you get the point. The ICCs tell us where on the latent dimension an item is informative, i.e. where its solving probability is not (almost) zero or one. Negative values on the latent dimension stand for below-average ability, positive values for above-average ability.
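
If you prefer to see all curves in a single panel, eRm also provides a joint ICC plot; a quick sketch (the legend position is just a suggestion):

## all item characteristic curves in one panel
model %>% plotjointICC(legend = TRUE, legpos = "topleft")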

Another option is to create a Person-Item map.

## plot Person-Item map (the person parameters are estimated internally from the model)
plotPImap(model, sorted = TRUE)

In the Person-Item Map we can see the distribution of the person parameters (the abilities of our test persons) and where our items are most informative. In our case it is a pretty good match. The person parameter distribution has a peak in the middle, which is also the zone where most of the items are located. If you have a special target group for your test, e.g., experts, you might want to increase the number of items for their ability range. This way the test becomes more informative and reliable in that region. You could even make your test adaptive and only use the items that are most informative for the specific test person. With negligible loss in reliability this can save huge amounts of time, and due to the nice properties of the Rasch model the person parameters are still fully comparable.
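
If you need the person parameters themselves, e.g., to report scores or as a starting point for adaptive item selection, you can estimate them explicitly with person.parameter(); a minimal sketch:

## estimate the person parameters (abilities) from the Rasch model
person_params <- person.parameter(model)
summary(person_params)

## the individual ability estimates, one per person
head(coef(person_params))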

To visualize in which region your set of items is most informative, you can plot the test information.

model %>% plotINFO(type = "item")