In 2019, the adaptive version of Alva’s logic test was developed entirely using Item Response Theory (IRT). The data used in the process consisted of responses to the existing 45 tasks by 2,299 Alva users and responses to 50 newly designed tasks by an additional standardization sample consisting of 286 participants.
An important part of developing an IRT test is parameter estimation. In the three-parameter logistic (3PL) model, each item is characterized by its location (difficulty), scale (discrimination), and lower asymptote (guessing). Finding optimal estimates for these parameters is a complicated task. In a setting like this, with a complex model and many parameters relative to the number of observations, Bayesian estimation is well suited. We used state-of-the-art sampling methods implemented in the probabilistic programming package PyMC3 (Salvatier, Wiecki & Fonnesbeck, 2016) to build the model and estimate its parameters.
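The core of the 3PL model is the item response function: the probability of a correct answer rises from the guessing floor toward 1 as ability increases. A minimal sketch in NumPy (function and parameter names are illustrative, not Alva's actual code):

```python
import numpy as np

def p_correct(theta, a, b, c):
    """3PL item response function: probability that a person with
    ability theta answers correctly an item with discrimination a,
    difficulty b, and guessing parameter (lower asymptote) c."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# When ability equals item difficulty, the probability is halfway
# between the guessing floor c and 1: with c = 0.2 this gives 0.6.
p_correct(theta=0.0, a=1.2, b=0.0, c=0.2)
```

In the Bayesian setting described above, priors are placed on a, b, c (and on each person's theta), and the posterior is explored by sampling rather than by point optimization.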
As a first step, we estimated parameters for the 45 tasks in the linear test using data from Alva’s platform (N=2,299, 65% males, 34% females; age mean=29.5, SD=9.2). The number of observations per task ranged between 1,743 and 2,299, with a mean of 2,133 and a standard deviation of 252. Tasks that were not reached due to participants running out of time were filtered out.
In the second step, the new tasks were divided into three parallel sets. Each set also included tasks from the existing test, creating an overlap of 15 tasks between the sets. Combining old and new tasks, each set consisted of 39 tasks. Using Amazon Mechanical Turk, 286 participants were recruited and randomly assigned to one of the sets (66% males, 33% females; age mean=33.8, SD=9.3; 27% secondary education, 60% bachelor's degree or equivalent, 11% master's degree or equivalent).
Parameter estimation for the entire bank of tasks was then performed using the item parameters from the first step as priors in the Bayesian model. This, together with the overlapping tasks, ensured that the parameters for the new tasks were on the same scale as those for the old tasks. Normed results from Raven's Standard Progressive Matrices - Plus version (SPM+) for 134 participants were used as priors for the ability parameter, anchoring the scale at appropriate values. This procedure of simultaneous estimation with informative priors means that all available information, from the large sample of Alva users and from previous testing with SPM+, is explicitly taken into account in the final parameter estimates.
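The effect of using step-one results as priors can be illustrated with the simplest conjugate case: a normal prior on one item's difficulty, updated with a normal likelihood from the new sample (all numbers hypothetical; the actual estimation used MCMC over the full 3PL model, not this closed form). The posterior from the first sample becomes the prior for the second, pulling the new estimate toward what is already well established:

```python
import numpy as np

def normal_update(prior_mean, prior_sd, obs_mean, obs_se):
    """Conjugate update of a normal prior with a normal likelihood,
    summarized by an observed mean and its standard error."""
    prior_prec = 1.0 / prior_sd**2   # precision = 1 / variance
    obs_prec = 1.0 / obs_se**2
    post_prec = prior_prec + obs_prec
    post_mean = (prior_prec * prior_mean + obs_prec * obs_mean) / post_prec
    return post_mean, np.sqrt(1.0 / post_prec)

# Hypothetical difficulty for one overlapping task: step one gave
# b = 0.40 (sd 0.10); the smaller MTurk sample alone suggests 0.70
# (se 0.30). The combined estimate stays close to the better-known value.
mean, sd = normal_update(0.40, 0.10, 0.70, 0.30)
```

The same logic, applied inside the sampler, is what keeps the new tasks' parameters anchored to the scale of the old ones.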
During standardization, two calibration studies were conducted. First, a sample of 134 participants completed both Alva's logic test and Raven's SPM+. Participants' percentile scores for SPM+ were estimated using two norm tables presented in the test manual (Tables D.11 and D.13). These norm groups were collected in Germany (Bulheller & Hacker, 1999) and in Poland (Aaron, Jackson & Seerden, 2000). Percentile scores for Alva's test were estimated by transforming the standard scores with the Gaussian cumulative distribution function. The results were comparable to the normed scores from SPM+, as can be seen in the table below.
Percentile score distributions for two norm groups of SPM+ and Alva’s logic test
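Given standard scores on the standard ten scale described in the footnote (Gaussian with mean 5.5 and SD 2), the transformation to percentiles is a direct application of the Gaussian CDF. A minimal sketch using only the standard library (function name is illustrative):

```python
import math

def sten_to_percentile(score, mean=5.5, sd=2.0):
    """Convert a standard ten score to a percentile using the
    Gaussian cumulative distribution function."""
    z = (score - mean) / sd
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# The scale midpoint maps to the 50th percentile; a score one SD
# above the mean (7.5) maps to roughly the 84th percentile.
sten_to_percentile(5.5)
```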
The second calibration study involved 55 members of Mensa Sweden who completed the adaptive version of Alva's logic test. Mensa is an organization for highly intelligent individuals; only those with a measured IQ above 130 are admitted. Since Alva's logic test measures an ability closely related to IQ, Mensa members should achieve results well above average. Specifically, given some measurement error in the Mensa admission process, the average result for Mensa members should be close to 9.1 with a standard deviation close to 1.2*. The observed average score was 8.9 and the observed standard deviation was 1.2, which is comparable to the expected values. The expected (simulated) and observed standard score distributions are shown in the graph below.
* These figures are based on a simulation in which 100,000 samples were drawn from the IQ score distribution (Gaussian with a mean of 100 and a standard deviation of 15) to represent true ability levels, and random noise was added (Gaussian with a mean of 0 and a standard deviation of 9) to represent measurement error. Simulated results above 130 were labeled 'admitted', and the mean and standard deviation of the admitted group's true ability scores were calculated and transformed to the standard ten scale (Gaussian with a mean of 5.5 and a standard deviation of 2).
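The simulation described in the footnote is short enough to reproduce. A NumPy sketch (the seed is arbitrary; any seed gives values close to the figures above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True ability levels, and observed scores with measurement error.
true_iq = rng.normal(100.0, 15.0, n)
observed = true_iq + rng.normal(0.0, 9.0, n)

# Apply the Mensa admission cutoff to the *observed* scores, then
# look at the true ability of those admitted.
admitted = true_iq[observed > 130.0]

# Transform true IQ of admitted members to the standard ten scale
# (mean 5.5, SD 2). Mean and SD land close to 9.1 and 1.2.
sten = 5.5 + 2.0 * (admitted - 100.0) / 15.0
print(sten.mean(), sten.std())
```

The admitted group's true-ability mean falls below the cutoff's sten equivalent because some members were admitted partly thanks to favorable measurement error, a regression-toward-the-mean effect the study uses to set its expected values.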
Luo, Y. & Jiao, H. (2018). Using the Stan program for Bayesian item response theory. Educational and Psychological Measurement, 78(3), 384-408. DOI: 10.1177/0013164417693666
Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2:e55. DOI: 10.7717/peerj-cs.55