Norm groups in modern tests
Why are norm groups necessary for classical tests, but not for modern tests?
Written by Morgan Pihl

For most psychological tests based on classical test theory, norm groups are necessary to calculate results. Typically, the sum of scores from the items in the test is transformed to a standard scale using the mean and standard deviation of the norm group. Since the norm group defines the properties of the scale, it is important to make sure that the quality of the norm group is adequate. 
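As a rough illustration of the classical approach described above, the following sketch converts a raw sum score to a standard scale using a norm group's mean and standard deviation. The T-score scale (mean 50, standard deviation 10), the function name `to_t_score`, and the norm data are assumptions made up for this example; they are not taken from any particular test.

```python
from statistics import mean, stdev


def to_t_score(raw_sum, norm_sums):
    """Convert a raw sum score to a T-score (mean 50, SD 10).

    The scale is defined entirely by the norm group: its mean and
    standard deviation decide what counts as an average or a high score.
    """
    norm_mean = mean(norm_sums)
    norm_sd = stdev(norm_sums)  # sample standard deviation of the norm group
    z = (raw_sum - norm_mean) / norm_sd
    return 50 + 10 * z


# Hypothetical norm group of raw sum scores (illustrative values only).
norm_sums = [12, 15, 18, 20, 22, 25, 28]

print(to_t_score(20, norm_sums))  # a score equal to the norm mean maps to 50
print(to_t_score(25, norm_sums))  # a score above the norm mean maps above 50
```

Changing the norm group changes every result on the scale, which is why the quality of the norm group matters so much in this approach.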

Alva's tests are built on modern test theory, also called Item Response Theory (IRT). In this context, the equivalent of the norm group is the data used to train the statistical model.

For example, when developing Alva's adaptive logic test, data from 2,585 individuals were used to estimate the difficulty of 84 tasks, how well they differentiate between high and low ability, and the probability of guessing the correct answer. These item parameters are then taken into account when calculating results for the logic test.
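The three quantities mentioned above (difficulty, ability to differentiate, and probability of guessing) correspond to the parameters of what is commonly called the three-parameter logistic (3PL) model. The sketch below shows that general form. Whether Alva's logic test uses exactly this parameterization is an assumption, and the function name and parameter values are made up for illustration.

```python
import math


def p_correct(theta, a, b, c):
    """Probability of a correct answer under a 3PL item response model.

    theta: ability of the test taker
    a: discrimination (how sharply the item separates high and low ability)
    b: difficulty (ability level at which the item is "half solved")
    c: guessing (probability of a correct answer with very low ability)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item: moderate discrimination, average difficulty,
# and a one-in-four chance of guessing correctly.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_correct(theta, a=1.2, b=0.0, c=0.25), 3))
```

When ability equals the item's difficulty, the probability sits halfway between the guessing level and 1. For very low ability it approaches the guessing level, and for very high ability it approaches 1.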

With IRT, the item parameters can be re-estimated continuously as new data are collected, so the results can become more precise over time. A norm group, in comparison, is static once it has been collected.

As a consequence, results from modern tests are not dependent on any single norm group. Instead, they depend on the item parameters that are estimated from data. The item parameters are estimated with machine learning methods, which allows us to draw upon cutting-edge research in statistical modeling, optimization and probabilistic programming. Classical test theory, on the other hand, rests on statistical theory from the late 19th century.
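To make the dependence on item parameters concrete, here is a minimal sketch of how an ability estimate could be computed from a response pattern and known item parameters. It assumes a three-parameter logistic response model, a standard normal prior over ability, and a simple grid approximation of the posterior mean. All of these are simplifying assumptions chosen for illustration; this is not Alva's scoring procedure, and the item values are made up.

```python
import math


def p_correct(theta, a, b, c):
    """3PL probability of a correct answer (assumed response model)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def estimate_ability(responses, items, grid_size=161):
    """Posterior mean of ability on a grid from -4 to 4.

    responses: list of 1 (correct) or 0 (incorrect), one per item
    items: list of (a, b, c) tuples with the item parameters
    """
    grid = [-4.0 + 8.0 * i / (grid_size - 1) for i in range(grid_size)]
    weights = []
    for theta in grid:
        # Unnormalised standard normal prior (an assumption of this sketch).
        w = math.exp(-0.5 * theta * theta)
        # Multiply in the likelihood of each observed response.
        for x, (a, b, c) in zip(responses, items):
            p = p_correct(theta, a, b, c)
            w *= p if x == 1 else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total


# Hypothetical item parameters: (discrimination, difficulty, guessing).
items = [(1.0, -1.0, 0.2), (1.3, 0.0, 0.2), (0.9, 1.0, 0.2)]

print(estimate_ability([1, 1, 1], items))  # all correct: estimate above 0
print(estimate_ability([1, 0, 0], items))  # only the easiest item correct
print(estimate_ability([0, 0, 0], items))  # all incorrect: estimate below 0
```

No norm group mean or standard deviation appears anywhere in this calculation. The result is determined by the responses, the item parameters, and the assumed prior.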

At Alva, we use Bayesian inference in PyMC3 (Salvatier, Wiecki & Fonnesbeck, 2016) to construct scales that match the population of working adults. Our models are inspired by those described by Luo and Jiao (2018).

In summary, results from Alva's tests should be interpreted with the population of working adults in mind. We are determined to provide results that are valid and well-calibrated for this purpose.

Read more about the development of the logic test here and the personality test here.


Luo, Y. & Jiao, H. (2018). Using the Stan Program for Bayesian Item Response Theory. Educational and Psychological Measurement, 78(3), 384-408. DOI:10.1177/0013164417693666 

Salvatier, J., Wiecki, T. V. & Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2, e55. DOI:10.7717/peerj-cs.55
