IRTPRO™

IRTPRO™ is an advanced application for item calibration and test scoring using item response theory (IRT).  It comes with an intuitive graphical user interface and offers built-in production quality IRT graphics.  Suitable for educators, students, researchers, and assessment organizations, IRTPRO™ has enjoyed increasingly wide usage in the educational, psychological, social, and health sciences.

The IRT models for which IRTPRO™ implements item calibration and scoring are based on unidimensional and multidimensional [confirmatory factor analysis (CFA) or exploratory factor analysis (EFA)] versions of the following widely used response functions:

  • Two-parameter logistic (2PL) (Birnbaum, 1968) [which, with equality constraints, includes the one-parameter logistic (1PL) (Thissen, 1982)]
  • Three-parameter logistic (3PL) (Birnbaum, 1968)
  • Graded (Samejima, 1969, 1997)
  • Generalized Partial Credit (Muraki, 1992, 1997)
  • Nominal (Bock, 1972, 1997; Thissen, Cai, & Bock, 2010)
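
As a rough illustration of the first three response functions listed above, the following Python sketch evaluates them at a single value of the latent variable. It assumes the common slope/threshold parameterization (slope a, threshold b, lower asymptote g); the function and parameter names are illustrative only and are not IRTPRO™ syntax or output.

```python
import math

def p_2pl(theta, a, b):
    """2PL: probability of a correct/positive response.
    Assumed parameterization: logit = a * (theta - b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def p_3pl(theta, a, b, g):
    """3PL: a 2PL curve with lower asymptote g (the 'guessing' parameter)."""
    return g + (1.0 - g) * p_2pl(theta, a, b)

def graded_probs(theta, a, bs):
    """Graded model: category probabilities as differences of adjacent
    cumulative 2PL curves. bs must be in increasing order."""
    cum = [1.0] + [p_2pl(theta, a, b) for b in bs] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(bs) + 1)]

if __name__ == "__main__":
    print(p_2pl(0.0, 1.0, 0.0))
    print(p_3pl(0.0, 1.0, 0.0, 0.2))
    print(graded_probs(0.0, 1.5, [-1.0, 0.0, 1.0]))
```

Constraining the slope a to be equal across items in p_2pl gives the 1PL case mentioned in the first bullet.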

These item response models may be mixed in any combination within a test or scale, and the user may optionally impose equality constraints among parameters or fix parameters at specified values. IRTPRO™ currently supports IRT models for single-level multivariate data sets.  If capabilities for handling multi-level data sets are needed, please consider flexMIRT®.

IRTPRO™ implements the method of Maximum Likelihood (ML) for item parameter estimation (item calibration), or it computes Maximum a posteriori (MAP) estimates if (optional) prior distributions are specified for the item parameters.

Several alternative computational methods may be used for estimation, each of which performs best for some combinations of dimensionality and model structure:

  • Bock-Aitkin (BAEM) (Bock & Aitkin, 1981)
  • Bifactor EM (Gibbons & Hedeker, 1992; Gibbons et al., 2007; Cai, Yang, & Hansen, 2011)
  • Generalized Dimension Reduction EM (Cai, 2010-a)
  • Adaptive Quadrature (ADQEM) (Schilling & Bock, 2005)
  • Metropolis-Hastings Robbins-Monro (MHRM) (Cai, 2010-b, 2010-c)
  • Markov Chain Monte Carlo (MCMC) (Patz & Junker, 1999-a, 1999-b)

The computation of IRT scale scores in IRTPRO™ may be done using any of the following methods:

  • Maximum a posteriori (MAP) for response patterns
  • Expected a posteriori (EAP) for response patterns (Bock & Mislevy, 1982)
  • Expected a posteriori (EAP) for summed scores (Thissen & Orlando, 2001; Thissen, Nelson, Rosa, & McLeod, 2001)
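
To make the response-pattern EAP idea concrete, here is a minimal Python sketch that computes the posterior mean and standard deviation of the latent variable on a fixed grid. It assumes dichotomous 2PL items, a standard normal population distribution, and simple equally spaced quadrature points; the function names, the grid, and the item parameters are all hypothetical and do not reproduce IRTPRO™'s internal computations.

```python
import math

def p_2pl(theta, a, b):
    """2PL response probability (assumed slope/threshold form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def eap_score(responses, items, n_points=61, lo=-6.0, hi=6.0):
    """EAP estimate and posterior SD for one response pattern.

    responses: list of 0/1 item responses
    items:     list of (a, b) pairs, one per item (hypothetical values)
    """
    step = (hi - lo) / (n_points - 1)
    nodes = [lo + i * step for i in range(n_points)]
    weights = []
    for t in nodes:
        w = math.exp(-0.5 * t * t)  # unnormalized N(0, 1) prior density
        for u, (a, b) in zip(responses, items):
            p = p_2pl(t, a, b)
            w *= p if u == 1 else 1.0 - p  # likelihood of this response
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(nodes, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(nodes, weights)) / total
    return mean, math.sqrt(var)

if __name__ == "__main__":
    items = [(1.2, -0.5), (1.0, 0.0), (1.5, 0.5)]
    print(eap_score([1, 1, 0], items))
```

A MAP score would instead take the grid point (or, more usually, the optimizer solution) that maximizes the same posterior weights rather than averaging over them.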

Data structures in IRTPRO™ may categorize the item respondents into groups, and the population latent variable means and variance-covariance matrices may be estimated for multiple groups (Mislevy, 1984, 1985). [Most often, if there is only one group, the population latent variable mean(s) and variance(s) are fixed (usually at 0 and 1) to specify the scale; for multiple groups, one group is usually denoted the “reference group” with standardized latent values.]

To detect differential item functioning (DIF), IRTPRO™ uses Wald tests, modeled after a proposal by Lord (1977), but with accurate item parameter error variance-covariance matrices computed using the Supplemented EM (SEM) algorithm (Cai, 2008).
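
The logic of a Wald test for DIF can be sketched for the simplest case: one item parameter compared across two groups. The Python example below is a simplified illustration only. It assumes independent group estimates with known standard errors and tests a single parameter with 1 degree of freedom, whereas the tests described above use full error variance-covariance matrices and may test several parameters jointly. All names and numbers are hypothetical.

```python
import math

def wald_dif_1df(est_ref, est_focal, se_ref, se_focal):
    """Single-parameter Wald test comparing two independent estimates.

    Returns (chi_square, p_value) with 1 degree of freedom.
    """
    diff = est_focal - est_ref
    var_diff = se_ref ** 2 + se_focal ** 2  # assumes independent groups
    chi_sq = diff * diff / var_diff
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi_sq / 2.0))
    return chi_sq, p_value

if __name__ == "__main__":
    # Hypothetical threshold estimates for a reference and a focal group
    print(wald_dif_1df(est_ref=1.0, est_focal=1.5, se_ref=0.1, se_focal=0.1))
```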

Depending on the number of items, response categories, and respondents, IRTPRO™ reports several varieties of goodness of fit and diagnostic statistics after item calibration. The values of –2 log likelihood, the Akaike Information Criterion (AIC) (Akaike, 1974), and the Bayesian Information Criterion (BIC) (Schwarz, 1978) are always reported. If the sample size sufficiently exceeds the number of cells in the complete cross-classification of the respondents based on item response patterns, the overall likelihood ratio test against the general multinomial alternative is reported. For some models, the M2 statistic (Maydeu-Olivares & Joe, 2005, 2006; Cai, Maydeu-Olivares, Coffman, & Thissen, 2006) is also computed. Diagnostic statistics include generalizations for polytomous responses of the local dependence (LD) statistic described by Chen & Thissen (1997) and the S-X2 item-fit statistic suggested by Orlando & Thissen (2000, 2003).
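
For reference, the two information criteria are simple functions of the –2 log likelihood. The short Python sketch below uses the standard textbook definitions (AIC adds 2 per free parameter; BIC adds the natural log of the sample size per free parameter); the function names and the example numbers are hypothetical and are not taken from IRTPRO™ output.

```python
import math

def aic(neg2_loglik, n_params):
    """Akaike Information Criterion: -2 log L + 2k."""
    return neg2_loglik + 2.0 * n_params

def bic(neg2_loglik, n_params, n_respondents):
    """Bayesian Information Criterion: -2 log L + k * ln(N)."""
    return neg2_loglik + n_params * math.log(n_respondents)

if __name__ == "__main__":
    # Hypothetical calibration: 20 free parameters, 500 respondents
    print(aic(10250.0, 20))
    print(bic(10250.0, 20, 500))
```

Lower values indicate a better trade-off between fit and model complexity when comparing models fitted to the same data.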