Computerized adaptive testing (CAT¹) refers to the procedures underlying a test/assessment/whatevs that is generated “on the fly” by a computer, using a person’s previous answers to determine which question the person will see next. My amazing, trademark-infringing artwork notwithstanding, CAT is not magic – it is, in fact, SCIENCE!
A reader-friendly conceptual overview of CAT was recently published here: https://www.ispor.org/ValueOutcomesSpotlight (paywall warning!). I’ll wait while you go read that…
Finished? Great!
OK – so now we know that CAT is based on some pretty straightforward ideas which are implemented with fancy statistics. Mirroring an example in the paper (for those too cool or not ISPOR-cool enough to read it), if someone answers “Strongly Agree” to “I have thought about ending my life,” it doesn’t make much sense (or give us new information) to then ask them “I felt blue.” While this item selection method is logical (and statistically sound), skeptics might think it problematic to give different people different items. Well, the skeptics are WRONG. Harnessing the power of SCIENCE!, CAT produces scores that are all directly comparable to one another.
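To make the item-selection idea concrete, here’s a minimal sketch of how a CAT engine might pick the next question. The item names and parameter values below are made up for illustration; a real item bank would have calibrated parameters (this example assumes a two-parameter logistic IRT model, one common choice):

```python
import math

# Hypothetical item bank: each item has a discrimination (a) and
# difficulty (b) parameter from a 2PL IRT model (illustrative values only).
ITEM_BANK = {
    "thought about ending my life": (2.0, 2.5),   # severe symptom
    "felt hopeless":                (1.8, 1.0),   # moderate symptom
    "felt blue":                    (1.5, -0.5),  # mild symptom
}

def prob_endorse(theta, a, b):
    """2PL probability of endorsing an item at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information the item provides at theta: a^2 * P * (1 - P)."""
    p = prob_endorse(theta, a, b)
    return a * a * p * (1.0 - p)

def next_item(theta_estimate, answered):
    """Pick the unanswered item that is most informative at the
    respondent's current estimated trait level."""
    candidates = {name: params for name, params in ITEM_BANK.items()
                  if name not in answered}
    return max(candidates,
               key=lambda name: item_information(theta_estimate, *candidates[name]))

# Someone who strongly endorsed the suicidality item has a high current
# theta estimate, so the mild "felt blue" item contributes little
# information there and the engine asks about hopelessness instead.
chosen = next_item(theta_estimate=2.0, answered={"thought about ending my life"})
```

This is why the “I felt blue” follow-up never gets asked in the example above: at a high estimated trait level, a mild item’s response is nearly a foregone conclusion, so it carries almost no statistical information.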
The technology that makes such fanciness possible is item response theory (IRT). With more traditional scale development methods and scoring schemes (think Cronbach’s alpha and summed scores), changing the items changes the meaning of the test and of the scores that come from it. With IRT, when a scale is first developed, analyses are run to get numerical descriptors of how each item performs/relates to the construct being measured (math ability, depression, etc.). These “item performance” descriptors are then used in calculating IRT-based scores, rather than simply adding up a person’s responses.

The theory underlying this more complex scoring is what lets different respondents see different items (i.e., have a CAT-administered test) yet still get scores that are all on the same metric and carry the same interpretation. Thanks to this “item-swapping” awesomeness, an IRT score from a 20-item version of a scale means the same thing as a score from a 5-item short form of the same scale, which ALSO means the same thing as a score from a CAT administration of that scale. In short: IRT scores are IRT scores. For any set of items that has had the proper initial analyses done, IRT-based scores from those items are directly comparable to each other, regardless of which subset of items was used, how many total items were seen, or whether the data were collected on a paper form or by computer.
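To show the “scores from different item subsets live on the same metric” idea in action, here’s a small sketch of IRT scoring. It assumes a two-parameter logistic model with made-up item parameters, and estimates a person’s trait level (theta) by maximum likelihood over a coarse grid; real software uses fancier estimators, but the point is the same: any subset of calibrated items yields a theta on the same scale.

```python
import math

# Illustrative 2PL item parameters (a = discrimination, b = difficulty);
# in practice these come from calibrating the scale on a large sample.
ITEMS = {
    "item1": (1.2, -1.0),
    "item2": (1.5,  0.0),
    "item3": (1.8,  1.0),
    "item4": (1.0,  0.5),
}

def prob(theta, a, b):
    """2PL probability of endorsing an item at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def score(responses):
    """Maximum-likelihood theta via a coarse grid search.

    `responses` maps item name -> 0/1 (not endorsed / endorsed).
    Any subset of calibrated items works, and the resulting theta is on
    the same metric regardless of which (or how many) items were given.
    """
    grid = [g / 100.0 for g in range(-400, 401)]  # theta from -4 to 4

    def log_lik(theta):
        ll = 0.0
        for name, endorsed in responses.items():
            a, b = ITEMS[name]
            p = prob(theta, a, b)
            ll += math.log(p) if endorsed else math.log(1.0 - p)
        return ll

    return max(grid, key=log_lik)

# Two respondents saw different item subsets (say, a long form and a
# CAT-style short form), yet both thetas are directly comparable:
theta_long  = score({"item1": 1, "item2": 1, "item3": 0, "item4": 1})
theta_short = score({"item2": 1, "item3": 0})
```

The key design point is that the item parameters, not the raw number of endorsed items, drive the score: that is what summed scoring cannot do, and why “20-item score,” “5-item short-form score,” and “CAT score” all mean the same thing here.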
And that’s pretty darn nifty – perhaps even deserving of a Ms. Frizzle-style
1: CAT has nothing to do with Andrés Galarraga (The Big Cat), Leon Lett (also The Big Cat), Andy Katzenmoyer (The Big Kat), Michelle Pfeiffer (awesome Catwoman), Halle Berry (terrible Catwoman), or Caterpillar-brand heavy equipment.