Sorry to disappoint, but this is not a post about Spongebob’s favorite jellyfishing net – it’s about something way more awesome, the psychometric concept of RELIABILITY!

Reliability refers to the idea that you’re measuring whatever it is you’re measuring in a repeatable and precise fashion. For a scale, this would mean that if you step on, get a reading of 150.2 lbs, step off and step back on, the scale should say close to 150.2 again. If you get on and off the scale and it reads 98.6 both times, it’s likely you’re standing on a thermometer (and you’re a weirdo). Whether or not the reported number means what you think it means (that is, is the “scale” measuring your weight or temperature?) is a question of validity, which is a whole different can of worms.

One way to think about measurement is based around the idea that, for each construct/latent variable/concept being measured, there is a “true” score inherent in the person, and then there’s our best guess of that score, made from her answers to questions, which will likely be a little wrong. Something like

Score = truth + error

Reliability is solely concerned with summarizing how much “true” information is provided by a number/score, compared to how much garbage (error/wrongness) there is. Assessing the reliability of tests/assessments/metrics is more complicated than just stepping on and off a scale, but several different psychometric methods are available.
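
The truth-plus-error idea can be sketched in a few lines of simulation. Everything below (the sample size, the means, the standard deviations) is made up for illustration; the point is just that reliability is the share of observed-score variance that is “true” variance.

```python
import numpy as np

# Toy simulation of the classical model: Score = truth + error.
# All numbers here are invented for illustration.
rng = np.random.default_rng(42)
n = 100_000

true_scores = rng.normal(loc=150, scale=10, size=n)  # var(truth) = 100
errors = rng.normal(loc=0, scale=5, size=n)          # var(error) = 25
observed = true_scores + errors                      # Score = truth + error

# Classical reliability: proportion of observed variance that is "true" variance
reliability = true_scores.var() / observed.var()
print(round(reliability, 2))  # approximately 100 / (100 + 25) = 0.8
```

Double the error variance and the reliability drops accordingly: more garbage, less trustworthy scores.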

The most common method for assessing reliability is the Classical Test Theory metric of Cronbach’s alpha. Alpha provides a single numeric value between 0 and 1 (with higher being better) that summarizes the reliability of a set of items. Alpha is often *said* to measure the interrelatedness or internal consistency of a set of items (how well that set of items “sticks together,” in some quasi-mathematical sense), which we’ll get to…

While alpha is the go-to thing to report for scales/tests/assessments in journal articles, to say that alpha is well-understood by the average reader or researcher is bordering on an outright lie. For example, it’s a safe bet that Joe Schmo doesn’t know that in most situations, alpha is only a lower-bound estimate for reliability, and a poorly estimated one at that (e.g., Sijtsma, 2009). For alpha to be equal to the true reliability of a test, every item must be an equally good indicator of the underlying construct being measured (e.g., Green & Yang, 2009), a property rarely seen in the real world. Further, contrary to the internal consistency meaning attributed to alpha, a reported alpha value tells you nothing about the underlying structure of the test (e.g., Schmitt, 1996). That is, a test that truly measures a single underlying dimension (e.g., only math ability) can have a low alpha, and a test that measures multiple constructs (e.g., math and English ability) may have a high alpha value.

In item response theory (IRT), reliability (in the single number summary sense above) isn’t really useful. Contrary to the “all items are equally good” assumption implicit in alpha, IRT explicitly models how well each item measures the latent construct (via the discrimination/slope parameter), in addition to how difficult an item is (via the b/difficulty/severity/threshold parameter(s)). Because the items are not uniform in their performance, the scores that come from those items will also not be uniform in their precision/reliability. In acknowledgement of that, it’s common in unidimensional IRT to provide a plot called the Test Information Function (TIF), which describes how well the test measures people across the whole continuum of possible scores.
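
To make that concrete, here’s a sketch of how a TIF is assembled under a 2PL model: each item contributes information a² · P(θ) · (1 − P(θ)), where P(θ) is the probability of endorsing the item, and the TIF is simply the sum across items. The item parameters below are invented for illustration.

```python
import numpy as np

def item_information(theta, a, b):
    """2PL item information: a^2 * P(theta) * (1 - P(theta))."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))  # probability of endorsing the item
    return a**2 * p * (1.0 - p)

# Made-up (discrimination a, difficulty b) pairs for a tiny 3-item test
item_params = [(1.5, -1.0), (1.0, 0.0), (2.0, 1.0)]

theta = np.linspace(-4, 4, 81)  # grid of latent trait (theta) values
tif = sum(item_information(theta, a, b) for a, b in item_params)
```

Each item is most informative at θ = b, where its information peaks at a²/4; plotting `tif` against `theta` gives the TIF.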

Information is used in an intuitive way: more information means we’re more sure of the scores. However, while more information is better, there’s no agreed-upon rule for how much information is “enough,” nothing akin to the common “alpha = .70 is high enough for research use” recommendation (which is also a can of icky worms to pop open [e.g., Schmitt, 1996]). A useful trick that IRT nerds sometimes use is plugging information values into a formula to get reliability values that regular nerds can more readily interpret. From this, you can get a reliability plot, like below.
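
One commonly used version of that conversion, assuming the latent trait is scaled to have variance 1, is reliability = 1 − 1/information (conventions vary slightly across authors, so treat this as a sketch rather than the one true formula):

```python
def info_to_reliability(information):
    """Convert test information to a reliability value, assuming the
    latent trait is scaled to variance 1 (one common convention: 1 - 1/I)."""
    return 1.0 - 1.0 / information

# Information of 10 corresponds to reliability .90 under this convention;
# the "alpha = .70" rule of thumb corresponds to information of about 3.3.
print(info_to_reliability(10.0))      # 0.9
print(round(1.0 / (1.0 - 0.70), 1))  # 3.3
```

Applying this function pointwise to the TIF is what turns an information plot into a reliability plot.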

So now we can compare this plot to some rule of thumb (like the 0.70 minimum stated earlier) and say something about the suitability of the scores. And because we’re not assuming that reliability is constant across the scale (which alpha implicitly does), this plot provides much more detailed information about the trustworthiness of scores across the full range (e.g., low, moderate, high). In the above plot, the scores given to people ranging from 0 to about 85 are extremely reliable, meaning if someone gets a score of 28, we are quite confident that his “true” score is really close to 28. But around 85, the reliability drops off. This means that we should be less confident in scores at the high end of the plot; that is, a person might get a score of 87, but we’re not positive that his “true” score is 87: it might be 84 or 91. Because the scale provides less information (i.e., is less reliable) at the high end of the IRT score continuum, there’s a wider range of other probable scores associated with the calculated score.

AND (yes, there’s more) if one insists on a single-value summary, IRT is able to accommodate. Taking into account the distribution of the latent variable (see our previous post regarding normality), “marginal reliability” gives the weighted average of the reliability values over all possible scores. To avoid any confusion, “marginal” here is used in the statistical sense (averaged over the latent-trait distribution, as in the margins of a table) and is not an evaluative judgment, as in “marginally acceptable.” This single-number marginal reliability value is obviously less informative than the full TIF/reliability plot, but it satisfies people wanting an alpha-like value.
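
As a sketch of the averaging idea: weight the conditional reliability at each score point by how many people (under an assumed standard-normal latent-trait distribution) sit at that point, then average. The bell-shaped information curve below is invented for illustration; in practice it would come from the estimated TIF.

```python
import numpy as np

def reliability_at(theta):
    """Conditional reliability at theta via rel = 1 - 1/I(theta).
    The information curve here is made up for illustration."""
    info = 10.0 * np.exp(-0.5 * (theta / 2.0) ** 2)
    return 1.0 - 1.0 / info

# Weight each theta by the assumed N(0, 1) latent-trait density
theta = np.linspace(-4, 4, 2001)
density = np.exp(-0.5 * theta**2) / np.sqrt(2.0 * np.pi)
weights = density / density.sum()  # normalize to sum to 1 on the grid
marginal_rel = float(np.sum(weights * reliability_at(theta)))
```

Because score regions where measurement is poor get down-weighted by how few examinees sit there, this single number can look rosier than the conditional plot does at the extremes.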

And thus ends our fascinating discussion of reliability.