That other can o’ worms

Last time, we had a riveting discussion about reliability and mentioned that validity is also something to consider when talking about tests/assessments/scales and the scores that come from them. So let’s do that.

Previously, we used

Score = truth + error

to talk about measurement and the idea of reliability. That is, the more garbage (error) there is in your scores, the less reliable they will be. If your scores are full of noise, then your boat is already sunk: without reliability, you can’t have validity. But let’s assume that we’re able to get reliable scores and forge ahead to talk about validity. We can expand on the above equation by further breaking down the “error” part. Specifically,

Error = random error + systematic error

and substitution leads to

Score = truth + random error + systematic error

While I wasn’t specific earlier, the “noise” is random error, which is what degrades reliability. Systematic error is the new addition here, and it’s what relates to validity.
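To make the distinction concrete, here’s a toy simulation of the equation above (all the numbers are invented for illustration): two administrations of the same test share the truth and the same constant systematic error, but each gets fresh random error.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

truth = rng.normal(50, 10, n)   # true scores: mean 50, SD 10
bias = 8.0                      # systematic error: the same shift for everyone

# Two parallel administrations: same truth and bias, fresh random error each time
form_a = truth + rng.normal(0, 5, n) + bias
form_b = truth + rng.normal(0, 5, n) + bias

reliability = np.corrcoef(form_a, form_b)[0, 1]   # dragged down only by random error
mean_shift = form_a.mean() - truth.mean()         # reveals the systematic error

print(f"parallel-forms correlation: {reliability:.2f}")  # ~0.80 = 10**2 / (10**2 + 5**2)
print(f"average shift from truth:   {mean_shift:.1f}")   # ~8.0
```

The point: random error hurts the correlation between two tries (reliability), while systematic error leaves that correlation untouched and instead shifts every score by the same amount – the test reliably measures the wrong thing.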

Validity is a rather broad term that relates to the idea of ensuring you’re measuring the right thing; specifically, that you’ve eliminated sources of systematic (predictable) error from your scores. This can mean things like math scores being influenced by reading level on story problems, the construct of anxiety seeping into your depression measure, or a scale (intended for weight) actually measuring body temperature.

As noted, validity is a rather broad concept, so let’s break it down a little bit. Content validity refers to the idea that everything that should be included is, and the things that shouldn’t be aren’t. This type of validity also includes face validity, which can be thought of as the smell-test of validity. That is, reading the items, does this anxiety scale look like it measures anxiety? Construct validity is the extent to which the scores from a scale represent the construct you’re attempting to measure. This includes subtypes of validity such as convergent and discriminant validity: say, scores from a math test correlate with scores from other math tests (convergent), but correlate at a much lower level with scores from an English test (discriminant). Criterion validity is the idea that our measure is useful for explaining or predicting other variables of interest. Criterion validity is the sub-type I’m primarily soap-boxing about here.
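A quick simulated sketch of convergent vs. discriminant validity (the abilities and correlations here are made up for illustration): two math tests tapping the same underlying ability should correlate strongly with each other, and more weakly with an English test tapping a different, only modestly related ability.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

math_ability = rng.normal(0, 1, n)
# Assume verbal ability overlaps modestly with math ability (r ~ 0.4)
verbal_ability = 0.4 * math_ability + np.sqrt(1 - 0.4**2) * rng.normal(0, 1, n)

def administer(ability):
    """Observed score = ability + random measurement error."""
    return ability + rng.normal(0, 0.6, n)

math_test_1 = administer(math_ability)
math_test_2 = administer(math_ability)
english_test = administer(verbal_ability)

convergent = np.corrcoef(math_test_1, math_test_2)[0, 1]     # same construct: high
discriminant = np.corrcoef(math_test_1, english_test)[0, 1]  # different construct: lower

print(f"math vs. math:    {convergent:.2f}")
print(f"math vs. English: {discriminant:.2f}")
```

If the “discriminant” correlation came out just as high as the “convergent” one, you’d worry the math test is really measuring something broader – like reading level on story problems.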
Criterion validity can be summed up in the question, “Do these scores do what they should?” You made your own SAT-like test, but can you show that the scores actually predict college success (graduating on time, GPA, etc.)? This is not unrelated to the FDA’s demand for proof that a scale is “fit for purpose.” A drug company creates a disease-specific quality-of-life scale. Neato, but can the scores be used to predict symptom improvement or vice versa?
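As a sketch (hypothetical numbers, not real SAT or GPA data), criterion validity is the kind of thing you’d check with a simple correlation or regression of scores against the outcome they’re supposed to predict:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2_000

# Hypothetical scores from our homemade SAT-like test
test_score = rng.normal(500, 100, n)
# Pretend college GPA is partly driven by whatever the test measures
# (toy data: values aren't clipped to the usual 0-4 range)
gpa = 2.0 + 0.002 * test_score + rng.normal(0, 0.4, n)

criterion_r = np.corrcoef(test_score, gpa)[0, 1]
slope, intercept = np.polyfit(test_score, gpa, 1)

print(f"score-GPA correlation: {criterion_r:.2f}")
print(f"predicted GPA gain per 100 points: {100 * slope:.2f}")
```

In the same spirit, “fit for purpose” evidence for a quality-of-life scale would swap in symptom change as the criterion.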

There are a ton of scales, metrics, measures, and assessments that have been published. In such articles, it’s standard operating procedure to report Cronbach’s alpha and maybe show factor analysis model fit values. Alpha serves to establish reliability, the pre-req for validity (although see the previous post), and the model fit provides some evidence for construct validity. Where validity evidence can be lacking in initial publications is in the realm of the actual usefulness of the scale – that is, the criterion validity. That’s not to say that researchers have to solve all the world’s problems in a single article, but when you’re considering some fancy new scale for use in your research, demand something beyond just a high alpha value (again, see our previous post) and a non-stinky smell-test. Look for studies where the scores from the scale predicted an outcome. Look for follow-up studies that include appropriate control variables and still find meaningful effects.

My colleague James McGinley (HI JIM!) has worked hard on developing a set of metrics that predict NFL performance for draft-eligible players, which you can read about over on the VPG sports blog. In addition to passing the smell-test of “Do good players get high scores?” he’s reporting reliability values for some of the groupings he’s developed (which he labels “Confidence”) and he’s also doing analyses to show that the scores he’s come up with predict useful stats of current NFL players, like career length, Pro Bowl appearances, or accumulated yards for running backs. If you’re evaluating scales or metrics, this is the kind of information that you need to determine if the scores you get from a scale are valid.

If you’re creating your own scale or metrics, then you need to establish that, in addition to looking like it does something (content/face validity), your scale actually DOES do something useful. In short, if you’re developing a scale, be like Jim.