Let’s talk about DIF, baby!

Let’s talk about you and me (and our respective defining group memberships), let’s talk about all the good things (obtaining accurate scores) and the bad things (model misspecification) that may be. Let’s talk about DIF.

Differential item functioning (DIF) is the psychometric jargon term for when items perform differently depending on characteristics of the person answering the item. Analyses for detecting DIF should be part of any initial calibration study of a COA, PRO, or test, and are readily implemented in psychometric software such as flexMIRT®. In more general statistical terms, DIF is part of the larger idea of measurement invariance, which includes cross-sectional DIF among groups, differences in group characteristics (means, variances), group or time differences in the factorial structure, and changes in how items relate to latent variables over time (longitudinal variance/invariance), among other things.

While DIF can exist for any grouping variable, the most commonly investigated are demographic characteristics, such as race, age group, gender, SES, etc. A commonly cited example of DIF in practice is crying items on depression scales – such items are often found to perform differently (have different statistical properties) between males and females, as in the trace lines below, which were constructed using item parameter values for the specific item “I felt like crying” presented by Teresi et al. (2009).

[Trace line plots for “I felt like crying”: Females vs. Males]
The depression items in this study were presented with 5 response options (“Never” to “Always” as verbal labels), but for statistical reasons the top three responses (“Sometimes”, “Often”, and “Always”) were collapsed into a single category, leaving 3 response categories for analysis. A comparison of the two trace line plots above finds that while the general shape is the same, the Male plot is shifted to the right relative to the Female plot (i.e., gender DIF).
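To make the "shifted to the right" idea concrete, here is a minimal sketch of graded response model trace lines for a 3-category item in two groups. The parameter values are made up for illustration – they are not the Teresi et al. (2009) estimates – but the DIF pattern is the same: higher thresholds for males shift their trace lines to the right.

```python
import numpy as np

def grm_trace_lines(theta, a, b):
    """Category response curves for one graded response model item.

    theta : array of latent trait (depression severity) values
    a     : discrimination (slope) parameter
    b     : ordered list of k-1 thresholds for k response categories
    Returns an array of shape (k, len(theta)) of category probabilities.
    """
    # Cumulative (boundary) curves: P(response >= category j)
    cum = [np.ones_like(theta)]
    for bj in b:
        cum.append(1.0 / (1.0 + np.exp(-a * (theta - bj))))
    cum.append(np.zeros_like(theta))
    # Each category's trace line is the difference of adjacent boundaries
    return np.array([cum[j] - cum[j + 1] for j in range(len(b) + 1)])

theta = np.linspace(-4, 4, 9)
# Hypothetical parameters for illustration only; male thresholds are
# higher, producing the rightward shift seen in the plots (gender DIF).
female = grm_trace_lines(theta, a=2.0, b=[0.5, 1.5])
male   = grm_trace_lines(theta, a=2.0, b=[1.2, 2.2])
```

At any given depression level, the "Never" curve sits higher for males than females – a male needs a higher trait level before moving out of the lowest response category.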

The statistical interpretation of these plots is that for males, agreeing that they have felt like crying indicates a higher level of depression than the same response from a female. Without going in-depth into the social constructions of gender roles, etc., a reasonable substantive interpretation could be that, given American masculinity norms against being emotionally sensitive (e.g., crying), for a man to admit to crying takes a much higher level of depression than the same admission from a woman.

As another example, take an item about biting others such as can be found on a childhood behavior checklist. In a toddler, biting others, while not desirable behavior, is generally not considered a giant red flag. Now consider that same behavior when seen in a teenager or adult (i.e., Age DIF) – totally whack-a-doo, to use the proper clinical term.

The problem with DIF, if left unaccounted for, is that you end up with incorrect scores. Using the above depression item example, a calibration could have been conducted by lumping males and females together and estimating a single set of item parameters – those estimates would land somewhere in between the male and female estimates and would be incorrect for either group. By extension, using those “lumped together” item parameter estimates to score individuals will result in inaccurate scores. The more extreme the DIF in an item and the more items exhibiting DIF, the more pronounced and problematic the inaccuracy of the scores becomes.
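A toy calculation shows the "somewhere in between" problem. Using a simple 2PL item for clarity (and hypothetical parameter values – a real lumped calibration would not land exactly at the average, but it would land between the groups), the pooled difficulty fits neither group:

```python
import math

def p_endorse(theta, a, b):
    """2PL probability of endorsing an item at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical values: the item is "harder" for males (DIF)
a = 2.0
b_female, b_male = 0.5, 1.5
b_pooled = (b_female + b_male) / 2   # roughly where a lumped calibration lands

theta = 1.0
print(p_endorse(theta, a, b_female))  # ~0.73: what females actually do
print(p_endorse(theta, a, b_male))    # ~0.27: what males actually do
print(p_endorse(theta, a, b_pooled))  # 0.50: fits neither group
```

Scores computed from the pooled parameters inherit this mismatch: females look less depressed than they are, males more so.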

When one encounters DIF in practice, there are several possible courses of action, including: run screaming for the hills, drop/replace the items, or use different parameters for each group when scoring observations. Dropping the item(s) is the simplest solution, but it assumes enough items are available that the researcher can discard all DIF-exhibiting items without falling below an acceptable level of reliability or content validity.

The use of different item parameters for the groups requires additional work. If there are multiple sources of DIF (age, gender, race, SES), the researcher must ensure that sufficient observations are obtained in the initial calibration/DIF study to obtain stable and precise item parameter values for each group – and for subgroups within groups if DIF is found in an interaction (e.g., gender * race). To correctly score individuals, the data file(s) will need to incorporate all the relevant group memberships in some way, in addition to containing the item responses, and the scoring software will need access to group-specific item parameters (meaning the item bank with parameters becomes more complex). While group-specific scoring may be a simple change in pre-programmed scoring software, such as Adaptest®, with the added level of complexity that finding DIF introduces, it is easy to see how running for the hills can become an attractive “solution” to DIF. That is, if you truly want accurate scores for your patients/respondents in the face of DIF, then the extra work is unavoidable.
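The "more complex item bank" idea can be sketched as parameters keyed by item and group, with a fallback to shared parameters for items without DIF. The item names, group labels, and parameter values here are all hypothetical:

```python
# A sketch of a group-aware item bank: parameters keyed by (item, group).
# All item names and parameter values are made up for illustration.
item_bank = {
    ("felt_like_crying", "female"): {"a": 2.0, "b": [0.5, 1.5]},  # gender DIF
    ("felt_like_crying", "male"):   {"a": 2.0, "b": [1.2, 2.2]},
    ("sad_mood", "any"):            {"a": 1.8, "b": [0.0, 1.0]},  # no DIF found
}

def get_params(item, group):
    """Return group-specific parameters if DIF was found, else shared ones."""
    return item_bank.get((item, group), item_bank.get((item, "any")))

# Scoring a male respondent pulls the male-specific crying parameters,
# but the shared parameters for the DIF-free item.
get_params("felt_like_crying", "male")   # group-specific
get_params("sad_mood", "male")           # falls back to shared "any" entry
```

This is why the data file must carry group memberships alongside the responses: without the group label, the scorer cannot select the right row of the bank.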