Apr 15

The Slippery Slope of Self-Report Outcome Tests

Among the varieties of tests available for psychological diagnosis, treatment planning, and evaluation, changes in government, agency, and organizational policies favor the use of true-false, multiple-choice, and rating-scale “outcomes” tests. The shift toward forced-choice, self-report testing appears to have the blessing of research academicians, HMOs, and federal agencies. Particularly favored are recent renditions of “Outcome Measurement Systems (OMS),” which dovetail with HMO, agency, and government evaluation of healthcare. Guidelines for clinical evaluation of outcomes provided by the American Psychological Association (APA) support OMS (Practice Directorate, 2011).

The value of narrow-range, forced-choice testing to these interests is that such tests address purported areas of distress and “maladaptation.” The substitution of brief, self-report “outcome” measures for detailed individual evaluation by an independent professional appears cost-efficient: the measures are quick, easy to administer, and do not involve rater judgment.

The adoption of forced-choice tests, with their pre-defined scales, commits the profession to a world view, a philosophy of human nature and the nature of illness that is narrow, one-dimensional, removed from what it presumes to measure, and that belies the complexity of individual differences. Forced-choice, self-report tests place an individual into a Procrustean bed where no meaning can be attributed to individuality except deviation from a norm. Forced-choice tests are closed upon themselves, with the range of options predetermined. The test does not fit the person; the person is fit to the test.

Outcome Measurement Systems (OMS)

The poster child for psychological Outcome Measurement Systems is the Patient-Reported Outcomes Measurement Information System (PROMIS). According to the PROMIS website, PROMIS, “funded by the National Institutes of Health (NIH), is a system of highly reliable, valid, flexible, precise, and responsive assessment tools that measure patient-reported health status” (2012a).

PROMIS assessment tools largely consist of “item banks”. An “item” is a statement about some area of physical or mental function that may be rated by a participant. A PROMIS document explains:

“These calibrated item banks can be used to derive short forms (typically requiring 4-10 items per concept), or computerized adaptive testing (CAT; typically requiring 4-7 items per concept for more precise measurement). Assessments are available for children and adults by self-report” (2012b, p. 1).

An example of a PROMIS scale is the “Adult Satisfaction with Participation in Social Roles Short Form,” composed of four items (2011, p. 53). Sample items, endorsed on a five-point scale from “not at all” to “very much,” are:

  • I am satisfied with how much work I can do (include work at home).
  • I am satisfied with my ability to do work (include work at home).

These two items comprise 50% of the Short Form.

As argued below, the PROMIS website’s claim that its item banks are “highly reliable, valid, flexible, precise, and responsive” appears highly overstated.

Self-report Outcome Tests’ Poor Validity

Self-report tests are notoriously unreliable, prone to the effects of social desirability and falsification. Even among the well-intentioned, self-evaluation may be strikingly inaccurate. Routinely, respondents see themselves as more intelligent, more “average,” more skilled, or less attractive than objective measures would indicate. Self-report simply does not pass muster, for children or adults, as a standard for personality assessment.

A large-scale review of the literature on self-report items conducted by Dunning, Heath, and Suls (2004) provides a benchmark for the validity of these measures. In studies where self-report and objective assessments of ability, skill, knowledge, leadership, personal qualities, and interpersonal relationships were reported, average correlations ranged from .10 to .30. While a researcher examining general factors might be satisfied with a correlation of .30, such a correlation accounts for just 9 percent of the variance. In other words, 91 percent of individual differences remain unexplained by self-report. Individuals are terrible judges of themselves; the review concluded that strangers are better judges.
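The variance arithmetic behind that claim can be checked in a few lines (a minimal sketch; the .10 to .30 range is the one reported in the review above):

```python
# The proportion of variance explained by a correlation r is r squared.
for r in (0.10, 0.30):
    explained = r ** 2
    print(f"r = {r:.2f}: {explained:.0%} of variance explained, "
          f"{1 - explained:.0%} unexplained")
# → r = 0.10: 1% of variance explained, 99% unexplained
# → r = 0.30: 9% of variance explained, 91% unexplained
```

Even at the top of the reported range, nine-tenths of individual variation is untouched by the self-report score.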

The Casual Conversation Standard

If a test yields no more information than a casual conversation, then the test is about as good as a casual conversation. But what if a test yields less than a casual conversation? Consider the PROMIS item cited earlier, “I am satisfied with how much work I can do (include work at home),” restated as a conversational question: “Are you satisfied with how much work you can do (include work at home)?” Answers will vary, but respondents likely distinguish work from home and differentiate other variables as well. When I put this question to a real person, she responded, “At home, I forget about [my] work with the kids, husband, the house, the shopping, and everything else. At work, from the minute I get to work until the minute I leave, I never really get ahead of the curve because the work to do always exceeds my available time. I am satisfied with what’s creative, but abhor the wheel-spinning meetings…Even so, I can’t think of work I would enjoy more….” Perhaps by some decision rule a rater might “split the difference” and assign a value of “not at all” or “sometimes” or “moderately satisfied,” but the point is that a great deal of information has been lost, and moreover, none of the scale choices adequately characterizes her “Adult Satisfaction with Participation in Social Roles” at home or at work. The test format itself provokes pseudo-estimation.

Objective Confusion

A frequent confusion among psychology graduate students is to label all paper-and-pencil tests “objective”. Achievement tests, writing and reading tests, recognition tests, intelligence tests, and the like are “objective” tests because objective criteria exist by which answers may be calibrated. Self-report tests of the types reviewed are not “objective”; they are subjective.

Paper-and-pencil tests, whether objective or self-report, are described by the same statistical features, such as “standard deviation,” “split-half reliability,” “test-retest reliability,” and so on; but the fact that objective and self-report tests are described by the same statistical terms does not make them the same. As in computer processing, the rule “garbage in, garbage out” applies no matter how sophisticated the statistical program.
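The “garbage in, garbage out” point can be illustrated with a small simulation (a sketch under assumed weights, not a model of any actual instrument): give each simulated respondent a stable response bias, such as social desirability, that dwarfs the trait the test claims to measure. The familiar reliability statistic comes out excellent even though the scores barely track the trait.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

trait = rng.normal(size=n)  # the quality the test purports to measure
bias = rng.normal(size=n)   # stable per-person response style (e.g., social desirability)

def administer():
    # Each administration: weak trait signal, strong stable bias, some noise.
    # The 0.2 / 1.0 / 0.3 weights are illustrative assumptions.
    return 0.2 * trait + 1.0 * bias + rng.normal(scale=0.3, size=n)

time1, time2 = administer(), administer()

retest_r = np.corrcoef(time1, time2)[0, 1]    # test-retest "reliability"
validity_r = np.corrcoef(time1, trait)[0, 1]  # how well scores track the trait

print(f"test-retest reliability:    r = {retest_r:.2f}")
print(f"validity (score vs. trait): r = {validity_r:.2f}")
```

With these assumed weights, test-retest reliability lands around .9 while the validity coefficient sits near .2, squarely in the .10 to .30 range reported earlier: impeccable statistics describing a test that mostly measures response style.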

Self-report Outcome Tests 

Self-report, true-false, multiple-choice, and rating-scale outcome tests do not access uniqueness, cohort or cluster group differences, or register social or cultural change. The tests are composed of fixed items derived from theory, features of “mental disorders” identified in sources such as the Diagnostic and Statistical Manual, or empirical bootstrapping.

Forced-choice, self-report scores reflect deviations from a norm; therefore, individuality is deviation. Tests constructed in this manner do not admit unique perspectives, experiences, imagery, or thought processes. They impose an artificial or contrived framework that does not register alternative constructions. Self-report outcome tests caricature respondent individuality and cultural difference. In a sense, nothing new can be discovered by use of such tests, except that persons or groups differ in terms of the imposed framework.

Self-report Tests Locked in Time

Tests of the sort described are locked in time; they are static measures. Items and response choices reflect the vernacular and issues of the time in which they were written. Schoolboys know that times change, and change rapidly. Without frequent revision, self-report tests are out of touch or slow to adapt to changes in language and society.

These static tests fail to represent cultural differences because they are formulated to “measure” the distribution of responses to an item, not to engage the individual person or his context. To attain a facsimile of “culture fairness,” individuals are selected as “representative” of different races or groups, and their responses are folded into the overall norms. Individual and group differences disappear into an average.

Teach the Test

Adoption of these tests as measures of change due to psychotherapy favors approaches to intervention that “teach the test”. For example, a PROMIS patient-reported outcomes scale for depression requires respondents to rate statements on a five-point scale from “never” to “always”. “Exemplary” items are:

  • “I felt worthless.”
  • “I felt I had nothing to look forward to” (PROMIS, 2012, p. 43).

In “teach-the-test” therapies, these “negative” thoughts are disputed, and less negative evaluations are suggested to patients. It would be surprising indeed if such “therapies” did not eventuate in rating changes. However, just as teaching the test in education yields a short-term bump in scores that subsequently washes out, psychological research suggests that short-term, “empirically validated” psychotherapies provide a bump immediately following treatment with little continuing improvement over the long term.

In addition to the problem of easy manipulability, self-report tests possess poor validity, as reviewed earlier in this paper. The narrow focus of such outcome measures imposes conceptual strictures upon diagnosis, case formulation, and treatment. The increasing promotion of these tests by government and agencies, and their inclusion by the American Psychological Association (Practice Directorate, 2011), hardly further the reputation or practice of clinical psychology.

Comparison to Performance Tests

If paper-and-pencil self-report tests fare so poorly in personality assessment, do traditional performance tests do better? Historically, the traditional personality tests, the Rorschach and Thematic Apperception Test (TAT), have been plagued by variable standards of administration, codification, and ad hoc interpretation. Yet for all the technical issues that have rendered results from these tests problematic, the Rorschach and TAT are not forced-choice, self-report tests.

Because a respondent is not bound to a set of fixed choices, she has greater freedom. A Rorschach inkblot may elicit any percept; the TAT, any outcome; and the Music Apperception Test (MAT), any story voice, set of characters, or narrative. The items are not limited to true or false, multiple choice, and rating scales.

Performance tests permit unique responses that reveal individuality as well as cohort, cluster, and cultural differences. Respondents are not forced to adhere to an imposed language to characterize their perspective; they can report freely. They are not asked to self-report, with all the conundrums that entails.

Respondents are observed and their words and behavior recorded by an examiner. Cohort, cluster, and cultural differences are rendered visible. Within the scope of the test, an individual’s rich verbal and behavioral response is available. Information unique to person, cohort, and cluster group may be gained and new theoretical and dynamic perspectives formed.


Dunning, D., Heath, C., & Suls, J. M. (2004). Flawed self-assessment: Implications for health, education, and the workplace. Psychological Science in the Public Interest, 5, 69-106.

Lilienfeld, S. O. (2012). Public skepticism of psychology. American Psychologist, 67(2), 111-129.

National Institutes of Health. (2011). PROMIS scoring guide. Washington, DC: Author.

National Institutes of Health. (2012b). PROMIS instruments available for use. Washington, DC: Author.

Practice Directorate. (2011). Practice OUTCOMES Database Overview (Ver 2.0). Washington, DC: American Psychological Association.

PROMIS. (2012a). PROMIS. Retrieved from http://www.nihpromis.org/

Contributed by Leland van den Daele