    CHALLENGES OF PRODUCTIVITY INSTRUMENTS

    General Comments

    Scale development is one of the most demanding tasks in social science. It requires both painstaking attention to detail and awareness of the "big picture" context of society and culture. Scale development necessarily takes place in a specific culture and society. To be relevant to the people being measured, the items in the scale must be relevant to the culture and society of those people. At the same time, a scale that is too strongly rooted in a particular culture cannot adapt and change along with that culture. Where several cultures overlap, scale developers must work especially hard to avoid bias favoring one culture over another.

    Researchers who do the best work commit many years of effort to developing and validating their instruments. And once their instrument begins to come into wide use, they realize they are trapped on a treadmill of their own making: either they keep working with the scale to maintain its currency, or they allow those years of effort to pass into history. The instrument will remain a benchmark, but a benchmark of a culture that is in the past. To be sure, the active time frame might span 20-30 years, but instruments require constant attention and hard work to remain relevant and current. 

    Further, once researchers have invested many years in developing an instrument, they need to find a way to recover some small portion of the extraordinary investment they have made. But this is also a treadmill. To recover the investment, they have to find more and broader opportunities for using their instrument. These new opportunities, in turn, place new and more complex demands on the instrument. Thus the researcher works just as hard as ever to keep the scale relevant and up to date.

    One way that some researchers try to control the time and energy investment is to make their instruments proprietary. This can be a very valuable short-term solution to the problem of investment. But it also means that an instrument held in secret is not open to the level of critique and feedback that accrues to open-use instruments, i.e. instruments that rely on copyright protection and the honesty of the clients who use the work. The problem for open-use researchers is that unscrupulous researchers do rework and resell open-use instruments, taking advantage of the original investment with no return to the originators or to science in general. (It is ironic that the people who do NOT appreciate the work that must go into developing a good measure are the ones most unwilling to pay for the use of the instrument.) Researchers and others who understand how much work an instrument requires have no objection to the modest fees charged for most instruments.

    From the point of view of science - as opposed to the viewpoint of scientists who have to eat and support their families - open-use instruments provide an important advantage because the work can be exchanged and debated, recast, and redeveloped. In general, it seems likely that open-use instruments will maintain cultural relevance more easily than closed, proprietary instruments. There are certainly exceptions, and the point is not to judge the quality of individual instruments on the basis of whether the instrument is open or closed. The main point of this foreword is that every instrument has a long and detailed history of development. To truly understand what an instrument measures, it is important to understand the context and history of its development and its current status.

    Observations on Productivity Instruments

    The purpose of this section is to begin to pull together observations that may apply to most or all productivity instruments.

    Perhaps most important from the point of view of science, there has been no standardization of the number of participants, their occupations, their health problems (if any), or the severity of the health problems selected for analysis. Comparisons of instruments across populations will still be made. But in the absence of direct comparisons, i.e. results on several instruments from the same sample - literally the same people responding to all the questions on all the instruments under review - it is not possible to offer more than a general comparison of the currently available instruments. To be sure, there is greater confidence in instruments reporting results from many different populations in large numbers, but the "gold standard" test remains a direct comparison using the same participants for all the instruments in question. Such a test is not available.

    Next, how do we determine that an instrument is valid and reliable? What are the standards of science that are used to make such a determination? The short answer is that there is no single, correct answer. There is no sharp dividing line that distinguishes a valid instrument from an invalid one, a reliable instrument from an unreliable one. There are indicators of validity and reliability that denote strengths and weaknesses in instruments. But there can be no unequivocal determination of validity or reliability. Further, even when the indicators of validity and reliability are very good, there is always the outstanding question of similar performance in a new population, at a future time. 

    According to Shadish, Cook and Campbell (2002), a text that summarizes fifty years of work in social science methodology, reliability is primarily a question of the consistency of a measurement tool. The physical sciences hardly ever discuss the reliability of their instruments; for example, who questions a thermometer's reliability? Once the thermometer is calibrated, it tells the temperature, right? In self-report instruments, reliability is a major concern. The most common standard for reliability is internal consistency, measured on groups of three or more highly similar items (i.e. a factor) and reported as Cronbach's alpha. However, test-retest reliability and other measures are also used. The main point is that reliability is a judgment about an instrument.
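
    For readers who want to see the arithmetic behind the coefficient, the following is a minimal sketch of how Cronbach's alpha can be computed from a matrix of item responses. The data and function name are invented for illustration and are not drawn from any of the instruments under review.

        import numpy as np

        def cronbach_alpha(items):
            """Cronbach's alpha for an (n_respondents, n_items) response matrix."""
            items = np.asarray(items, dtype=float)
            k = items.shape[1]                          # number of items in the scale
            item_vars = items.var(axis=0, ddof=1)       # variance of each item
            total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale score
            return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

        # Hypothetical responses: five respondents answering a four-item scale on a 1-5 format.
        responses = np.array([[4, 5, 4, 4],
                              [2, 2, 3, 2],
                              [3, 3, 3, 4],
                              [5, 4, 5, 5],
                              [1, 2, 1, 2]])
        print(round(cronbach_alpha(responses), 3))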

    An alpha of 0.9 or above on six or fewer items is excellent (Nunnally, 1978). As the number of items in each factor or scale increases, it becomes easier to achieve a 0.9 alpha. For example, a factor of ten or more items must be examined carefully regardless of how high the alpha may be. There are no absolute cut points for the alpha reliability coefficient because it is so sensitive to the number of items in a given scale. However, alphas between 0.8 and 0.9 are relatively good, alphas between 0.7 and 0.8 need improvement, and alphas below 0.7 are valuable for research and further development but not particularly useful as a reliable indicator. But, for example, an alpha of 0.82 on three or four items is almost certainly better than 0.88 on nine or ten items. (See Nunnally, 1978, and Ghiselli, et al., 1981, for an in-depth consideration of reliability.)
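
    The dependence of alpha on scale length can be made concrete with the standardized form of the coefficient, alpha = k * r / (1 + (k - 1) * r), where k is the number of items and r is the average inter-item correlation. The sketch below holds an invented inter-item correlation of 0.4 fixed simply to show how alpha climbs as items are added.

        def alpha_from_items(k, mean_inter_item_r):
            """Standardized Cronbach's alpha for k items with a given average inter-item correlation."""
            return (k * mean_inter_item_r) / (1 + (k - 1) * mean_inter_item_r)

        # Holding the average inter-item correlation fixed at 0.4, alpha rises
        # purely as a function of the number of items in the scale.
        for k in (3, 6, 10, 15):
            print(k, round(alpha_from_items(k, 0.4), 2))
        # 3 -> 0.67, 6 -> 0.8, 10 -> 0.87, 15 -> 0.91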

    Validity concerns both an instrument and the entire experimental, social, and cultural circumstances, as well as the time frame, in which an instrument is administered. So validity for an instrument always asks cultural and historical questions. If we say that an instrument is valid, what we are really saying is that it has been shown to have certain good characteristics of validity relevant to a given population, culture, and time. The quality of the scientific method is also an issue for validity. An instrument must be both constructed and used correctly in order to meet good standards of validity. Shadish, Cook and Campbell (2002) discuss statistical, internal, construct, and external validity. A good instrument must have no fatal flaws and must show reasonable results on all four types of validity. But, once again, this is not a simple yes-no question. Every instrument has its strengths and weaknesses.

    For the purposes of the review in the JOEM article (Loeppke, et al., 2003), the focus was on questions of the validity of the instrument: 1) independent of all other factors, and 2) in the context of health impairment, work, and other relevant experiences of participants. The list of analyses requested for this review can be divided into these two categories: 1) the performance of individual items, factors, or scales in the instrument, irrespective of any characteristics of the sample; 2) how the instrument works in context: a) does the instrument distinguish between conceptually different issues? b) does it agree with different measures of the same construct? Most of this review has focused on the former question because most of the analyses provided address internal questions. Yet some of the most important questions raised in this review concern how the instruments perform in context, i.e. what the results mean. Of course, the biggest questions - particularly those concerned with improving the instruments - overlap both areas.

    Instrument performance:

    Instrument construction is both science and art. There is no one correct way to ask a question or to measure a problem. Finding a good measure is both an intuitive and an empirical task. It is all hard work.

    With respect to the instruments reviewed by the ACOEM Expert Panel, the skewed distributions represent one of the most basic and most important problems to be addressed. The skew in many of the measurements reflects a genuinely low frequency of occurrence in the sample populations. The problem is that skewed distributions can seriously distort any statistical analyses based on such items. Solving this problem likely requires a three-step approach:

    1. Identifying the actual occurrences in the population, e.g. distinguishing impaired individuals from fundamentally unimpaired ones (the group that produces the skewed results on such items);
    2. Measuring the range of difficulties for the impaired population; and
    3. Measuring the range of difficulties for the unimpaired population, requiring an extreme sensitivity to very minor difficulties.

    The goal would be to produce relatively normal distributions, while acknowledging that the preliminary assessment is a form of screen to measure the incidence or prevalence of a problem in the population. Most of the instruments reviewed tried to strike a compromise, remaining sensitive to minor impairment without asking every question twice. The result is that, overall, the instruments seem to produce more normal distributions when the population is most impaired. This is not a criticism of the instruments. Rather, it points to the need to consider developing assessments targeted at minor, temporary irritations, as distinct from chronic health problems.
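
    As a rough illustration of the screen-then-measure logic in the three steps above, the sketch below separates an invented column of 0-10 impairment ratings into unimpaired and impaired groups before computing any summary statistics; the data, the zero threshold, and the variable names are hypothetical.

        import numpy as np
        from scipy.stats import skew

        # Hypothetical 0-10 impairment ratings; most respondents report no impairment,
        # which produces the heavy positive skew discussed above.
        ratings = np.array([0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 0, 3, 0, 6, 0, 0, 4, 0, 0, 8])
        print("skewness, full sample:", round(float(skew(ratings)), 2))

        # Step 1: screen -- separate the unimpaired (rating == 0) from the impaired.
        impaired = ratings[ratings > 0]
        unimpaired_n = int((ratings == 0).sum())

        # Steps 2-3: describe severity within each group, where the distribution is
        # better behaved than in the pooled sample.
        print("unimpaired n:", unimpaired_n)
        print("impaired n:", impaired.size, "mean severity:", round(float(impaired.mean()), 2))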

    All of the instruments discussed in the JOEM article (Loeppke, et al., 2003) have used a retrospective assessment of impairment. This is just about the only reasonable way to start off such a program of study. However, it is only one way to complete such a study. For a working population experiencing health problems, we can assume that some days are better than others, and some impairments are more severe than others. Asking survey participants to "average" their experiences may be adequate, and it may be easy to produce a number. But without a concrete demonstration of convergent validity using such an approach with a specific instrument, it is a big question and a big gamble. 

    A prospective assessment with a daily diary or some other frequent reporting/measuring technique is a time-consuming but much more effective approach. Participants gradually develop their own self-calibration on the assessment scale(s), and measurements may become more reliable. Further, in a well-constructed study, within-subject variation in impairment and work performance is documented and brought into the design. No longer must the researcher be content with between-subject variation only, or with a point-prevalence "snapshot" measure. Some of the items in the instruments under review may produce shaky or indeterminate results when asked only one time, but might produce highly reliable and usable measurements if taken once or even several times per day for several weeks.
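
    The sketch below shows the kind of decomposition a diary design makes possible: with repeated daily ratings per person, within-person (day-to-day) variation can be separated from between-person variation. The data, column names, and one-week window are invented for illustration.

        import pandas as pd

        # Hypothetical diary data: daily 0-10 productivity ratings for three workers over one week.
        diary = pd.DataFrame({
            "worker": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
            "rating": [7, 6, 8, 7, 7,  4, 5, 3, 6, 4,  9, 8, 9, 9, 8],
        })

        person_means = diary.groupby("worker")["rating"].mean()
        between_var = person_means.var(ddof=1)                              # differences between workers
        within_var = diary.groupby("worker")["rating"].var(ddof=1).mean()   # day-to-day variation within a worker

        print("between-person variance:", round(between_var, 2))
        print("within-person variance:", round(within_var, 2))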

    The assessment of productivity is one of the questions that concerns both the performance of the instrument and the context in which the instrument is embedded. With respect to instrument performance, we must ask whether a person is capable of making such a complex determination or, more properly, what the number a subject gives us really means. As noted above, the assessment of degraded productivity as a result of health impairment asks several questions at once:

    1. An assessment of productivity for a given period of time, where productivity includes:
      1. workload
      2. effective actions to deal with the workload
      3. ineffective actions in dealing with the work demands
      4. workplace distractions
      5. accomplishments, whether measurable or not
    2. An assessment of health impairment, including
      1. the type of health impairment
      2. the severity of the impairment
      3. the varying impact of the health problem on work effectiveness
      4. the consequences of the impairment in delaying accomplishments
    3. An assessment that balances time and quality, including
      1. typical range of time required to produce a "unit" of work
      2. typical range of acceptable quality of a "unit" of work
      3. reduction in quality or extension of time when impaired
    4. A rating made on a 0-10 (11-point) ordinal scale, with no demonstration that participants can rate accurately on it
    5. Potential influence of actor-observer bias
    6. Other, unmeasured workplace factors

    When the presumed decrement in productivity is asked as a single question, analyzed as a percent of normal productivity, and then multiplied against salary, the result is a number that is equated to "dollars lost due to the impairment." Maybe so. However, without some independent confirmation, such a figure is wishful thinking more than grounded research. 
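
    The calculation being questioned is simple arithmetic, and a sketch of it makes the leap of inference explicit. All numbers below are invented for illustration.

        # Hypothetical single-item result: a worker reports productivity of 7 on a 0-10 scale
        # while impaired, which the naive approach treats as 70% of normal output.
        reported_rating = 7
        annual_salary = 50_000

        percent_of_normal = reported_rating / 10              # the unverified ratio-scale assumption
        dollars_lost = annual_salary * (1 - percent_of_normal)
        print(f"claimed loss: ${dollars_lost:,.0f}")          # $15,000 -- only as credible as the assumptions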

    First, the rating scale is scored by subjects. There is no showing that the subject workers have any ability or skill to rationally and accurately assess decrements in productivity to +/- 10% (or less, in some cases). Try this: every day you leave home to go to work, and every day you leave work to go home, write down what you think the outside temperature is. Then check your estimate against a thermometer, and write that down as well. Use the same thermometer(s), in the same location(s). Do this every day for several months. After a few days, you may find yourself able to guess the temperature to within a few degrees. But notice how your own states of activity, health, and fatigue can skew your estimate. What you will quickly realize is that you first must assess your own body temperature and current metabolic rate before you can consider what the outside temperature is. This simple experiment asks only for the outside temperature. How much more difficult is it to assess one's overall productivity? 

    Second, there is no showing that the assessment metric remains consistent throughout the entire range of the scale. Without convergent measures, the very best that can be made of this sort of rating scale is an ordinal ranking, e.g. "8" is less than "9" and more than "7." There is no showing that the difference between 7 and 8 is equivalent to the difference between 8 and 9. Yet this is a fundamental assumption when researchers treat such a score as a ratio scale. In short, taking a single 0-10 productivity rating and treating it as a percent score is a violation of assumptions and a threat to statistical conclusion validity (Shadish, Cook, and Campbell, 2002).

    Third, without independent, i.e. convergent, assessments on a variety of other productivity measures, there is no showing that such measures are actually assessing productivity. The list of questions above describes some of the major issues to be assessed and correlated with a measure of productivity. If convergence were demonstrated, then there would be some basis on which to assess the construct validity of this type of measure. Without good convergence from other measures, we have a single indicator of something that may or may not relate to actual productivity.

    Fourth, the many topics that are relevant to an assessment of productivity suggest that a multi-dimensional set of measures, targeting related constructs within the broad category of productivity, would produce a more comprehensive, as well as more comprehensible, assessment and would improve the construct validity of the productivity measures.

    Based on the preceding points, results from single global productivity measures may be rank-ordered, i.e. judged larger or smaller, but should not be treated as point percentage estimates.
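
    To make the "rank-ordered, not percentages" recommendation concrete, the brief sketch below contrasts a rank-based correlation with the interval-scale treatment criticized above; the paired data are invented.

        from scipy.stats import pearsonr, spearmanr

        # Hypothetical paired data: 0-10 self-ratings and an objective output count.
        self_ratings = [2, 3, 5, 6, 7, 8, 9, 9, 10, 10]
        units_produced = [12, 15, 20, 30, 33, 60, 62, 70, 72, 75]

        # Spearman's rho uses only rank order, which is all an ordinal scale supports.
        rho, _ = spearmanr(self_ratings, units_produced)
        # Pearson's r assumes equal intervals between scale points -- the unverified assumption above.
        r, _ = pearsonr(self_ratings, units_produced)
        print(round(rho, 2), round(r, 2))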

    One must also consider the phenomenon in attribution theory in social psychology known as "actor-observer bias" (e.g. Sears, Peplau, and Taylor, 1991). The question here is the overall ability of an individual to provide a balanced judgment regarding his/her performance or productivity - especially when the particular assessment in question is the degree of decrement in performance. Work performance is a sensitive topic for anyone to respond to. Even when confidentiality is assured, participants are likely to be cautious about how they respond. And they are likely to be unresponsive if there is an implication of blame for failure attributed back to the respondents or to their health problems.

    Overcoming the numerous problems inherent in assessing personal productivity is an empirical question. It is likely that such assessments will be more like judgments and less like emotional reactions (i.e. sentiments, Nunnally, 1978) when there is an emphasis on:

    1. Positive productivity assessments;
    2. Environmental problems and distractions;
    3. Frequently repeated assessments of productivity over several weeks or months;
    4. Multiple dimensions of productivity included in the assessment;
    5. Problem solving, not victim-blaming; and
    6. A perspective that suggests that everyone has health problems that must be accommodated in the workplace.

    As can be seen, the assessment of productivity is one of measurement in context. We turn to the contextual issues.

    Contexts of measurement:

    The literature details some of the problems inherent in assessing productivity. One very important component in the assessment of productivity is workload. Workload itself is a multi-dimensional set of constructs. The levels of physical, cognitive, emotional, and related demands at work all contribute to overall workload. The degree of control that workers have over various demands, the levels of instrumental and emotional social support, the short- and long-term demands of bosses, and the overall culture at work all contribute to an employee's determination of his/her level of productivity. Except where the workload never varies, it is necessary to monitor changes in the workload.

    Overall:

    In taking note of the preceding criticisms of the problems inherent in a single measure of productivity, it is important to resist the tendency to rate instruments under a single rubric of "science" or "validity." Each productivity instrument has strengths and weaknesses. The important question to ask is what efforts are underway to eliminate the weaknesses and to improve the overall quality of the instrument. There is no perfect solution to measuring productivity, whether or not in the context of health impairment; what is important is an effort of continuous improvement of the measures. Researchers considering using one or more of the currently available instruments should be prepared to conduct their own confirmatory analyses.

    Loeppke R, Hymel PA, Lofland JH, Pizzi LT, Konicki DL, Anstadt GW, Baase C, Fortuna J, Scharf T. Health-related workplace productivity measurement: General and migraine-specific recommendations from the ACOEM Expert Panel. J Occup Environ Med. 2003; 45: 349-359.

    Shadish WR, Cook TD, Campbell DT. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.