Using the Binomial Effect Size Display (BESD) to Present the Magnitude of Effect Sizes to the Evaluation Audience

Justus Randolph

Recent years have witnessed a growing number of published reports that point out the need for reporting various effect size estimates in the context of null hypothesis testing (H0) as a response to a tendency for reporting tests of statistical significance only, with less attention on other important aspects of statistical analysis. In the face of considerable changes over the past several years, neglect to report effect size estimates may be noted in such fields as medical science, psychology, applied linguistics, or pedagogy. Nor have sport sciences managed to totally escape the grips of this suboptimal practice: here statistical analyses in even some of the current research reports do not go much further than computing p-values. The p-value, however, is not meant to provide information on the actual strength of the relationship between variables, and does not allow the researcher to determine the effect of one variable on another. Effect size measures serve this purpose well. While the number of reports containing statistical estimates of effect sizes calculated after applying parametric tests is steadily increasing, reporting effect sizes with non-parametric tests is still very rare. Hence, the main objectives of this contribution are to promote various effect size measures in sport sciences through, once again, bringing to the readers’ attention the benefits of reporting them, and to present examples of such estimates with a greater focus on those that can be calculated for non-parametric tests.

Effect size is a statistical concept which measures the strength of the relationship between two variables. Effect size has basic properties such as measurement unit independence, sample size independence, and monotonicity. In particular, unlike statistical significance test, effect size is not influenced by sample size. Effect size avoids various problems in statistical significance tests. It is one of the important contents in constructing new statistics. In present study, various effect sizes were mathematically described and a free desktop calculator for effect sizes was presented. Calculator and user manual guide can be freely downloaded at : http://www.iaees.org/publications/journals/ces/articles/2023-13(4)/e-suppl/Zhang-Supplementary-Materials.rar ; Details can be found at: http://www.iaees.org/publications/journals/ces/articles/2023-13(4)/4-Zhang-Abstract.asp ; The calculator is subject to any possible updates in the future.

A peer-reviewed electronic journal. Copyright is retained by the first or sole author, who grants right of first publication to the Practical Assessment, Research & Evaluation. Permission is granted to distribute this article for nonprofit, educational purposes if it is copied in its entirety and the journal is credited. Volume 10 Number 14, September 2005 ISSN 1531-7714 Using the Binomial Effect Size Display (BESD) to Present the Magnitude of Effect Sizes to the Evaluation Audience Justus J. Randolph, University of Joensuu, Finland & R. Shawn Edmondson, Utah State University The use of Rosenthal’s Binomial Effect Size Display (BESD) as a tool for reporting the magnitude of effect sizes to the evaluation audience is discussed. The authors give an overview of the BESD, describe how it is calculated, and present a review of its strengths and weaknesses. Additionally, suggestions for the appropriate use of the BESD are given. An effect size to BESD conversion table is included. Effectively communicating evaluation information to clients is a critical part of the evaluation process, yet many evaluators fail to give it careful consideration (Worthen, Sanders, & Fitzpatrick, 1997). The value of evaluation information, regardless of how scientifically defensible it is, can be undermined when it is not presented in language which is practical and easily understood by the intended audience. Weiss and Bucuvalas (as cited in Worthen et al., 1997) suggest that there are two important types of values to consider to ensure that evaluation information is useful to policy makers: Truth value refers to the technical quality of the study and to whether the findings correspond to policy makers’ previous understanding and experience with how the world works (expectations) . . . Utility value refers to the extent to which the study provides explicit and practical direction on matters the policy makers can do something about and challenges the status quo (with new formulations and approaches). (p. 410) One way that evaluators can increase the utility value of evaluation information is to make the statistics that they report more meaningful to the evaluation audience. May (2004) provides three guidelines for making statistics more meaningful: 1. Understandability: the results should be reported in a form that is easily understood by most people by making minimal assumptions about the statistical knowledge of the audience and avoiding statistical jargon. 2. Interpretability: a statistic is interpretable when the metric or unit of measure that it is based upon is familiar or easily explained. 3. Comparability: the reported sizes of the statistics that might be compared can be 2 Practical Assessment Research & Evaluation, Vol 10, No 14 Randolph & Edmondson, Presenting BESD compared directly, without any further manipulation. (pp. 527-528) The r-based binomial effect size display (BESD) is particularly strong in terms of understandability and, to a lesser extent, interpretability. Because of the understandability and interpretability of the BESD, we argue that it can help improve the utility value of evaluation. In the next section of this article we describe the BESD and its calculations. We then give an overview of its benefits and disadvantages. We end with our suggestions for presenting the BESD to the evaluation audience. THE BINOMIAL EFFECT SIZE DISPLAY Although the coefficient of determination (r2) is often used as an effect size that describes the proportion of variance accounted for in a dependent variable by predictor variables, this technique is problematic. Squaring r can make a practically significant effect appear to be insignificant, especially to a lay audience. The BESD is a tool that may be used to display the practical importance of an effect without relying solely on r or r2 values or other less intuitive effect size measures (Rosenthal & Rubin, 1982). (In this article, by BESD we refer to only the r-based BESD and not to raw-data two-by-two tables. See May [2004] for a discussion on making raw-data two-bytwo tables more meaningful.) The BESD illustrates the practical importance of an effect by displaying a point-biserial r as a twoby-two contingency table. It presents the correlation simply and intuitively as the difference in outcome rates between experimental and control groups. The rows in a BESD table display the independent variable as a dichotomous predictor, such as belonging to an experimental or control group. The columns in the table display the dependent variable as a dichotomous outcome, such as improved and not improved. The row and column totals always add up to 100. Table 1 presents several different examples of BESDs along with their associated correlations. As Table 1 illustrates, the effect size of a metaanalysis of psychotherapy interventions was reported to be r = .32. The BESD shows a psychotherapy success rate of 66% and a control group success rate of 34%. As Rosenthal, Rosnow, and Rubin (2000) state, “ . . . an r of .32 (or an r2 of .10) will amount to a difference between rates of improvement of 34% and 66% if half the population received psychotherapy and half did not, and if half the population improved and half did not” (p.17). If psychotherapy had no effect, then each cell in the BESD table would have been 50%. The values in the BESD should be interpreted as “standardized” percentages, where the percentages within the cells have been set so that all margins are equal. By adjusting the percentages in each of the four cells of a two-by-two table so that row and column margins are equal, the BESD maximizes the symmetry of the cells. It is also important to recognize that the BESD assumes a 50% base-rate for both the experimental and control groups - an artificial situation created to illustrate the impact of the effect. CALCULATING THE BESD Creating a BESD for two groups with equal nsize and with homogenous variances is straightforward. To calculate the success rate of the treatment group, the formula, (.50 + r/2), is used. To calculate the success rate of the control group the formula, (.50 – r/2), is used. For example, if the value of r is 0.07, as in the Vietnam service and alcohol use example in Table 1, then the success rate of the treatment group is (.50 + 0.07/2) = 53.5%. The success rate of the control group is (.50 – 0.07/2) = 46.5%. Taking into consideration how cells A, B, C, and D are positioned in Table 2, putting the treatment group success rates into cells A and B and the control group success rates into cells C and D creates a BESD table identical to the Vietnam and alcohol use example in Table 1. 3 Practical Assessment Research & Evaluation, Vol 10, No 14 Randolph & Edmondson, Presenting BESD Table 1: Examples of Binomial Effect Size Displays Measure Variable Vietnam service and alcohol problems (r = .07) 53.5 46.5 100.0 No problem 46.5 53.5 100.0 100 100 200 Death 38.5 61.5 100.0 Survival 61.5 38.5 100.0 100 100 200 Less benefit 34.0 66.0 100.0 Greater benefit 66.0 34.0 100.0 100 100 200 Problem Vietnam veteran Non-Vietnam veteran Total Total AZT in the treatment of AIDS (r = .23) AZT Placebo Total Benefits of psychotherapy (r = .32)a Psychotherapy Control Total Note. AZT = aziothymidine, AIDS = acquired immune deficiency syndrome. From “How are we doing in soft psychology?” by R. Rosenthal, 1990, American Psychologist, 50, p. 776. Copyright 1990 by the American Psychological Association. Reprinted with permission. aThe analogous r for 345 studies of interpersonal expectancy effects was essentially the same (Rosenthal & Rubin, 1978). Below, Strahan (1991) explains how the BESD formulas presented in the preceding paragraph are derived from the phi coefficient formula: By assuming an equal-marginals contingency table, one can solve for any one – hence for all four – of the cell frequencies [in a BESD display] by working backward from the phi coefficient formula. Specifically, setting a + b = c + d = a + c = b + d = 100, it follows algebraically that d = a and b = c = 100 – a, so that from phi = r = (ad – bc)/[(a + b) (c + d) (a + c) (b + d)]1/2, one gets r = a/50 – 1, and a = 50 + 50r. . . . From this it follows that the treatment success rate is .50 + r/2, and the control success rate is .50 – r /2. (p. 1084) Table 2: BESD Template Group Improved Treatment Control Total (A) (C) 100 Didn’t Improve (B) (D) 100 Total 100 100 200 A BESD table can also be calculated from a standardized mean difference effect size (Cohen’s d) using the formula, r = d/√ (d2 + 4), when there are two groups with equal n-size. The BESD table can then be calculated from r using the formulas given above. Rosenthal et al. (2000) is an excellent resource for calculating BESD values for unequal nsizes or for experiments involving more than two groups. Table 3 gives a list of Cohen’s d values (from 0 to 3 in intervals 0.10), their associated r 4 Practical Assessment Research & Evaluation, Vol 10, No 14 Randolph & Edmondson, Presenting BESD values, and BESD values for two groups with equal n-size. Table 3: Effect Size to BESD Conversions BESD Cells Cohen’s r A B C D 0.00 0.00 50.0 50.0 50.0 0.10 0.05 52.5 47.5 47.5 0.20 0.10 55.0 45.0 45.0 0.30 0.15 57.4 42.6 42.6 0.40 0.20 59.8 40.2 40.2 0.50 0.24 62.1 37.9 37.9 0.60 0.29 64.4 35.6 35.6 0.70 0.33 66.5 33.5 33.5 0.80 0.37 68.6 31.4 31.4 0.90 0.41 70.5 29.5 29.5 1.00 0.45 72.4 27.6 27.6 1.10 0.48 74.1 25.9 25.9 1.20 0.51 75.7 24.3 24.3 1.30 0.54 77.2 22.8 22.8 1.40 0.57 78.7 21.3 21.3 1.50 0.60 80.0 20.0 20.0 1.60 0.62 81.2 18.8 18.8 1.70 0.65 82.4 17.6 17.6 1.80 0.67 83.4 16.6 16.6 1.90 0.69 84.4 15.6 15.6 2.00 0.71 85.4 14.6 14.6 2.10 0.72 86.2 13.8 13.8 2.20 0.74 87.0 13.0 13.0 2.30 0.75 87.7 12.3 12.3 2.40 0.77 88.4 11.6 11.6 2.50 0.78 89.0 11.0 11.0 2.60 0.79 89.6 10.4 10.4 2.70 0.80 90.2 9.8 9.8 2.80 0.81 90.7 9.3 9.3 2.90 0.82 91.2 8.8 8.8 3.00 0.83 91.6 8.4 8.4 (Rosenthal, 1990). That is, it has more of what May (2004) calls understandability and interpretability than r, r2, or d. • It is easy to compute. D 50.0 52.5 55.0 57.4 59.8 62.1 64.4 66.5 68.6 70.5 72.4 74.1 75.7 77.2 78.7 80.0 81.2 82.4 83.4 84.4 85.4 86.2 87.0 87.7 88.4 89.0 89.6 90.2 90.7 91.2 91.6 STRENGTHS OF THE BESD Under certain conditions, the BESD has several strengths that make it desirable for reporting the practical significance of effect sizes to lay audiences. Its strengths are listed below: • The BESD is intuitively understood by lay audiences compared to somewhat complicated statistics such as r, r2, or d • It is appropriate for understanding r2 in the context of, as Rosenthal et al. calls them, “the ‘softer, wilder’ areas of the social and behavioral sciences – where the results often seem ephemeral and unreplicable, and where r2 seems always to be too small” (2000, p. 25). • It allows an evaluator to present a two-by-two table to evaluation audiences when there is not enough information to construct a twoby-two table from raw data. This might be the case when it is necessary to illustrate the magnitude of a combined, meta-analytic effect size to an audience who would understand and interpret a two-by-two display much better than a statisticallyloaded effect size, like r, r2, d, or an odds ratio. CRITICISMS OF THE BESD Researchers have criticized the BESD for a number of reasons (Crow, 1991; Strahan; 1991, McGraw, 1991; and Thompson & Schumaker, 1997). Some critics argue that the BESD is a misleading ‘what-if’ technique. Other critics argue that Rosenthal’s BESD distorts results as a function of the symmetry of the raw data of the cells. These criticisms are explained in more detail below. First, although the BESD may be intuitive for lay audiences, McGraw argues that “creating an artificial case that is correlationally equivalent to the original case so distorts the original data that the exercise is terribly misleading” (1991, p. 1084). For example, the lay audience may mistake the ‘standardized’ percentages of the BESD for the actual raw data if the BESD is not carefully explained. What’s more, Strahan (1991) calls the BESD a “what if?” statistical technique, such as analysis of covariance, and therefore has all the faults of “what if?” techniques. Second, Thompson and Schumacker (1997) make the case that as the asymmetry of the BESD Practical Assessment Research & Evaluation, Vol 10, No 14 Randolph & Edmondson, Presenting BESD increases (i.e., the cells diverge further from 50%) the interpretation of the effect becomes more erroneous. For example, when the binary success rate is symmetrical at 50%, the percentage difference between Φ and d is 0 (Φ is the Phi coefficient - the measure of the degree of association between two binary variables.) However, when the binary success rate is 100%, the percentage difference between Φ and d is 100. Rosenthal (1991) responds to these criticisms by saying that there are instances where the Pearson product moment correlation and its equivalents (i.e., the BESD) are not the proper effect sizes to report. Rosenthal concedes that in cases where asymmetry is very pronounced, the relative risk index or the difference in raw proportions are the most appropriate estimators of effect size. However, when holding the value of r constant but changing the symmetry of the cells, Rosenthal has shown that the differences of percentages between cases of minimum symmetry and maximum symmetry (when the BESD is used), vary only slightly. Other measures of effect size like relative risk or odds ratios vary considerably when the value or r for a two-by-two table is held constant but the symmetry of the cells is changed (see Rosenthal, 1991). Rosenthal often emphasizes that the BESD is a standard format for display of the Pearson correlation, and therefore, the propriety of reporting a BESD is conditional upon the propriety of reporting a Pearson correlation. Rosenthal notes that When used appropriately, the BESD has been used to excellent advantage by methodologically sophisticated behavioral researchers and by experienced mathematical statisticians . . . but we [Donald Rubin & Robert Rosenthal] are certainly agreed that the BESD is not the only way to tell how well we are doing in behavioral research. (p. 1087) APPROPRIATE USE OF THE BESD We believe that the BESD is a useful tool for reporting to evaluation audiences in two cases. The first case is when one wants to answer the questions that the BESD is meant to answer – “What would 5 the correlationally equivalent effect of the treatment be if 50% of the participants had the occurrence and 50% did not and 50% received treatment and 50% did not?” This use of the BESD would be valuable when there is a need to illustrate that a value of r or r2 that otherwise appears negligible may have practical significance, as was the case in the Vietnam and alcohol use example in Table 1. However, one must realize that the difference in percentages between a raw data two-by-two table and its corresponding r-based BESD is greatest when the symmetry of the cells in the raw data twoby-two table is lowest. In short, we agree with Rosenthal (1991) that there is no easy answer to which type of effect size to report under each of the various degrees of asymmetry. The second case in which we believe the BESD is a useful tool is when there are not sufficient data to construct a raw-data two-by-two table. This often occurs with the results reported in meta-analyses. The authors of meta-analyses, and others who report on the results of a meta-analyses, often give effect sizes but do not, or cannot, report the aggregate, actual numbers of participants who improved or did not in each condition. Since effect sizes without raw data (e.g., when the actual proportion of the treatment group that improved is not known) abound in meta-analytic reports of research, the BESD can be put to good use to ‘reframe’ those effect sizes in an intuitive binomial display. To use an anecdotal example, one of the authors attempted to report the results of a metaanalysis on after-school research to a group of evaluation stakeholders who were considering planning and evaluating their own after-school program. In order to make their decision whether to implement the program, they wanted to know what kind of results, in terms of academic achievement, they could expect. There were no raw data reported in the meta-analysis; only that d was 0.13 in the after-school direction. After repeated attempts and diagrams to explain the interpretation of a d of 0.13, or an r of .065, the audience still was perplexed. However, after presenting the BESD that corresponds with a d or 0.13 (i.e., 53.5% of students improved in the after-school condition), the evaluation audience seemed to grasp how large an effect a d of .13 actually is. They commented that 6 Practical Assessment Research & Evaluation, Vol 10, No 14 Randolph & Edmondson, Presenting BESD the BESD should have just been shown in the first place. CONCLUSION Personal experience has shown us that statistically-laden effect sizes like r or d can be daunting for many evaluation audiences. However, most audiences seem to have little difficulty understanding percentages presented in two-by-two tables. In the absence of raw data, (e.g., an effect size from a meta-analysis) the BESD can be a useful substitute for showing the hypothetical magnitude of an effect if the assumptions of the BESD are addressed. We agree with Rosenthal that “there is no right answer to which (indicator of effect size) is best or most useful under all conditions” (1991, p. 1086). Given the importance and difficulty of presenting statistically complicated results to lay evaluation audiences, it is useful for evaluators to report statistics, depending on the case, in many ways (e.g., difference in percentages, relative risk, odds ratios, proportions, standardized mean difference effect sizes, and correlational effect sizes). The BESD, when carefully used, is one of many ways that evaluators can put statistics to use to increase the utility value of program evaluation. REFERENCES Crow, E. L. (1991). Response to Rosenthal's comment "How are we doing in soft psychology?". American Psychologist, 46(10), 1083. May, H. (2004). Making statistics more meaningful for policy and research and program evaluation. American Journal of Program Evaluation, 25, 525540. McGraw, K. O. (1991). Problems with the BESD: A comment on Rosenthal's "How are we doing in soft psychology?". American Psychologist, 46(10), 1084-1086. Rosenthal, R. (1990). How are we doing in soft psychology? American Psychologist, 45, 775-777. Rosenthal, R. (1991). Effect sizes: Pearson's correlation, its display via the BESD, and alternative indices. American Psychologist, 46(10), 1086-1087. Rosenthal, R., & Rubin, D. B. (1978). Interpersonal expectancy effects: The first 345 studies. The Behavioral and Brain Sciences, 3, 377-386. Rosenthal, R., & Rubin, D. B. (1982). A simple general purpose display of magnitude and experimental effect. Journal of Educational Psychology, 74, 166-169. Rosenthal, R., Rosnow, R. L., & Rubin, D. B. (2000). Contrasts and effect sizes in behavioral research: A correlational approach. Cambridge: Cambridge University Press. Strahan, R. F. (1991). Remarks on the binary effect size display. American Psychologist, 46(10), 10831084. Thompson, K. N., & Schumacker, R. E. (1997). An evaluation of Rosenthal and Rubin's binomial effect size display. Journal of Educational and Behavioral Statistics, 22(1), 109-117. Worthen, R. B., Sanders, J. R., & Fitzpatrick, J. L. (1997). Program evaluation: Alternative approaches and practical guidelines (2nd ed.). New York: Longman. Note A previous version of this paper was delivered at the Annual Meeting of the American Evaluation Association; Reno, Nevada, November 5th through 8th, 2003. Practical Assessment Research & Evaluation, Vol 10, No 14 Randolph & Edmondson, Presenting BESD 7 Citation Randolph, Justus J. & R. Shawn Edmondson (2005). Using the Binomial Effect Size Display (BESD) to Present the Magnitude of Effect Sizes to the Evaluation Audience. Practical Assessment Research & Evaluation, 10(14). Available online: http://pareonline.net/getvn.asp?v=10&n=14 Authors Justus J. Randolph, Department of Computer Science, University of Joensuu, Finland; R. Shawn Edmondson, Department of Psychology, Utah State University. Correspondence concerning this article should be addressed to Justus J. Randolph, Department of Computer Science, University of Joensuu, PO BOX 111, FIN-80101, Finland. E-mail: justus.randolph@cs.joensuu.fi

RELATED PAPERS

RELATED TOPICS

Log In

Using the Binomial Effect Size Display (BESD) to Present the Magnitude of Effect Sizes to the Evaluation Audience

Using the Binomial Effect Size Display (BESD) to Present the Magnitude of Effect Sizes to the Evaluation Audience

Related Papers

RELATED PAPERS

RELATED TOPICS