November 30, 2004

On Evaluating Curricular Effectiveness: Judging the Quality of K-12 Mathematics Evaluations (2004)

Under the auspices of the National Research Council, this committee's charge was to evaluate the quality of the evaluations of the 13 mathematics curriculum materials supported by the National Science Foundation (NSF), representing an estimated $93 million investment, and of 6 commercially generated mathematics curriculum materials (listed in Chapter 2).
The committee emphasizes that it was not charged with evaluating the effectiveness of the curricula themselves, and therefore did not do so; its task was to judge the quality of the evaluations of those curricula.

ASSESSMENT OF EXISTING STUDIES

Content analyses focus almost exclusively on examining the content of curriculum materials; these analyses usually rely on expert review and judgments about such things as accuracy, depth of coverage, or the logical sequencing of topics. For the 36 studies classified as content analyses, the committee drew on the perspectives of eight prominent mathematicians and mathematics educators, in addition to applying the criterion that a review cover at least one full year of curricular material. All 36 studies of this type were retained for further analysis by the committee.

Comparative studies involve the selection of pertinent variables on which to compare two or more curricula and their effects on student learning over significant time periods. For the 95 comparative studies, the committee stipulated that they had to be "at least minimally methodologically adequate," which required, among other criteria, that a study:

• Include quantifiably measurable outcomes such as test scores, responses to specified cognitive tasks of mathematical reasoning, performance evaluations, grades, and subsequent course taking.

Case studies focus on documenting how program theories and components of a particular curriculum play out in a particular real-life situation. These studies usually describe in detail the large number of factors that influence implementation of that curriculum in classrooms or schools. Of the 45 case studies, 13 were eliminated, leaving 32 that met our standards of methodological rigor.

The committee thus had a total of 147 studies that met our minimal criteria for consideration of effectiveness, barely more than 20 percent of the total number of submissions with which we began our work. Seventy-five percent of these studies were related to the curricula supported by the National Science Foundation; the remaining studies concerned commercially supported curricular materials. On the basis of this body of evidence, the committee concluded that the available studies do not permit confident judgments about the effectiveness of these curricula. This inconclusive finding should not be interpreted to mean that the curricula are not effective, but rather that problems with the data and/or study designs prevent confident judgments about their effectiveness. Inconclusive findings such as these do not permit one to determine whether the programs overall are effective or ineffective.
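The screening procedure described above can be made concrete with a short sketch. This is purely illustrative and not the committee's actual procedure; the record fields, and the reduction of each study type to a single surviving criterion, are hypothetical stand-ins for the fuller rules named in the text.

```python
from dataclasses import dataclass

@dataclass
class Study:
    """Hypothetical record for one submitted evaluation study."""
    kind: str                          # "content", "comparative", or "case"
    years_reviewed: float = 0.0        # content: years of material covered
    measurable_outcomes: bool = False  # comparative: test scores, grades, ...
    rigorous_methods: bool = False     # case: meets methodological standards

def passes_screen(s: Study) -> bool:
    """Screening rules paraphrased from the summary (other criteria elided)."""
    if s.kind == "content":
        # Full review of at least one year of curricular material.
        return s.years_reviewed >= 1.0
    if s.kind == "comparative":
        # One of the "minimally methodologically adequate" criteria:
        # quantifiably measurable outcomes must be included.
        return s.measurable_outcomes
    if s.kind == "case":
        return s.rigorous_methods
    return False

def tally(studies: list) -> dict:
    """Count submissions versus studies retained for further analysis."""
    retained = [s for s in studies if passes_screen(s)]
    return {"submitted": len(studies), "retained": len(retained)}
```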
A FRAMEWORK FOR FUTURE EVALUATIONS

The committee recommends that individuals or teams charged with curriculum evaluations make use of this framework. The framework has three major components that should be examined in each curriculum evaluation: (1) the program materials and design principles; (2) the quality, extent, and means of curricular implementation; and (3) the quality, breadth, type, and distribution of outcomes of student learning over time. The quality of an evaluation depends on how well it connects these components into a research design and measurement of constructs, and carries out a chain of reasoning, evidence, and argument to show the effects of curricular use.

ESTABLISHING CURRICULAR EFFECTIVENESS

Defining scientific validity for individual studies is an essential element of assuring valid data about curricular effectiveness. However, curricular effectiveness cannot be established by a single scientifically valid study; instead, a body of studies is needed, which is the second key aspect of determining effectiveness. Curricular effectiveness is an integrated judgment based on interpretation of a number of scientifically valid evaluations that combine social values, empirical evidence, and theoretical rationales. This conclusion, that multiple methods of evaluation strengthen the determination of effectiveness, led the committee to recommend that a curricular program's effectiveness be ascertained through the use of multiple methods of evaluation, each of which is a scientifically valid study. Periodic syntheses of the results across evaluation studies should also be conducted. This is a general principle for the conduct of evaluations, in recognition that curricular effectiveness is an integrated judgment, continually evolving, and based on scientifically valid evaluations.

The committee further recognized, however, that agencies, curriculum developers, and evaluators need an explicit standard by which to decide when federally funded curricula (or curricula from other sources whose adoption and use may be supported by federal monies) can be considered effective enough to adopt. The committee therefore proposes a rigorous standard that programs should meet to be scientifically established as effective. Under this standard, the committee recommends that a curricular program be designated as scientifically established as effective only when it is supported by a collection of scientifically valid evaluation studies establishing that the implemented program produces valid improvements in learning for students, and when those studies convincingly demonstrate that the improvements are due to the curricular intervention. The collection of studies should use a combination of methodologies that meet the following criteria:

(1) content analyses by at least two qualified experts, a Ph.D.-level mathematical scientist and a Ph.D.-level mathematics educator (required);
(2) comparative studies using experimental or quasi-experimental designs that identify the comparison curriculum (required);
(3) one or more case studies investigating the relationships between implementation of the curricular program and the program components (highly desirable); and
(4) a publicly available final report that links the analyses, specifies what they convey about the effectiveness of the curriculum, and stipulates the extent to which the program's effectiveness can be generalized (required).

This standard relies on the primary methodologies identified in our review, but we acknowledge the possibility of other configurations, provided they draw on the framework and the definition of scientifically valid studies and include careful review and synthesis of existing evaluations.

In its review, the committee became concerned about the lack of independence of some of the evaluators conducting the studies; in too many cases, individuals who developed a particular curriculum were also members of the evaluation team, which raised questions about the credibility of the evaluation results. Thus, to ensure the independence and impartiality of evaluations of effectiveness, the committee also recommends that summative evaluations be conducted by independent evaluation teams that include neither authors of the curriculum materials nor persons under their supervision.
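To make the four-part standard above concrete, the sketch below checks a collection of evaluations against it. The record fields are invented for the example, and the "highly desirable" case-study criterion is reported but not treated as a hard requirement; this is a rough operationalization, not the committee's procedure.

```python
from dataclasses import dataclass, field

@dataclass
class Evaluation:
    """Hypothetical record for one study in a program's evidence base."""
    method: str                                     # "content", "comparative", "case", "report"
    expert_roles: set = field(default_factory=set)  # content: reviewer credentials
    design: str = ""                                # comparative: study design
    names_comparison: bool = False                  # comparative: names the other curriculum
    public: bool = False                            # final report: publicly available

def meets_standard(evals):
    """Check a collection of evaluations against the four criteria."""
    roles = set()
    for e in evals:
        if e.method == "content":
            roles |= e.expert_roles
    checks = {
        # (1) content analyses by a Ph.D.-level mathematical scientist
        #     and a Ph.D.-level mathematics educator (required)
        "content": {"mathematical scientist", "mathematics educator"} <= roles,
        # (2) an experimental or quasi-experimental comparative study
        #     that names the comparison curriculum (required)
        "comparative": any(e.method == "comparative"
                           and e.design in ("experimental", "quasi-experimental")
                           and e.names_comparison for e in evals),
        # (3) one or more case studies (highly desirable, not required)
        "case": any(e.method == "case" for e in evals),
        # (4) a publicly available final report (required)
        "report": any(e.method == "report" and e.public for e in evals),
    }
    required = ("content", "comparative", "report")
    return all(checks[k] for k in required), checks
```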
The committee also offers guidance on the design and conduct of evaluations, summarized under the following headings.

Representativeness. Evaluations of curricular effectiveness should be conducted with students who represent an appropriate sampling of all intended audiences.

Documentation of implementation. Evaluations should present evidence that provides reliable and valid indicators of the extent, quality, and type of the implementation of the materials. At a minimum, there should be documentation of the extent of coverage of curricular material (what some investigators referred to as "opportunity to learn") and of the extent and type of professional development provided.

Curricular validity of measures. At least one of the outcome measures used to determine curricular effectiveness should possess demonstrated curricular validity. It should comprehensively sample the curricular objectives in the course, validly measure the content within those objectives, ensure that teaching to the test (rather than the curriculum) is not feasible or likely to confound the results, and be sensitive to curricular changes.

Multiple student outcome measures. Multiple forms of student outcomes should be used to assess the effectiveness of a curricular program. Measures should consider persistence in course taking and drop-out or failure rates, as well as multiple measures of the variety of cognitive skills and concepts associated with mathematics learning.

Content analyses. A content analysis should clearly indicate the extent to which it addresses three dimensions identified in the report. In considering these dimensions, evaluators should provide specific evidence of each to support their judgments. A content analysis should be acknowledged as a connoisseurial assessment and should include the evaluators' credentials and statements of their preferences and biases.

Comparative analyses. As a result of our study of the set of 63 at least minimally methodologically adequate comparative analyses, the committee recommends that in the conduct of all comparative studies, explicit attention be given to criteria that include the following:

• Identify comparative curricula by name.

The committee recognized the need to strengthen the conduct of comparative studies in relation to the criteria listed above. It also recognized that much could be learned from the subgroup (n=63) identified as "at least minimally methodologically adequate." In fields in their infancy, evaluators and researchers must pry apart issues of method from patterns of results. Such a process requires one to subject the studies to alternative interpretation; to test results for sensitivity or robustness to changes in design; to tease out, among the myriad of variables, the ones most likely to produce, interfere with, suppress, modify, and interact with the outcomes; and to build on the results of previous studies. To fulfill the charge to inform the conduct of future studies, in Chapter 5 the committee designed and conducted methods to test the patterns of results under varying conditions and to determine which patterns were persistent and which were ephemeral. We used these analyses as a baseline to investigate the question: Does the application of increasing standards of rigor have a systemic effect on the results?

In doing so, we report the patterns of results separately for evaluations of NSF-supported and commercially generated programs, because the NSF-supported programs had a common set of design specifications: consistency with the National Council of Teachers of Mathematics (NCTM) Standards; reliance on manipulatives; topics drawn from statistics, geometry, algebra and functions, and discrete mathematics at each grade level; and strong use of calculators and computers. The commercially supported curricula sampled in our studies varied in their use of these curricular approaches; further subdivisions of these evaluations are also presented in the report. The differences in the specifications of the two groups of programs make their evaluative procedures, and hence the validation of those procedures, so unlike each other that combining them into a single category could be misleading.

One approach taken was to filter studies by separating those that met a particular criterion of rigor from those that did not, and to study the effect of that filter on the pattern of results, quantified across outcome measures as the proportion of findings that were positive, negative, or indeterminate (no significant difference).
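The filtering analysis just described amounts to comparing outcome-distribution proportions with and without a rigor criterion applied. The sketch below shows one hypothetical way to compute such a comparison; the sample data, finding codes, and rigor flags are placeholders, not the committee's actual data.

```python
from collections import Counter

# Each finding is coded "positive", "negative", or "indeterminate"
# (no significant difference); each study carries one or more findings
# plus flags for whichever rigor criteria it satisfies.
studies = [
    {"findings": ["positive", "indeterminate"],
     "rigor": {"names_comparison", "documents_implementation"}},
    {"findings": ["negative"], "rigor": {"names_comparison"}},
    {"findings": ["positive"], "rigor": set()},
]

def proportions(subset):
    """Proportion of positive/negative/indeterminate findings in a subset."""
    counts = Counter(f for s in subset for f in s["findings"])
    total = sum(counts.values())
    return {k: counts[k] / total for k in ("positive", "negative", "indeterminate")}

def apply_filter(criterion):
    """Compare the pattern of results before and after one rigor filter."""
    passing = [s for s in studies if criterion in s["rigor"]]
    return {"all_studies": proportions(studies),
            "filtered": proportions(passing)}

print(apply_filter("documents_implementation"))
```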
First, we found that on average the evaluations of the NSF-supported curricula (n=46) in this subgroup reported stronger patterns of outcomes in favor of the experimental curricula than did the evaluations of commercially generated curricula (n=17). Again, we emphasize that, given our call for increased methodological rigor and the use of multiple methods, this result is not sufficient to establish the curricular effectiveness of these programs as a whole with adequate certainty. It does, however, provide a testable hypothesis: a starting point for others to examine, critique, and pursue in further studies that confirm or disconfirm it.

Then, after applying the criteria listed above, we found that comparative studies of both NSF-supported and commercially generated curricula that used the more rigorous criteria never produced contrary conclusions about curricular effectiveness relative to less rigorous methods. Furthermore, when the use of more rigorous criteria did lead to significantly different results, those results tended to show weaker findings about curricular effects on student learning. Hence, this investigation reinforced the importance of methodological rigor in drawing appropriate inferences about curricular effectiveness.

Case studies. Case studies should meet a parallel set of criteria, detailed in the report.

The committee recognizes the value of diverse curricular options and finds continuing experimentation in curriculum development to be essential, especially in light of changes in the conduct and use of mathematics and technology. Such experimentation, however, should be accompanied by rigorous efforts to improve the conduct of evaluation studies, strengthening results by learning from previous efforts.

RECOMMENDATIONS TO FEDERAL AGENCIES, STATE AND LOCAL DISTRICTS AND SCHOOLS, AND PUBLISHERS

The report concludes with recommended actions at each of these levels, beginning with actions at the federal level.