What Do You Think?
Social Promotion Does Not Work
Jay Greene and Marcus Winters look at this issue using Florida's program as an example
An Evaluation of Florida's Program
to End Social Promotion
Jay P. Greene, Ph.D., Senior Fellow, Manhattan Institute for Policy Research
Marcus A. Winters, Research Associate, Manhattan Institute for Policy Research
Nine states and three of the nation's biggest cities have adopted mandates intended to end "social promotion": the practice of promoting students to the next grade level regardless of their academic proficiency. These policies require students in certain grades to reach a minimum benchmark on a standardized test in order to move on to the next grade. The nine states include Florida and Texas; the three cities are New York, Chicago, and Philadelphia; together, these school systems encompass 30% of all U.S. public-school students. Proponents of such policies claim that students must possess basic skills in order to succeed in higher grades, while opponents argue that holding students back discourages them and only pushes them further behind.
This study uses individual-level data provided by the Florida Department of Education to evaluate the initial effects of Florida's policy requiring students to reach a minimum threshold on the reading portion of the Florida Comprehensive Assessment Test (FCAT) to be promoted to the 4th grade. It examines the gains made in one year on math and reading tests by all Florida 3rd graders in the first cohort subject to the retention policy who scored below the necessary threshold, comparing them to all Florida 3rd graders in the previous year with the same low test scores, for whom the policy was not yet in force. Because some students subject to the policy obtained special exemptions and were promoted, the study also uses an instrumental variables regression analysis to separately measure the effects of actually being retained. The study measures gains made by students on both the high-stakes FCAT and the Stanford-9, a nationally respected standardized test that is also administered to all Florida students but with no stakes tied to the results.
The authors intend to follow the same two cohorts of students in future studies to evaluate the effects of this new policy over time. The findings of this study, evaluating Florida's program after its first year, include:
Low-performing students subject to the retention policy made gains in reading greater than those of similar students not subject to the policy by 1.85 percentile points on both the FCAT and the Stanford-9.
Low-performing students subject to the retention policy made gains in math greater than those of similar students not subject to the policy by 4.76 percentile points on the FCAT and 4.43 percentile points on the Stanford-9.
Low-performing students who were actually retained made gains in reading greater than those of similar students who were promoted by 4.10 percentile points on the FCAT and 3.45 percentile points on the Stanford-9.
Low-performing students who were retained made gains in math greater than those of similar students who were promoted by 9.98 percentile points on the FCAT and 9.26 percentile points on the Stanford-9.
School systems across the nation have recently enacted substantial new programs to stop schools from promoting students from grade to grade regardless of academic proficiency. To end this practice, known as "social promotion," several large school systems now require students in particular grades to demonstrate a benchmark level of mastery in basic skills by passing a standardized test before they can be promoted. These controversial mandates have been adopted by nine states, including Florida and Texas, as well as by New York City, Chicago, Philadelphia, and other cities.
Proponents of these programs think that schools do students no favor by promoting them to the next grade if they do not possess the skills necessary to succeed at a higher level. They argue that if a student lacks basic proficiency in reading concepts at the third-grade level, that student will certainly fail to grasp concepts intended for fourth-graders. On this view, once a student is promoted beyond his skills he will only continue to fall further behind as material becomes more difficult in later years.
While these arguments may be plausible, there is currently no research backing them up. Those opposed to ending social promotion, on the other hand, point to a wide body of research suggesting that students who are retained in a grade for an extra year are academically and emotionally harmed by the experience. Several studies have indicated that students who are held back have lower test scores and are more likely to drop out than similar counterparts who are not held back.
However, prior research on grade retention is severely limited by methodological problems that are unavoidable in evaluating retention policies based on subjective criteria (i.e., teachers' evaluations that students should be retained). Furthermore, it is questionable whether research on students who were retained according to subjective criteria is even relevant in the first place to retention policies based on objective criteria. For example, it is possible that the potentially harmful stigma currently associated with retention might not apply to the same extent under the new system, which holds back much larger numbers of students. It is certainly possible that retaining thousands of students according to their standardized test scores might influence student outcomes in ways far different from previous retention practices that singled out a very small number of students for retention based upon subjective criteria. New research looking directly at the effectiveness of test-score mandates intended to end social promotion is necessary in order for policymakers and the public to make informed decisions.
This study seeks to provide research to inform that debate by evaluating Florida's early experience with ending social promotion through standardized testing. Under a law passed by the state legislature, third-graders in Florida must score at the Level 2 benchmark or above on the reading portion of the state's high-stakes test, the Florida Comprehensive Assessment Test (FCAT), in order to be promoted to the fourth grade. Students who fail to reach this benchmark are given supplemental instruction and, unless they acquire an exemption, must repeat the third grade. The third-grade class of 2002–03 was the first affected by the mandate.
Our data were the individual test scores of all third-grade students who failed to reach the minimum benchmark on the FCAT reading test during the 2001–02 and 2002–03 school years. We examined the test-score gains made by students over one year after the point when they failed to reach the benchmark. We included third-graders who missed the benchmark in 2002–03, the year in which the new policy first took effect, as well as third-graders who missed the benchmark in the previous year, when the policy was not yet in effect.
We performed two analyses. Our first analysis measures the effect that being subject to the new program has on student achievement. It compares low-scoring third-graders in 2002–03, who were subject to the program, with low-scoring third-graders from the previous year, who were not. However, some students who were subject to the program received special exemptions that allowed them to advance to the fourth grade in spite of their test scores; thus, not all students who were subject to the new program were actually retained. We therefore performed a second analysis to measure the effect of actual retention, whether under the new program or under the old retention policy, on low-performing students. We used an instrumental variables method to compare students in both years who were actually retained with students in both years who were not.
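The logic of that instrumental-variables step can be sketched in a few lines. Everything below is an illustrative simulation, not the authors' data or code: cohort membership (born a year later, hence subject to the policy) stands in as the instrument for actual retention, the retention rates (60 percent and 8.7 percent) are taken from the study, and the "true" retention effect and the noise level are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical illustration: cohort membership (subject to the 2002-03
# policy = 1) instruments for actual retention, since the policy sharply
# raised retention rates (60% vs. 8.7%) but was assigned by birth year,
# not by anything about the individual student.
n = 5000
cohort = rng.integers(0, 2, n)                  # 0 = 2001-02, 1 = 2002-03
retained = (rng.random(n) < np.where(cohort == 1, 0.60, 0.087)).astype(float)
true_effect = 4.0                               # assumed gain from retention
gain = 1.0 + true_effect * retained + rng.normal(0, 10, n)

def ols(X, y):
    """Ordinary least squares coefficients via a least-squares solve."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Stage 1: predict retention from the instrument (cohort membership).
X1 = np.column_stack([np.ones(n), cohort])
retained_hat = X1 @ ols(X1, retained)

# Stage 2: regress test-score gains on *predicted* retention.
X2 = np.column_stack([np.ones(n), retained_hat])
beta = ols(X2, gain)
print(f"2SLS estimate of retention effect: {beta[1]:.2f}")
```

Because cohort assignment depends only on birth year, it strongly predicts retention but is plausibly unrelated to a student's unobserved characteristics, which is what lets the second stage recover the effect of retention itself rather than the effect of whatever led particular students to be retained. (A production analysis would also compute proper two-stage standard errors, which this manual sketch omits.)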
We find that the early stage of Florida's policy to end social promotion has improved academic proficiency. Our first analysis finds that low-performing students subject to the program made modest improvements in reading and substantial improvements in math compared with those made by low-performing students in the previous year's cohort who were not subject to the program because it had not yet taken effect. Our second analysis finds that the effect of actually being retained is even stronger. We find that low-performing students who were actually retained make relatively large improvements in reading and exceptional improvements in math compared with similarly low-performing students who were promoted.
The findings of this study are encouraging for the use of standardized testing policies to end social promotion, but they are also limited because we are only able to evaluate the effects of the first year of the program. It is certainly possible that the gains made by students affected by the program might not hold up later in their academic careers. On the other hand, it is also possible that, as proponents of the policy expect, the gap between students who were socially promoted and those who were retained might widen further as they enter higher grades and the material becomes even more challenging. Further research following these same groups of students will be necessary to track the effectiveness of Florida's retention program over time. For the time being, this study indicates that the use of objective testing to end social promotion leads to substantial academic gains for low-performing students, though we cannot yet determine how long these gains will persist.
That limitation notwithstanding, this study has important implications not only in Florida but nationwide. Since Florida's program is very similar to programs in eight other states, as well as New York, Chicago, and Philadelphia, that also focus on achievement on standardized reading tests, the results of our Florida analysis are likely to indicate the effects of the same policy in these other school systems. Much could be learned by directly measuring those other programs using the method employed in this study, but until such research has been performed, this study provides valuable information that can reasonably be applied to other existing programs and possible future programs as well.
Over the last several decades, many studies have examined the effect of retention on future student achievement. Most of these studies have found that student outcomes are negatively affected by retention. However, the quality of these studies is far lower than their quantity.
Holmes (1989) performed an often-cited meta-analysis of the research on grade retention. A meta-analysis is a study of studies: it analyzes the results of multiple previous studies on a particular research question-in this case, the effect of retaining students on future outcomes. The purpose of a meta-analysis is to empirically produce a single cumulative finding from a wide body of research.
In his meta-analysis, Holmes included 63 studies on grade retention with a total of 861 findings. Of the 63 studies he evaluated, 54 reported overall negative effects of grade retention. Among the studies that directly measured academic achievement, Holmes calculated a cumulative finding that retained students performed 0.19 standard deviation units below promoted students. He concluded that "the weight of empirical evidence argues against grade retention" (p. 28).
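The kind of pooling that produces such a cumulative figure can be illustrated as a weighted average of study-level effect sizes. The effect sizes and sample sizes below are entirely hypothetical, and Holmes's actual procedure was more elaborate; this is only a sketch of the basic idea.

```python
# Hypothetical study-level effect sizes (standardized mean differences,
# retained minus promoted) and sample sizes; not Holmes's actual data.
effects = [-0.30, -0.25, -0.10, 0.05, -0.22]
sizes   = [120, 80, 200, 60, 150]

# Sample-size-weighted mean effect: a simple fixed-effect pooling in
# which larger studies count for more.
pooled = sum(e * n for e, n in zip(effects, sizes)) / sum(sizes)
print(f"pooled effect: {pooled:.3f} standard deviations")  # about -0.174
```

A negative pooled value of this sort is what underlies summary statements such as "retained students performed 0.19 standard deviation units below promoted students," though, as the critics discussed below note, the pooled number is only as good as the individual studies feeding into it.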
While Holmes's meta-analysis is often treated as definitive, some researchers have pointed out serious flaws with his finding (Reynolds 1992 and Alexander, Entwisle, and Dauber 2003). They argue that the studies in Holmes's meta-analysis are not of high enough quality to support definitive conclusions.
The most serious limitation of previous research on retention is the lack of an adequate control group that can be compared with retained students. Reynolds, whose own study concludes that retention is harmful, points out that "only 25 of the 63 retention studies [included in Holmes's meta-analysis] used matched control group designs (matched prior to data analysis or statistically controlled). Only 16 studies matched students on prior achievement, and only 4 studies matched students on attributes that are consistently found to be predictive of the decision to retain" (p. 102).
So 38 of Holmes's 63 studies did not even have a control group with whose performance retained students were compared. And the other 25 studies, while they do compare retained students with other students, are drawing comparisons with students who are not adequately comparable to retained students.
The problem of finding an adequate control group that can be compared with retained students has not been easy to solve in previous studies. Some past researchers have made great efforts to develop adequate comparison groups, but these efforts have been rendered futile by the subjectivity of grade-retention decisions. In the past, the retention of a student has largely been the result of a teacher's subjective assessment of his ability to succeed at the next level. Therefore, we can expect that students who were retained are fundamentally different from students who were promoted, even if they are similar in all measurable factors such as race, because their teachers evaluated them as being fundamentally different. Further complicating matters is that assessments of students are likely to differ greatly not only among teachers but also among a single teacher's evaluations of various children. In previous studies, the students who were retained are simply not comparable with the promoted students with whom they are compared.
The existence of an objective retention policy in Florida allows for the development of an adequate comparison group not available in previous evaluations. Unlike previous studies, this study compares students who were subject to retention with other students who we know would have been subject to retention had they only been born a year later.
Reynolds also points out that most of the previous studies included in Holmes's meta-analysis have evaluated the effect of retention on white middle-class students in suburban or rural schools (p. 103). Such studies might tell us very little about the effects of retention on urban minority students, whom the new retention policies are most often targeted toward helping and who are, in fact, the most likely to be retained under them.
Previous research supporting retention policies has also suffered from methodological flaws. Many of the studies in Holmes's meta-analysis that find positive effects from retention make within-grade comparisons of students instead of within-age comparisons (see also Alexander, Entwisle, and Dauber 2003). Within-grade studies compare retained students with other students in their cohort grade after they are held back, while within-age evaluations compare retained students with students in their original class who were promoted to the next grade.
Alexander, Entwisle, and Dauber argue that within-grade comparisons are preferable because "comparing repeaters with children who have been exposed to a more advanced curriculum puts them at a decided disadvantage, and a same-age frame of reference almost preordains results that favor promotion" (p. 19). However, the only meaningful way to evaluate the effects of retention is to compare the academic achievement of retained students with an estimate of what their performance would have been had they not been retained. Only a within-age comparison can provide such an evaluation. Within-grade comparisons, even those that evaluate students several years after retention, fail to provide any information about the level at which retained students would have performed had they been promoted.
Even if all this previous research were of high quality, there are strong theoretical reasons to believe that these previous studies, which examine the effectiveness of retention based on subjective criteria, might have little relevance to programs intended to end social promotion by applying objective criteria for retention. If, as several of the researchers who have found retention to be harmful have hypothesized, a retained student performs worse because he feels excluded and thus inferior, then a policy that holds back thousands of students might dilute this sense of being singled out, limiting the psychological harm associated with retention. Also, subjective assessments of students are vulnerable to inappropriate influences, including teachers' prejudices and pressure brought by parents, in ways that objective criteria of performance might limit. Implementing objective standards, even if they are accompanied by subjective exemptions from those standards for some students, might significantly change the effects of retention in ways that previous research cannot anticipate.
Scholars at the Consortium on Chicago School Research have performed a series of evaluations of that city's objective program to end social promotion through testing (Nagaoka and Roderick 2004). Though Chicago's policy now includes several ways that a student can be promoted with low test scores, the cohorts of students examined by Nagaoka and Roderick in the third, sixth, and eighth grades were required to exceed benchmarks on the Iowa Test of Basic Skills (ITBS), a nationally respected and widely administered standardized test, in order to be promoted to the next grade. In the latest of these studies, conducted in 2004, Nagaoka and Roderick compared the performance of third- and sixth-grade students who scored just below the benchmark on the ITBS, most of whom were retained because of the mandate, with the performance of students who scored just above the benchmark, most of whom were promoted. Nagaoka and Roderick conducted two analyses, similar to the two analyses in our study: first, they measured differences in improvement between the two groups without accounting for whether each student was actually retained; second, they statistically adjusted for whether each student was retained or promoted. They were able to measure test-score performance for two years after the implementation of the program.
Nagaoka and Roderick found that, after two years of the retention policy, the policy had no effect on third-graders' performance on the ITBS reading test and a negative effect on sixth-graders' performance. However, while their study provides valuable evidence on the effectiveness of Chicago's retention program, it is limited by several factors.
The most important limitation of the Nagaoka and Roderick study is its method of comparing students in the same cohort who score just above and just below the benchmark. We know that the two groups of students are systematically different from each other because they performed differently on the ITBS, and this incomparability could fundamentally distort their results. Our study instead compares all third-grade students affected by the first year of Florida's social promotion mandate with all those in the prior year who would have been subject to the mandate had it been in force at that time.
Nagaoka and Roderick argue that doing such a comparison in Chicago would have been inappropriate because standardized test scores were rising throughout Chicago from year to year, especially in the grades that they were evaluating. Because fewer students received failing scores in the cohort that was subject to the new policy, they argue, their treatment group would have been biased: it would have been made up of students who still could not pass the benchmark despite the general improvements in Chicago. Unfortunately, the method that they do use incorporates a worse bias. Students from a single year prior to the implementation of the policy would likely be far more comparable with their treatment group than students in the same year who scored above the benchmark that their treatment group failed to reach.
Comparing similarly low-performing students in different cohort years is also preferable because it allows an evaluation of all students affected by the program. Because their study compares only those students just above and just below the ITBS benchmark, Nagaoka and Roderick's evaluation tells us nothing about the policy's effect on students whose scores were far below the necessary benchmark. Such very low-performing students, who made up about 50 percent of their cohort of sixth-graders in 2000, were simply excluded from their analysis.
The Chicago study is also limited because it evaluated performance only on the ITBS reading assessment. Nagaoka and Roderick argue that they used only reading because the vast majority of students were retained because of their reading scores, not their math scores. But it is quite possible, perhaps even probable, that students who were retained because of their reading scores might make larger gains in math than in reading compared with students who were promoted without the necessary skills. This might seem counterintuitive, given that such programs are usually portrayed as being targeted at improving literacy, not numeracy; indeed, similar programs elsewhere, such as Florida's, require students to pass only the reading assessment to be promoted. Nonetheless, retained students might make greater academic progress in math because learning in math is more cumulative than learning in reading. For example, if a student does not adequately learn addition, he will be particularly unlikely to adequately learn multiplication, because understanding the latter requires mastery of the former. While learning in reading may have a similar cumulative character, it is likely not as pronounced as in math.
Florida's Program to End Social Promotion
Over the last several years, Florida has attempted substantial reforms to its struggling school system, which consistently ranks close to the bottom on nearly all academic indicators. In May 2002, the legislature decided to focus its attention on the problem of social promotion (the practice of promoting students to the next grade level independent of their academic proficiency) at the end of the third-grade year. Those opposed to social promotion argued that students who leave third grade without reaching a certain minimum benchmark of basic skills will fail to adequately grasp the more difficult curriculum of fourth grade. They further argued that the gap between students with and without basic skills would continue to grow as material became progressively harder over time, because socially promoted students would lack the foundation on which to build their body of knowledge. They claimed that students who cannot read at a proficient level at the end of third grade would benefit in both the short and long run by retaking the same material instead of moving to a higher grade with more difficult material.
Florida revised its school code to require third-grade students to score at the Level 2 benchmark or above on the reading portion of the FCAT, which was already used throughout the state as a high-stakes standardized test, in order to be promoted to the fourth grade. By requiring that all students possess at least the basic proficiency necessary to succeed at the next grade level, the reformers hoped that ending social promotion would lead to great academic gains. The third-grade class of 2002–03 was the first to be affected by the law.
The law allowed for some exceptions to the retention policy. A child who misses the FCAT benchmark can be exempted from the policy and promoted to fourth grade if he meets any one of the following criteria: 1) he is a Limited English Proficiency student who has received less than two years of instruction in an English for Speakers of Other Languages program; 2) he has a disability sufficiently severe that it is deemed inappropriate for him to take the test; 3) he demonstrates proficiency on another standardized test; 4) he demonstrates proficiency through a performance portfolio; 5) he has a disability and has received remediation for more than two years; or 6) he has already been held back for two years. Of third-grade students in 2002–03 who scored below the Level 2 threshold and were thus subject to retention under the new policy, 21.3 percent were reported as having received one of these exemptions.
Florida's policy is similar to those recently enacted by other large school systems. As in Florida, all third-grade students in Texas must pass the reading portion of that state's standardized test to be promoted to the fourth grade. Seven other states have adopted similar policies in various grades. New York City has a similar reading mandate for third-graders and has recently expanded its mandate to require fifth-grade students to pass a standardized test to earn promotion as well. Chicago uses whether students pass the reading and math sections of the Iowa Test of Basic Skills in the third, sixth, and eighth grades as a strong component of retention decisions. Philadelphia is now adopting mandatory exams for promotion in grades three through eight. These nine states and three cities enroll a full 30% of all students in public schools in the U.S.
Our data include low-scoring students from two school years. First, we include all Florida students who entered the third grade for the first time in 2002–03 and scored below the Level 2 threshold on the FCAT reading test in that year. This was the first cohort of students in the state subject to the policy requiring them to pass the FCAT reading test in order to be promoted. Our study includes all students who did not pass the FCAT reading test; however, because exemptions from the new policy were available, many of the students we include were not actually retained. Of the third-graders in 2002–03 included in our study for whom we have the test scores necessary for our analysis, 60 percent were actually retained. Second, we include all students who entered third grade for the first time in the 2001–02 school year and who also scored below Level 2 on the FCAT reading test. These students had test scores that would have made them subject to the new policy's retention mandate had it been in effect in that year. Of the third-graders in 2001–02 included in our study for whom we have the test scores necessary for our analysis, 8.7 percent were retained. The students from the two school years are very similar in all respects except the year in which they happened to have been born, making comparisons between their improvements particularly meaningful.
We analyzed the one-year test-score gains that students made on state-mandated math and reading tests. The existence of developmental-scale scores on each of the tests allows us to compare the test-score gains of all the students in our study even though they took different tests designed for different grade levels. Developmental-scale scores are designed to measure academic proficiency on a single scale for students of any grade and in any year. For example, a third-grader with a developmental-scale score of 1,000 and a fourth-grader with a developmental-scale score of 1,000 have the same level of academic achievement; if a student gets a developmental-scale score of 1,000 in 2001–02 and gets the same score of 1,000 in 2002–03, this indicates that the student has not made any academic progress in the intervening year.
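The gain calculation that the developmental scale permits is simple subtraction across years. The sketch below uses hypothetical scores: the identical 1,000-point scores echo the no-progress example in the text, while the 1,150 is invented for contrast.

```python
# Hypothetical developmental-scale scores for two students; not actual
# FCAT data. Scores are on one scale across grades and years, so a
# year-over-year difference is a valid measure of academic progress.
scores = {
    "student_a": {"2002-03": 1000, "2003-04": 1150},  # made progress
    "student_b": {"2002-03": 1000, "2003-04": 1000},  # no progress
}

def one_year_gain(record, year1, year2):
    """One-year gain on the developmental scale, comparable across grades."""
    return record[year2] - record[year1]

print(one_year_gain(scores["student_a"], "2002-03", "2003-04"))  # 150
print(one_year_gain(scores["student_b"], "2002-03", "2003-04"))  # 0
```

Because the scale is common to all grades, this subtraction is meaningful even when the two administrations are of tests designed for different grade levels, which is what allows retained third-graders and promoted fourth-graders to be compared on the same footing.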
We analyzed the improvements made by students over one year in math and reading scores on the criterion-referenced as well as norm-referenced versions of the FCAT. Both of these are standardized tests that Florida students are required to take. For purposes of clarity, throughout the rest of this study we will follow widespread practice and refer to the criterion-referenced version of the test as the "FCAT" and the norm-referenced version as the "Stanford-9."
Each year, all Florida students in grades three through ten take both the FCAT and the Stanford-9. All students in grades three through ten take both the math and reading sections of both tests each year; other subjects are also tested intermittently. The reading portion of the FCAT is the test that third-grade students must pass in order to be promoted. There are other high stakes tied to the results of the FCAT as well. Every year, the state grades each school from A to F, based primarily on its students' performance on the FCAT. The Stanford-9 is a highly respected standardized test that is frequently administered by states and school districts across the nation. Florida does not attach meaningful stakes to the results of the Stanford-9, as it does to the FCAT. The Stanford-9 is administered to help parents better understand their children's proficiency levels and has been used by researchers and reporters to check the reliability of the results of the high-stakes FCAT (see Greene, Winters, and Forster 2002 and Harrison 2004).
The existence of the Stanford-9 is particularly helpful for our analysis. Several researchers argue that the results of high-stakes tests like the FCAT are routinely distorted because they create adverse incentives for teachers and school systems to manufacture high scores either by "teaching to the test"-changing curriculum and teaching practices in such a way as to raise test scores without increasing real learning (for example, see Amrein and Berliner 2002, Klein et al. 2000, McNeil and Valenzuela 2000, Haney 2000, and Koretz and Barron 1998)-or by outright cheating (for example, see Cizek 2001, Dewan 1999, Hoff 1999, and Lawton 1996). The absence of any substantial consequences attached to the results of the Stanford-9 helps to remove these concerns for our analysis. Since there are no meaningful stakes tied to it, there is no particular incentive for teachers or school systems to attempt to manipulate its results. Thus, if we find similar results on both the high-stakes FCAT and the low-stakes Stanford-9, we can be confident that our findings indicate improvements in real learning and are not distorted by adverse incentives created by high-stakes testing.
With the cooperation of the Florida Department of Education, we obtained individual student-level test scores on the math and reading sections of the FCAT and Stanford-9 for the entire population of students in the state of Florida who met the necessary criteria to be part of our study. We obtained test-score and demographic information for all students in the state of Florida who first entered third grade in 2001–02 and scored below the Level 2 threshold on the FCAT reading test in that year, as well as for all Florida students who entered the third grade in 2002–03 and scored below Level 2 on the FCAT reading test in that year. The developmental-scale scores required to reach Level 2 on the FCAT reading test were consistent for each year's cohort. For each student in our analysis, we also collected data on race, free or reduced-price school lunch status, and whether the student was considered Limited English Proficient.
We calculated the developmental-scale-score gains on the FCAT and Stanford-9 made in each student's first third-grade year and the following year. For the students affected by the retention policy, we measured the test-score gains they made between the 2002–03 and 2003–04 administrations of the tests. For students who were not affected by the program, we measured their test-score gains between the 2001–02 and 2002–03 administrations of the tests. For students in each group, our calculations of test-score gains were independent of whether the student was administered the third-grade test (indicating that the student was retained) or the fourth-grade test (indicating that the student was promoted).