Testimony of Robert J. Tobias concerning NYC/NYS 2005 ELA Test Scores

Testimony to the New York City Council Committee on Education
By Robert J. Tobias
June 27, 2005

Honorable Chair and Members of the City Council Committee on Education:

From the moment that the scores on the state ELA tests at grades 4 and 8 were released in mid-May, I received numerous phone calls asking me the same questions: What do you make of these test scores? Can the gains be true? Can we believe the results? What do you think happened? I'm asked these questions because people are mystified by large-scale, standardized tests and the arcane ways in which they are scored and reported. The accuracy of the test results has been rendered suspect by the scoring errors and cheating scandals of the recent past and the hype and spin that have surrounded the annual ritual of the release of a blinding array of numbers and statistics. People seek my opinions because, for many years, I was in the middle of the maelstrom.

I served as the Executive Director of the New York City public schools' Division of Assessment and Accountability for 13 years, retiring at the end of 2001 to take a position as a clinical professor of teaching and learning at NYU's Steinhardt School of Education. As Executive Director, I had overall responsibility for the design, administration, scoring and reporting of the state and city testing programs. I worked with test publishers and psychometricians in the development of the tests and with school personnel on matters of test security, and I was responsible for explaining why test scores went up or down annually to seven Chancellors, the press and the public. In carrying out these responsibilities over those 13 years, I developed an intimate and detailed knowledge of the factors that affected the test performance of students and schools, and the limitations of these instruments for answering the big questions about whether our students are learning more from one year to the next and why.

I was in charge of a very capable staff of professional educators, researchers and psychometricians at the Division of Assessment and Accountability. The Division continues to set a national standard of excellence in the field of student assessment under the outstanding leadership of Lori Mei. Despite the efforts of these talented professionals, we are left scratching our heads trying to make meaning out of the torrent of data that has been released. This situation is more the rule than the exception when trying to infer the meaning of standardized test results. I believe this conundrum is the result of an unhealthy over-reliance on testing as a facile tool for educational reform and political advantage. In so doing, we have ignored the limitations inherent in large-scale assessment and we threaten to undermine the validity of test scores as evaluation and accountability measures. Standardized tests are tools designed to measure a construct. That construct may be mathematics knowledge and skill, reading achievement, or mastery of state learning standards. In our zeal to raise test scores we have forgotten that the test is a measure of learning and reified the test score to the status of learning itself. The test score has become the coin of the realm and raising scores through any means has become the Holy Grail. However, we must remember that there are many factors that can result in higher test scores. In order for higher test scores to be valid indicators of increased learning of the underlying construct, the factors responsible for the gains must not violate the conditions for valid measurement. Those conditions are related to the theory that is the foundation for the use of tests to measure learning.

The test questions are a sample of all the skills and knowledge that comprise the learning standards being tested. Since it is impossible to examine a student's mastery of all the content knowledge and skills required by the learning standards in all contexts, the test is designed so that one can make a reasonably accurate inference about the student's mastery of the standards from the individual's performance on the test questions. The fidelity of these inferences is known as the validity of the test. The recently released state and city test scores have been used to make two principal inferences: first, that students are learning more, and second, that the increased learning is attributable to new policies and practices that have been instituted in the NYC public schools. While those inferences may ultimately be proven true, I believe that it is premature to accept them as valid pending a full examination of the possibility that other factors may have caused the increase in scores.

First, let's consider the inference that the gains in test scores indicate that students have increased their mastery of the state learning standards in English language arts and mathematics. Before this proposition can be accepted, at least four possible rival explanations for the gains must be investigated. The gains may be the result of excessive test preparation that does not increase the subject-matter knowledge of the students, changes in the population of tested students that excluded more low performing students this year than last, changes in the content of the tests this year compared to last, or changes in the scaling and scoring of the tests. Obviously, the NYS and NYC Departments of Education and the test publishers are in a much better position to assess the influences of these factors than an outsider who does not have access to the information required for direct analysis. These influences can be subtle and difficult to detect even for those who possess the information. Typically, even the proprietors of data must resort to examining circumstantial evidence in trying to determine the validity of their inferences. Indeed, there appears to be some circumstantial evidence that factors other than improved student learning may be at play in this year's rise in test scores; at least enough evidence to warrant further investigation.

First, an unprecedented increase in test preparation has been widely reported, including the adoption of a new program of interim testing by the NYCDOE. Much of this test preparation is not designed to increase student learning but rather to try to beat or "game" the test. It may even serve to narrow learning by focusing instruction on the sample of content and formats used on the test rather than the broad and deep knowledge required by the standards. Thus, some of the improvement in scores may be because students have become better test takers rather than better learners.

Second, a change in the state policy for exempting English language learners from English language arts testing and the city's new promotion policy at grade 3 appear to have reduced the numbers of low performing students tested this year as opposed to last, particularly on the state grade 4 ELA test. While the overall effect of these exemptions is not large, there is some evidence that they did serve to elevate the scores of certain districts and schools. A preliminary analysis of the data indicates that the NYC districts that showed the highest gains on the grade 4 ELA had the most students retained in third grade and the largest reduction in students tested in grade 4 this year compared to grade 3 last year.
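The arithmetic behind this effect can be sketched in a few lines. The counts below are hypothetical and chosen only for illustration; they are not drawn from any district's data. The sketch assumes that the students removed from the tested population would all have scored below level 3.

```python
def percent_meeting_standards(meeting: int, tested: int) -> float:
    """Percent of tested students scoring at level 3 or 4."""
    return 100.0 * meeting / tested


# Hypothetical school: 1,000 students tested, 500 at levels 3 or 4.
before = percent_meeting_standards(meeting=500, tested=1000)

# Same school, same 500 students meeting standards, but 50 low performing
# students are no longer tested (exempted or retained in grade 3).
after = percent_meeting_standards(meeting=500, tested=950)

print(f"before: {before:.1f}%  after: {after:.1f}%  change: {after - before:+.1f} points")
```

In this hypothetical case the reported percentage rises by more than two points even though no student has learned anything more.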

The last factors, changes in the content of the test and the subtle influences of scaling, can only be investigated through expert comparison of the content and formats of the tests for 2004 and 2005 and an examination of the procedures and data used for scaling. In order to maintain test security, the test items are changed each year. Scaling is the procedure that is used to equate the scores for different forms of the tests so that scores from one year can be compared to those from other years. The raw score, or the number of correct answers, is converted into a scaled score which is used to determine the student's performance level. Scaling is a complex process performed by the publisher and it is an area that has resulted in scoring anomalies in the past. In 1999, 2001, and 2002, scaling anomalies resulted in the systematic under-scoring of NYC's ELA tests at several grades and there is evidence that scaling anomalies resulted in over-scoring in 2000. Scaling anomalies have been documented in state and local assessment programs throughout the nation.
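A minimal sketch of the raw-score to scaled-score to performance-level chain may make the sensitivity clearer. Everything in it is an assumption made for illustration: the conversion tables, the cut scores, and the function names are hypothetical and do not reproduce the publisher's actual procedure, which is considerably more complex.

```python
from bisect import bisect_right

# Hypothetical minimum scaled scores for performance levels 2, 3 and 4.
CUT_SCORES = [600, 650, 700]

# Hypothetical raw-to-scaled conversion tables for two years' test forms.
# Only a few raw scores are shown.
TABLE_YEAR_1 = {20: 640, 21: 648, 22: 655}
TABLE_YEAR_2 = {20: 645, 21: 652, 22: 660}


def scaled_score(raw: int, table: dict[int, int]) -> int:
    """Look up the scaled score assigned to a raw score on a given form."""
    return table[raw]


def performance_level(scaled: int, cuts: list[int] = CUT_SCORES) -> int:
    """Return level 1-4: one plus the number of cut scores at or below `scaled`."""
    return 1 + bisect_right(cuts, scaled)


raw = 21  # the same number of correct answers in both years
for label, table in (("year 1", TABLE_YEAR_1), ("year 2", TABLE_YEAR_2)):
    s = scaled_score(raw, table)
    print(label, "raw", raw, "-> scaled", s, "-> level", performance_level(s))
```

With these hypothetical tables, the same raw score of 21 falls at level 2 in one year and level 3 in the other. If the second form really was harder, the difference is a legitimate correction; if it was not, the difference is an over-scoring anomaly. The reported results alone cannot distinguish the two cases, which is why the conversion tables themselves need to be examined.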

Next, let us consider the tenability of the inference that the gains in test scores are attributable to recent reforms in policies and practices enacted in the New York City public schools. Naturally, for this inference to be valid the first inference that the gains reflect real increases in learning must also be true. Researchers typically examine the validity of the assertion of positive effects for a policy or practice by comparing the test score gains of schools and school districts subject to the reforms with those that are not. The state ELA test results at grades 4 and 8 provide a natural opportunity for such comparisons. The tests are scored in two ways. First, there are scaled scores that range from about 400 to over 800, and reflect the students' overall achievement on the test. The second score is the performance level, ranging from 1 to 4, with level 3 signifying that a student has met state standards. On the performance-level score for grade 4, NYC students showed a large gain of nearly 10 percentage points in the percent scoring in levels 3 or 4. However, most of the other school districts, which were not affected by the city's reforms, showed gains that were close to those for the city. The other large cities showed a gain of nearly 11 percentage points and the gain for the state overall was 8. The state's charter schools and non-public schools also showed similar gains. On the other measure, the mean scaled-score gain in grade 4 in NYC was 8, compared to a mean gain of about 9 for the state overall. School districts in every state need/resource category showed gains ranging from 8 to 11 scaled-score points. On the state grade 8 ELA test, NYC declined by about 3 percentage points, while the rest of the state rose by about the same amount. Accordingly, if policies and practices caused these gains, the policies and practices instituted in districts outside the city and in charter and non-public schools were as effective as or more effective than those of the NYCDOE.
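The comparison described above amounts to subtracting each comparison group's gain from the city's gain. The sketch below uses the grade 4 performance-level gains quoted in this testimony, rounded ("nearly 10" is entered as 10.0 and "nearly 11" as 11.0); the function name and the data structure are illustrative assumptions, not part of any official analysis.

```python
# Approximate grade 4 ELA gains, in percentage points scoring at levels 3 or 4,
# as quoted in the testimony (rounded values).
GRADE4_GAINS = {
    "NYC": 10.0,
    "Other large cities": 11.0,
    "State overall": 8.0,
}


def gains_relative_to_comparisons(gains: dict[str, float], group: str = "NYC") -> dict[str, float]:
    """Return the group's gain minus each comparison group's gain."""
    return {
        name: gains[group] - gain
        for name, gain in gains.items()
        if name != group
    }


for name, diff in gains_relative_to_comparisons(GRADE4_GAINS).items():
    print(f"NYC gain minus {name} gain: {diff:+.1f} points")
```

A difference near zero, or a negative one, means the city's gain is not distinguishable from gains in jurisdictions that did not adopt the city's reforms. That is the pattern the figures above show.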

In addition to the state ELA test scores, we have the results of NYCDOE's own city-wide tests in ELA and math at grades 3, 5, 6 and 7. It is difficult to make inferences about the factors responsible for the rise in city-wide test scores this year because there are no comparative data from non-city jurisdictions. While scores on the city-wide math tests have shown a steady trend of improvement over the past four years, the ELA scores showed dramatic gains in 2005, especially in grade 5, after stagnating since 2000. However, the charter schools in the city, which operate independently of city policies, showed gains that in many cases were even greater than the large gains made by NYCDOE schools on these tests. In fact, the charter schools showed an overall increase of 21 percentage points in grade 5 compared to 19.5 percentage points for the NYCDOE schools. These patterns warrant further investigation before meaningful inferences concerning the causes of the gains can be asserted with validity. This additional investigation might include a focus on the following questions:

What procedures were used to scale and equate the city-wide tests this year and last?
How many anchor items were used to verify the calibration of the scaling of the items?
How were the items selected for the tests?
How did the items map onto the blueprint for the test in both years?
What were the relationships between raw scores and scaled scores this year and last?
How many students were exempted from testing this year and last and for what reasons?
What were the patterns of longitudinal gains for students using a repeated-measures analysis?
Were there relationships between a school's level of implementation of the NYCDOE reforms and their gains in test-score performance?

Answering these and other important questions will take time and resources, but will pay dividends in understanding the reasons for the gains in test scores and will yield information that is integral to using the data for systemic planning. NYCDOE might also consider assembling an independent group of experts to oversee and audit this analytic work. An independent body will provide a fresh, disinterested perspective that will increase the credibility of the inferences that will be made from these data.

Respectfully Submitted,
Robert J. Tobias,
Clinical Professor and Director,
Center for Research on Teaching and Learning
NYU Steinhardt School of Education
