The NYT looks at scoring the PARCC tests

For the New York Times, veteran education reporter Motoko Rich went to a Pearson scoring site and talked with people about how they score the tests for PARCC. As I am in this industry and have spent many a month in that very facility in San Antonio, Texas, I have refrained from comment. However, other bloggers have picked it up, so it’s time for us to address the article’s central thesis: Teachers are better at scoring standardized tests than non-teachers.

My response, without hesitation, is that teachers aren’t any better at scoring the standardized tests used in our schools today under No Child Left Behind than non-teachers are. When it comes to other tests that aren’t used for accountability purposes, such as AP, teacher certification, and college admissions, having spent a part of your career teaching kids the material you’ll be reading can give you insight into the nuances of their responses. But the questions on tests used for accountability purposes are designed to measure whether students have achieved a certain minimum standard, not how far beyond that standard they can go.

Ms Rich seems to support a different view in the headline of her otherwise excellent article: “Grading the Common Core: No Teaching Experience Required.” Much support for the argument that teachers should be required to score PARCC tests comes from how Ms Rich deliciously describes the process:

On Friday, in an unobtrusive office park northeast of downtown here, about 100 temporary employees of the testing giant Pearson worked in diligent silence scoring thousands of short essays written by third- and fifth-grade students from across the country. … There was a onetime wedding planner, a retired medical technologist and a former Pearson saleswoman with a master’s degree in marital counseling.

Some teachers question whether scorers can grade fairly without knowing whether a student has struggled with learning difficulties or speaks English as a second language. … Experienced teachers also say that some students express themselves in ways that might be difficult for noneducators to decipher.

“Sometimes students say things as a student that as a teacher you have to interpret what they are actually saying,” said Meghann Seril, a third-grade teacher at Broadway Elementary School in Venice, Calif., whose students took the Smarter Balanced test this year. “That’s a skill that a teacher needs to develop over time, and as a grader, I think you need to have that as well.”

I have highlighted key phrases that reveal a few misconceptions on Ms Rich’s part about how this works and about how it is supposed to work, and I’ll discuss them one at a time, explaining why I think each reveals a misconception. But first, I have to note that this response is not about the quality of the test itself, nor is it about how we are planning to use the results of these tests. I have serious misgivings about both the quality of the PARCC tests and the way some states have planned to use the results.

For example, citing a now-famous report last year from the American Statistical Association, Fair writes: “Basing teacher evaluations on inadequate standardized tests is a recipe for flawed evaluations. Value-added and growth measures are only as good as the exams on which they are based (American Statistical Association, 2014). They are simply a different way to use the same data. Unfortunately, standardized tests are narrow, limited indicators of student learning. They leave out a wide range of important knowledge and skills. Most states assess only the easier-to-measure parts of math and English curricula (Guisbond, et al., 2012; IES, 2009).”

Rather, we look here at just the scoring of those tests, however narrow, poorly written, or long those tests may be, and however misguided our use of student scores from those tests may be.

Some teachers question

The excerpt I selected came from very near the top of the article, and when I read weasel words like “some teachers,” I start thinking, “How much more of this am I going to read? Let’s see, how long is it? Well, OK, let’s give it a fair shake.”

Ms Rich does quote some teachers she obviously talked to, but we have no idea whether they are a representative sample of what teachers believe, whether they represent a majority of teachers, or whether they represent just a few. That is, the premise upon which the entire article is based may not even be newsworthy. I’m glad the Times decided to focus on something as important as standardized testing, but I’ve never considered this particular question newsworthy.

It’s just one potential error in judgment, though, so we proceed.

Can grade fairly

This phrase has just three little words in it, and every one of them has a problem. Let’s work in reverse. Nobody really knows what it means to grade “fairly.” As a seasoned journalist, Ms Rich should have applied her own experience here: People don’t want “fair” coverage of news stories; they want their own viewpoints reported as news. That means what’s “fair” for one person or organization will be “unfair” to another. And presenting both sides of a debate for the sake of appearing balanced can invalidate, or at least obscure, the true analysis and value of the news itself.

In education, we consider a question and the whole test “fair” if all students taking it have the same opportunity to demonstrate achievement at a certain level. That may seem like a narrow definition, but most of the money for developing and administering tests is spent on making them “fair.” We provide, for instance, a copy of the tests in Braille if needed to ensure that blind students have the same opportunity to succeed on it as sighted students. The Braille versions cost our states lots of money to produce, which is disproportionate to the number of kids who need the Braille test books, but states are required to ensure fairness of the test.

That example of the fairness of the tests doesn’t really address fairness in the “scoring” of the test, however. Let’s consider the Braille booklet. What the people in San Antonio would see from kids who used the Braille version of the test would be either typed responses, which look like the typed responses of sighted students, or transcribed responses for the essay questions, which look like adult handwriting but are otherwise indistinguishable from the responses of sighted students. By training scoring personnel that handwriting plays no part in the score a student receives, we ensure scoring fairness for blind students. Whether the questions themselves are fair for blind students is a completely different question and one worth looking into. But responses are certainly scored fairly for blind students.

Likewise for other groups: Scoring personnel are trained to avoid biases that untrained personnel might fall into without realizing it. We discussed central tendency bias in a 2009 article. We also point out biases that sometimes occur due to regional differences. For example, kids from Maryland know what a “waterman” is, and kids from Illinois don’t. Using that word on a PARCC test would, therefore, make a question unfair to kids from Illinois, and failing to make sure scoring personnel from Texas know what the word means could potentially bias their scoring against kids from Maryland, whose responses would be meaningless to them. But personnel receive extensive training on how to deal with regional differences in student responses, which we hope makes the scoring “fair” for kids from all regions where the PARCC tests were administered.

Other examples of bias include an assumption that longer responses or responses that fill more of the page actually say more relevant stuff. Since long responses are sometimes just repetitive, sometimes rambling, but sometimes substantive, scoring personnel are trained to ignore length when assigning scores to student responses. I am surprised that Ms Rich would bring in the subject of fairness in scoring so early in the article and not address scoring biases, the basis of unfairness or unequal application of scoring guidelines, anywhere else in her extensive article.

Next, the beginning of this phrase: can. Please understand, I’m not prejudiced against teachers, but being a teacher in no way confers any additional human ability, including scoring ability, on a person. Anybody can do this, given proper training. Sometimes that training doesn’t succeed, and Ms Rich correctly points out that a significant number of people who sign up to score the tests can’t demonstrate sufficient mastery of the scoring task. These scorers are dismissed, so they never put any scores on actual tests. The expense of training them unsuccessfully, without getting any productive work out of them, is Pearson’s to bear. And scoring requires weeks of focused concentration on student answers that all begin to look the same after a few hours, a kind of patience and skill that some teachers and non-teachers alike just don’t have.

Finally, the word “grade” has a different meaning from the word “score.” In 2006, we pointed out significant differences between scoring and grading, and we refer you to that document for a more complete discussion. The primary difference between scoring and grading is that we assign grades to students we know but assign scores to responses we read. Standardized tests and the essay questions on them are not “graded”; they are “scored.” Let’s remember that distinction when we write about these tests.

This is why classroom teachers sometimes flunk out at scoring sites like the one in San Antonio: they can’t stop grading. In grading students, teachers act as detectives to find out what students know but don’t always say. In scoring essays, scorers act as detectives to find out what students say regardless of what they know.

The difference between scoring and grading is analogous to a journalistic activity that millions of New York Times readers know. When sports writers rank the Top 25 College Football Teams this fall, they’ll be assigning a “grade.” “Ohio State is a better football team than Michigan,” they’ll be effectively saying. Michigan may be ahead in certain stats and may even win a game occasionally against Ohio State, but these statistics aren’t subject to judgment, interpretation, or selectivity by individual journalists, a quality those stats share with “scores” on essay questions.

Sometimes stats or scores give us a valid snapshot of what a team’s overall strength is or what a kid’s “grade” should be, and sometimes they don’t. Some stats and scores are more valuable than others when it comes to determining a team’s “rank” or a student’s “grade.” Journalists can consider whether a team has suffered from injuries, which artificially deflates certain stats, but no matter how much they think a team should have had different stats, those journalists can’t change the numbers. All they can do is adjust how much consideration they give those numbers, and that sometimes riddles their rankings with bias but often paints, in their judgment, a more accurate picture of a team’s ranking. Teachers also take into account differences in individual students when “grading” and introduce biases that may or may not help students succeed.

Teachers who are unable to take off their grading hats and put on their scoring hats often make assumptions about what kids probably know. The reality is that the kid whose paper is in front of them probably doesn’t even live in their state, has never been in their classroom or benefited from their excellent lessons, and in all likelihood doesn’t actually “know” all the things a classroom teacher would assume he or she knows.

Teachers can “grade” students based on what they know about those students and on what they know those students have been taught, probably in their own classrooms, since teachers mostly grade their own tests. Scorers, however, don’t know anything about the students. They don’t know what a teacher has taught, or tried to teach, them. They don’t know what proficiencies or deficiencies the students have. They don’t know, basically, what else a student knows. All they know about the student is what the student writes on the page, so that’s all they can score. True grading would incorporate lots of other knowledge about students that, for better or worse, simply cannot be demonstrated on an anonymously scored standardized test. And that’s the point: it ensures fairness, or at least scoring that isn’t based on any prejudices scorers may have about individual students or student groups.

Student has struggled with learning difficulties

As with blindness, knowledge that a student has struggled with learning difficulties would potentially bias the scoring. When assigning scores, scorers are trained to compare what the student writes with scoring guidelines, or rubrics. Then, they apply the same guidelines to every response, regardless of that student’s history of learning disabilities. It would be a disaster if a parent presented an essay that received a certain score alongside a much lower-quality essay that received a higher score because a teacher said, “We took into consideration that the student who received the higher score suffers from a learning disability.”

Furthermore, knowledge that a kid is struggling or has struggled with learning disabilities is potentially sensitive information. Privacy advocates would have a fit if this information were passed along to scorers in Texas.

Now, that would be news if it happened on a standardized test being used for federal accountability purposes. Kids with learning disabilities are expected to get lower scores. The inclusion of this suggestion so close to the top of an article in the paper of record makes me further question the value of reading on. Ms Rich has not considered the alternative, and she doesn’t investigate it anywhere else in the article: If scorers knew that a student whose paper they were scoring suffered from learning difficulties, we would have an unfair application of scoring guidelines. That is, in many places, against the law, and it would certainly violate the contract terms between Pearson and the states in PARCC.

English as a second language

Taking into account that a student has what are known as “second language indicators” in his writing, such as omission of articles, incorrect verb phrases, or word order issues, would also lead to bias in scoring. On math and reading comprehension test questions, therefore, scorers are explicitly trained to ignore second language indicators and score based on meaning, as long as that meaning is clear enough for scorers to determine with reasonable effort. On writing tests, however, second language indicators are scored, and they usually result in a lower overall score on specific questions.

For PARCC, the scoring rubric assigns a lower score if “The student response to the prompt demonstrates limited command of the conventions of standard English at an appropriate level of complexity. There may be errors in mechanics, grammar, and usage that often impede understanding.” A high score is assigned if “The student response to the prompt demonstrates full command of the conventions of standard English at an appropriate level of complexity. There may be a few minor errors in mechanics, grammar, and usage, but meaning is clear.” Kids whose writing contains several second language indicators throughout will get lower scores for writing. This language is not part of any math rubric, though.

Students … express themselves

This is further evidence that Ms Rich doesn’t understand the difference between scoring and grading. We discussed it above, but we further point out that every student’s response must be scored according to the same guidelines. Scorers can’t take into account differences in the way students express themselves; they must look simply at the content of what the student wrote. They are trained in what to do when they encounter a response written in such a way that they can’t be sure how to apply the scoring guidelines.

Difficult for non-educators to decipher

Pearson has educators on staff, to be sure. There are also significant levels of leadership in the scoring rooms shown in the photographs in Ms Rich’s article. If the scorer has difficulty deciphering part of a response, the response is shuttled to members of the leadership team.

As a grader … [need to have a skill to decipher]

Pearson scoring personnel are not “graders.” They aren’t supposed to be “grading” classroom tests; they’re under contract to “score” standardized tests. We need to draw this distinction much more clearly in future articles if we’re going to write about the scoring process.

About the Author

Paul Katula
Paul Katula is the executive editor of the Voxitatis Research Foundation, which publishes this blog. For more biographical information, see the About page.