Machines Shouldn’t Grade Student Writing—Yet
Standardized tests will finally ask good essay questions. But robot grading threatens that progress.
In 2002, Indiana rolled out computer scoring of its 11th-grade state writing exam. At the time, ETS, the company that developed Indiana’s software, said automatic writing assessment could help cut the state’s testing budget in half. But by 2007, Indiana had abandoned the practice.
Why? Though ETS’s E-Rater proved adept at scoring so-called “naked” essays based only on personal opinion, it couldn’t reliably handle questions that required students to demonstrate knowledge from the curriculum. State testing officials tried making lists of keywords the software could scan for: in history, for example, “Queen Isabella,” “Columbus,” and “1492.” But the program didn’t understand the relationships between those items, and so would have given full credit to a sentence like, “Queen Isabella sailed 1,492 ships to Columbus, Ohio.” Cost and time savings never materialized, because most tests also had to be reviewed by human graders.
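The flaw is easy to see in miniature. The following is a deliberately naive sketch (not ETS’s actual algorithm, and the keyword list is just the article’s example) of keyword-based scoring: it checks whether the required terms appear, but knows nothing about the relationships between them, so the nonsense sentence scores just as well as an accurate one.

```python
# Naive keyword scoring: presence of terms, not relationships between them.
# Keyword list and scoring scheme are illustrative only.
KEYWORDS = {"queen isabella", "columbus", "1492"}

def keyword_score(essay: str) -> float:
    """Return the fraction of required keywords found anywhere in the essay."""
    # Lowercase and strip commas so "1,492" matches "1492".
    text = essay.lower().replace(",", "")
    return sum(1 for kw in KEYWORDS if kw in text) / len(KEYWORDS)

accurate = "In 1492, Queen Isabella funded Columbus's voyage to the Americas."
nonsense = "Queen Isabella sailed 1,492 ships to Columbus, Ohio."

# Both essays contain all three keywords, so both earn a perfect score.
print(keyword_score(accurate))  # → 1.0
print(keyword_score(nonsense))  # → 1.0
```

The two sentences are indistinguishable to the scorer, which is exactly the failure mode Indiana’s testing officials ran into.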
Indiana’s experience is worth keeping in mind: although the technology has not advanced dramatically over the past decade, we’re now in the midst of a new whirlwind of enthusiasm about electronic writing assessment. Last month, after a study from Mark Shermis of the University of Akron reported that computer programs and human graders award similar scores to student writing samples, an NPR headline teased, “Can a Computer Program Grade Essays As Well As a Human? Maybe Even Better, Study Says.” Education technology entrepreneur Tom Vander Ark, who co-directed the Shermis study, hailed the results as proof that robo-grading is “fast, accurate, and cost-effective.”
He is right about “fast”: E-Rater can reportedly grade 16,000 essays in 20 seconds. But “accurate” and “cost-effective” are debatable, especially if we want students to write not only about what they think and feel, but also about what they know. Testing companies acknowledge it is easy to game the current generation of robo-graders: Such software rewards longer word counts, unusual vocabulary, transition words such as “however” and “therefore,” and grammatical sentences—whether or not the facts contained within the sentences are correct. To address these problems, the Hewlett Foundation, which also paid for the Shermis study, is offering a $100,000 prize to the team of computer programmers that can make the biggest strides in improving the technology.
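To see why such software is easy to game, consider a toy version of the surface features the article says current robo-graders reward: length, transition words, and vocabulary variety. The feature set and weights below are hypothetical, not E-Rater’s real model; the point is that none of these features checks whether anything in the essay is true.

```python
# A hedged sketch of surface-feature essay scoring. Features and weights
# are illustrative assumptions, not any vendor's actual model.
TRANSITIONS = {"however", "therefore", "moreover", "consequently"}

def surface_score(essay: str) -> float:
    """Score an essay from 0 to 1 using only surface features."""
    words = essay.lower().split()
    length = len(words)
    transitions = sum(1 for w in words if w.strip(".,;") in TRANSITIONS)
    variety = len(set(words)) / length if length else 0.0
    # Weighted sum: longer essays, more transition words, richer vocabulary.
    # Nothing here distinguishes fact from fiction.
    return (0.5 * min(length / 300, 1.0)
            + 0.3 * min(transitions / 5, 1.0)
            + 0.2 * variety)
```

A student who pads an essay with transition words and unusual vocabulary raises every feature this scorer measures, whether or not the sentences say anything correct.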
The recent push for automated essay scoring comes just as we’re on the verge of making standardized essay tests much more sophisticated in ways robo-graders will have difficulty dealing with. One of the major goals of the new Common Core curriculum standards, which 45 states have agreed to adopt, is to supplant the soft-focus “personal essay” writing that currently predominates in American classrooms with more evidence-driven, subject-specific writing. The creators of the Common Core hope machines can soon score these essays cheaply and quickly, saving states money in a time of harsh education budget cuts. But since robo-graders can’t broadly distinguish fact from fiction, adopting such software prematurely could be antithetical to testing students in more challenging essay-writing genres.
Unlike the Common Core, most existing state standardized tests (and the SAT) ask students to reflect on abstract ideas like “patience” or “the benefits of laughter”—two real essay prompts included in the Shermis study. The grading rubrics used by humans to evaluate these essays focus on grammar, organization, tone, and sentence structure. Writers aren’t penalized for factual inaccuracies, so computer programs, which already have an excellent command of grammar and stylistic coherence, are often as reliable graders of these essays as people.
Better state tests give kids reading assignments and then ask them to write about what they have read. Computer programs like Bookette and the Intelligent Essay Assessor are “tuned” to score specific reading-based prompts by being fed sample essays graded by teachers.
But of the eight state standardized-test writing prompts Shermis looked at in his study, none required students to demonstrate knowledge beyond what could be gleaned from a specific text, and four required “relatively content-free” responses. The Common Core, meanwhile, has much higher ambitions for student writing. Here is an example of a Common Core essay prompt—the kind students across the country should be encountering over the next five years:
Compare and contrast the themes and argument found in the Declaration of Independence to those of other U.S. documents of historical and literary significance, such as the Olive Branch Petition.
Brown University computer scientist Eugene Charniak, an expert in artificial intelligence, says it could take another century for computer software to accurately score an essay written in response to a prompt like this one, because it is so difficult for computers to assess whether a piece of writing demonstrates real knowledge across a subject as broad as American history.
Dana Goldstein is a Brooklyn-based journalist, a Schwartz Fellow at the New America Foundation, and a Puffin Fellow at the Nation Institute.