Future Tense

Machines Shouldn’t Grade Student Writing—Yet

Standardized tests will finally ask good essay questions. But robot grading threatens that progress.

Photo caption: The continuing evolution of asking—and grading—standardized-test essay questions. (Students writing; Jack Hollingsworth/Thinkstock Images.)

In 2002, Indiana rolled out computer scoring of its 11th-grade state writing exam. At the time, ETS, the company that developed Indiana’s software, said automated writing assessment could help cut the state’s testing budget in half. But by 2007, Indiana had abandoned the practice.

Why? Though ETS’s E-Rater proved adept at scoring so-called “naked” essays based only on personal opinion, it couldn’t reliably handle questions that required students to demonstrate knowledge from the curriculum. State testing officials tried making lists of keywords the software could scan for: in history, for example, “Queen Isabella,” “Columbus,” and “1492.” But the program didn’t understand the relationship between those items, and so would have given full credit to a sentence like, “Queen Isabella sailed 1,492 ships to Columbus, Ohio.” The promised cost and time savings never materialized, because most tests still had to be reviewed by human graders.
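To see concretely why keyword scanning breaks down, consider a minimal sketch in Python (an illustration, not Indiana’s or ETS’s actual code): a scorer that checks only whether the expected terms appear hands full credit to the nonsense sentence above.

    # Toy keyword scanner, invented for illustration; not E-Rater's code.
    # It checks only whether expected terms appear, not how they relate.
    KEYWORDS = {"queen isabella", "columbus", "1492"}

    def keyword_score(essay):
        """Fraction of expected keywords present, with commas stripped."""
        text = essay.lower().replace(",", "")
        return sum(kw in text for kw in KEYWORDS) / len(KEYWORDS)

    nonsense = "Queen Isabella sailed 1,492 ships to Columbus, Ohio."
    print(keyword_score(nonsense))  # 1.0 -- full credit for a false claim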

Indiana’s experience is worth keeping in mind: even though the technology has not advanced dramatically over the past decade, we’re now in the midst of a new whirlwind of enthusiasm about electronic writing assessment. Last month, after a study from Mark Shermis of the University of Akron found that computer programs and people award student-writing samples similar grades, an NPR headline teased, “Can a Computer Program Grade Essays As Well As a Human? Maybe Even Better, Study Says.” Education technology entrepreneur Tom Vander Ark, who co-directed the Shermis study, hailed the results as proof that robo-grading is “fast, accurate, and cost-effective.”

He is right about “fast”: E-Rater can reportedly grade 16,000 essays in 20 seconds. But “accurate” and “cost-effective” are debatable, especially if we want students to write not only about what they think and feel, but also about what they know. Testing companies acknowledge it is easy to game the current generation of robo-graders: Such software rewards longer word counts, unusual vocabulary, transition words such as “however” and “therefore,” and grammatical sentences—whether or not the facts contained within the sentences are correct. To address these problems, the Hewlett Foundation, which also paid for the Shermis study, is offering a $100,000 prize to the team of computer programmers that can make the biggest strides in improving the technology.
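To make the gaming problem concrete, here is a deliberately simplified sketch of the surface features such software rewards; the word lists and weights are invented for illustration, and nothing in the calculation asks whether a sentence is true.

    # Simplified surface-feature scorer; word lists and weights are invented.
    # Length, unusual vocabulary, and transition words all raise the score,
    # while factual accuracy never enters the calculation.
    TRANSITIONS = {"however", "therefore", "moreover", "consequently"}
    COMMON = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "it"}

    def surface_score(essay):
        words = essay.lower().replace(".", " ").replace(",", " ").split()
        long_rare = sum(1 for w in words if w not in COMMON and len(w) > 8)
        transitions = sum(1 for w in words if w in TRANSITIONS)
        return 0.1 * len(words) + 1.0 * long_rare + 2.0 * transitions

An essay padded with long words and transition phrases outscores a shorter, factually accurate one under a rubric like this.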

The recent push for automated essay scoring comes just as we’re on the verge of making standardized essay tests much more sophisticated in ways robo-graders will have difficulty dealing with. One of the major goals of the new Common Core curriculum standards, which 45 states have agreed to adopt, is to supplant the soft-focus “personal essay” writing that currently predominates in American classrooms with more evidence-driven, subject-specific writing. The creators of the Common Core hope machines can soon score these essays cheaply and quickly, saving states money in a time of harsh education budget cuts. But since robo-graders can’t broadly distinguish fact from fiction, adopting such software prematurely could undermine the push to test students in more challenging essay-writing genres.

Unlike the Common Core, most existing state standardized tests (and the SAT) ask students to reflect on abstract ideas like “patience” or “the benefits of laughter”—two real essay prompts included in the Shermis study. The grading rubrics used by humans to evaluate these essays focus on grammar, organization, tone, and sentence structure. Writers aren’t penalized for factual inaccuracies, so computer programs, which already have an excellent command of grammar and stylistic coherence, are often as reliable graders of these essays as people.

Better state tests give kids reading assignments and then ask them to write about what they have read. Computer programs like Bookette and the Intelligent Essay Assessor are “tuned” to score specific reading-based prompts by being fed sample essays graded by teachers.

But of the eight state standardized-test writing prompts Shermis looked at in his study, none required students to demonstrate knowledge beyond what could be gleaned from a specific text, and four required “relatively content-free” responses. The Common Core, meanwhile, has much higher ambitions for student writing. Here is an example of a Common Core essay prompt—the kind students across the country should be encountering over the next five years:

Compare and contrast the themes and argument found in the Declaration of Independence to those of other U.S. documents of historical and literary significance, such as the Olive Branch Petition.

Brown University computer scientist Eugene Charniak, an expert in artificial intelligence, says it could take another century for computer software to accurately score an essay written in response to a prompt like this one, because it is so difficult for computers to assess whether a piece of writing demonstrates real knowledge across a subject as broad as American history.

Still, when it comes to solving tough computer programming problems, sometimes scientists discover a shortcut that allows them to make strides much faster than they assumed they could. A good example comes from another, closely related branch of the “natural language processing” field—the effort to create more accurate grammar checkers inside software like Microsoft Word. From the 1950s until the late 1990s, programmers believed good grammar checks would require a computer to truly “understand” English, so they tried to approximate the “listening” process babies and children undergo as they learn to speak, hand-coding vocabulary definitions and grammar rules into the computer. The problem was that computers turned out to be terrible at colloquialisms, metaphors, aphorisms, tone, clarity, and all the other non-rule-based features that make a language both lively and correct—and that native speakers intrinsically grasp.

So instead of teaching computers vocabulary and grammar, programmers tried scanning thousands of pages of text into the computers, and then used statistics to analyze the probabilities of various word and sound groupings. It worked. This innovation led not only to improved grammar checking and to the simple automated writing assessment we have today, but also to software like Google Translate, which, while imperfect, far outperforms previous generations of translation tools like AltaVista’s BabelFish.
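A toy version of that statistical turn might look like the following sketch; the two-sentence corpus and raw bigram counts are stand-ins for the enormous text collections and smoothing techniques real systems use, but the principle (count what writers actually produce and flag the improbable) is the same.

    # Toy bigram model: estimate P(word | previous word) from a corpus
    # instead of hand-coding grammar rules. The corpus is a stand-in;
    # real systems train on millions of sentences and smooth the counts.
    from collections import Counter

    corpus = "the cat sat on the mat . the dog sat on the rug .".split()
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)

    def bigram_prob(prev, word):
        """P(word | prev), estimated from raw corpus counts."""
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

    print(bigram_prob("sat", "on"))  # 1.0: a grouping the corpus supports
    print(bigram_prob("on", "cat"))  # 0.0: never observed, so it looks suspect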

Could there be a similar innovation on the frontier of essay grading—one that would allow computers to more accurately score even sophisticated forms of writing? A paper by ETS’s Derrick Higgins and Beata Beigman Klebanov points to a potential path forward: using Web databases of human knowledge, like online encyclopedias and news repositories, to check how factual and intellectually sophisticated an essay truly is.

An experimental program called the Stanford Named Entity Recognizer can pick out proper nouns like “Chaucer” and “Albert Einstein” with 82 percent precision. Another program, called ReVerb, can recognize about one-third of the “facts” writers present on topics such as the century in which Chaucer lived (the 14th) and Einstein’s most famous scientific contribution (the theory of relativity). Since computers can already recognize phrases that hint at an argument—such as “caused by” and “led to”—it isn’t inconceivable that in coming years, a program will be able to search Web sources on a certain topic, and then use its findings to assess the plausibility of a writer’s assertions.
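A stripped-down sketch of that idea follows; the two-entry “knowledge base” and the claim format are invented for illustration and stand in for the Web-scale databases the ETS researchers have in mind (this is not the Stanford or ReVerb code).

    # Sketch of checking extracted claims against a reference database.
    # The knowledge base and claim format are invented for illustration;
    # real systems would mine encyclopedias and news archives at scale.
    KNOWLEDGE_BASE = {
        ("chaucer", "lived in century"): "14th",
        ("einstein", "famous for"): "theory of relativity",
    }

    def check_claim(subject, relation, claimed):
        """True if the claimed value matches the reference database."""
        expected = KNOWLEDGE_BASE.get((subject.lower(), relation))
        return expected is not None and expected in claimed.lower()

    print(check_claim("Chaucer", "lived in century", "the 14th century"))    # True
    print(check_claim("Einstein", "famous for", "inventing the light bulb")) # False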

Currently, however, computers struggle with determining how trustworthy various Web sources are, and they can’t weigh or synthesize competing claims from good sources. The ETS researchers cite the example of a real-life grad school applicant who argued in an essay that “Albert Einstein’s accidental development of the atomic bomb has created a belligerent technological front.” Historians and scientists debate the nature of Einstein’s role in the development of the atomic bomb, and human graders could certainly argue endlessly about whether the writer’s use of the words “accidental” and “belligerent” is historically justified in this instance (or whether his deployment of the perfect tense is grammatically sound).

Wes Bruce, Indiana’s chief assessment officer, has concluded that the technology is promising but that it must improve before it can be used on exams that are both high-quality and high-stakes: the kind that not only test for knowledge, but also determine whether students graduate from high school, or whether teachers receive high “value-added” scores for raising student achievement. Artificial-intelligence scoring, he says, is “pretty artificial, not too intelligent.”

For now, at least.           

This article arises from Future Tense, a collaboration among Arizona State University, the New America Foundation, and Slate. Future Tense explores the ways emerging technologies affect society, policy, and culture. To read more, visit the Future Tense blog and the Future Tense home page. You can also follow us on Twitter.