Thesaurus Unbound

If Roget's is becoming a relic, what lies ahead?

When you are searching for a word that is more precise than another though similar in meaning, you don't browse Piozzi's. Yet British Synonymy, the first English book of synonyms, was written by Hester Lynch Piozzi. Nor do you grab your Girard's. Published 76 years before Piozzi, the 1718 book of French words appears to be the first collection of synonyms in any language. What you reach for is your Roget's. Originally published in 1852, having been compiled over the course of more than four decades by the eponymous but strangely anonymous Peter Mark Roget, the thesaurus we know and love was not the first of its kind. Roget's was the sixth or seventh in a line of, well, synonymous—but not identical—compendiums. Now, after a century-and-a-half career as a publishing juggernaut, the bound and beloved version is becoming a historical relic in the computer era.

It's at such turning points that we look back at beginnings, and now the great lexicographers themselves are emerging into the limelight. With his fresh and thoroughly researched account of Roget's life, Joshua Kendall sets out to do for his classic what Simon Winchester did for the OED's dictionarians in The Professor and the Madman, while Noah Webster will get some attention later this year in Websterisms and Righting the Mother Tongue. (Already, Kendall is at work on a longer look at Noah Webster.)*

Kendall reports that Roget married late but happily and lived to 90, yet he was born into a family that was shockingly afflicted by mental illness. His maternal grandmother, mother, and only sister suffered from major depressive episodes. Even Roget's infancy was blighted by tragedy. His father contracted tuberculosis six months after Roget was born. The child was sent to live with his grandfather and didn't see his parents again until he was 2 years old. Roget's father died when he was 4, and the death devastated Roget's mother, Catherine, and financially wrecked their small family.

After uprooting the household several times throughout Roget's childhood, Catherine eventually moved her family to Edinburgh, so her shy and studious teenage son could attend the university. There he was inspired by a number of marvelous scholars and teachers who challenged him to think about the imperfections of language in the pursuit of knowledge. Though he began work on his thesaurus a few years later, Roget didn't attempt to publish his collection of synonyms until he was 73. It appears that he was finally provoked to action by the popularity at the time of Piozzi's British Synonymy, which he thought greatly undeserved.

Predictably, Kendall reaches for a therapeutic analysis: In the midst of this tumult, Roget made lists as a way of exercising some control in his life. One of his first lists, at age 8, was of Latin words and their English translations. As an adult, he kept lists of important events, including "dates of deaths" of friends and relatives, and when he was working as a young doctor in Manchester, England, he began to organize a list of ideas and related synonyms that would one day become the thesaurus. There is surely something to this: No doubt Roget's manipulation of words gave him a feeling of mastery. Still, a relentless drive to itemize and taxonomize isn't always a symptom. It's also the basic character of the scientific mind, which Roget clearly had. In addition to his thesaurus, he achieved renown in his lifetime for a two-volume work classifying plants and animals and for his scientific observations about human optics.

It was precisely that scientific bent that was his book's distinction. The organization of Piozzi's and Girard's, as well as the handful of others that were published before Roget's 1852 thesaurus, was scattershot by comparison. Roget's, which was remarkably successful in its author's lifetime, was a comprehensive system of synonyms and antonyms. Roget built a numbered inventory of 1,000 fundamental ideas, like "existence," each of which appeared with a set of related words: ens, entity, being, existence. Later, he came up with a series of six nested classes, which were inspired by the Linnaean classification of animals. Thus, Kendall writes, " 'Perfection' falls under Class V, 'Words relating to the Voluntary Powers,' Division I, 'Individual Volition,' and Section i, 'Volition in General.' " The higher the level, the more abstract the idea; the lower the level, the more specific. Roget considered his book the opposite of the dictionary: You started with the idea and then found the word. His project was so original and so immense in scope that it has taken not just time but the connectivity, the huge databases, and the broad online access of modern information architecture even to begin to outstrip it.

Exactly where print reference will be in 10 years' time is still murky, but the writing is quite clearly on the wall or, if you prefer, the desktop. A recent survey to be published by the Dictionary Society of North America found that while students use dictionaries as much as they ever did, the online versions have overtaken paper. Many students use and (also, which in November 2007 had 15.1 million unique visitors. Conversely, the 2008 print edition of Quid, formerly one of France's most popular encyclopedias, was canceled last month for want of sales.

Happily, if the computer processing of words is killing reference books, it's also making them better. In particular, word reference is morphing faster and smarter than any other kind of compendium out there. The innovation is not just a matter of a new medium that permits us to get online what we used to turn pages for. There has been an evolutionary leap, too: The digitization of words in time allows us to see language as it really is—not so much an abstract code as a dynamic system.

One of the most important spurs to word research is the increased use of the corpus, the term used to refer to any large body of written or spoken communications, be it a collection of medieval manuscripts or a folder of sound files. Diverse scholars of language have long amassed corpora, such as books on particular topics or writings by particular people, in order to analyze the language of the whole. Before the computer era, corpus work required painstaking, slow tabulation. With a computerized corpus, you can search and count (and run any other kind of linguistic analysis) with greater ease. Corpus linguistics means that the language of thousands of people can be mined by lexicographers, reflecting the facts of English as it is spoken or written by a population, not just English as it was spoken by Peter Mark Roget. If Roget's Thesaurus, along with Webster's and Johnson's original dictionaries, is the idiosyncratic cartography of brilliant 18th- and 19th-century explorers, then this stuff is GPS.
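For readers curious what "search and count" looks like in practice, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the three-sentence corpus is invented, and the helper names (tokenize, word_counts, concordance) are hypothetical rather than drawn from any lexicographer's actual toolkit. A working corpus would hold millions or billions of words, but the two basic operations are the same: tally how often each word occurs, and pull every occurrence of a word along with its neighbors.

```python
import re
from collections import Counter

# Hypothetical three-document corpus, invented for this sketch.
# A real corpus would hold millions of words.
CORPUS = [
    "The eccentric millionaire kept to his house.",
    "She has a quirky style and a quirky laugh.",
    "An eccentric recluse, he rarely spoke.",
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(docs):
    """Count how often each word occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def concordance(docs, keyword, width=2):
    """Return each occurrence of keyword with up to `width` words on either side."""
    hits = []
    for doc in docs:
        tokens = tokenize(doc)
        for i, token in enumerate(tokens):
            if token == keyword:
                start = max(0, i - width)
                hits.append(" ".join(tokens[start:i + width + 1]))
    return hits


counts = word_counts(CORPUS)
print(counts["eccentric"], counts["quirky"])
for line in concordance(CORPUS, "eccentric", width=1):
    print(line)
```

The second function is a bare-bones version of what linguists call a keyword-in-context listing: seeing a word beside its neighbors, over and over, is what lets a lexicographer notice which company it tends to keep.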

While computerized corpus research has grown since the '60s, every few years brings greater sophistication to the field. In addition to widely used corpora like WordNet and the Bank of English, which contain millions of words, lexicographers can drill further into language with specialized databases, like the Enron corpus (the company's internal e-mails) or a corpus of suicide notes. Oxford University Press relies on the Oxford English Corpus, a 2-billion-word database begun in the year 2000 to capture 21st-century English. The corpus is used to update products like the Oxford American Writer's Thesaurus. A forthcoming edition will include nuanced distinctions between words like eccentric and quirky. A traditional thesaurus would simply list these words together, but eccentric is typically used of the very rich or reclusive whereas quirky is used less about people than about their style.

One of the most staggering advances in word reference is the forthcoming Historical Thesaurus of English, which will list all the words of English—modern English words as well as long-gone versions, such as the English spoken in the year 1000. The Historical Thesaurus will be a kind of companion to the famous 20-volume etymological Oxford English Dictionary. In fact, it uses the words from the OED, but, like Roget's, it is subdivided into topics, starting with the most general and then branching into ideas of greater specificity. The very patterns of word usage will document the popularity of ideas throughout history. The Historical Thesaurus will be produced in book format as well as online. Unbelievably, the project was begun pre-computer, in 1964, by a group of brave souls at the University of Glasgow. But given the number of cross-references, the fuzziness of categories, and the massive challenges of manipulation, it's hard to imagine it being finished without computers.

As old as it is, Roget's system of classification still gets play in the research world. Some lexicographers are importing his idea structure into language databases to solve word processing problems, like disambiguating words with more than one meaning. Others are trying to reorganize language data from the ground up, ripping out the hierarchical structure of Roget-style classification and replacing it with more realistically overlapping groups of ideas.

Of course, the entire Web can be used as a corpus, and its ever-changing nature makes it a particularly valuable one. Dictionaries and thesauri, even online versions, record words that remain constant over periods of time. But language is a roiling thing, and its dynamism is sometimes of the moment, not just the year or the century. Bloggers, like Mark Peters, track "nonce" words, which, by definition, have very short life spans. Speakers drop these words almost as soon as they pick them up, so few will be recorded on paper, yet they are still real words. Indeed, endlessish and crapportunity are not just the low-hanging fruit of the new world of word reference; they represent our fundamentally changed relationship with our own language. In the spirit of Roget, if not by his book, all speakers can now freely access information not only about the perfect Platonic classifications of words but also, crucially, about how they live.

Correction, April 8, 2008: This story mistakenly stated that Joshua Kendall is at work on a new book on Samuel Johnson. He's writing on Noah Webster.

Christine Kenneally is the author of The First Word: The Search for the Origins of Language. Her writings can be found on the blog

