Can You Identify an Author By How Often They Use the Word “The”?
Is it possible to identify the author of a book based on the frequency of just a few basic words such as “such,” “as,” or “or”? Do novelists like Ernest Hemingway and J.K. Rowling have such a unique literary fingerprint that their writing can always be detected, even if they change both genre and intended audience?
In my book, Nabokov’s Favorite Word Is Mauve, I dive deep into the question of finding a fingerprint for all writers. I look at a handful of cases that will catch the interest of any book nerd. How obvious should it have been, after a little bit of word-counting, that Stephen King was writing undetected for years under the pseudonym Richard Bachman? If we look at an author like James Patterson, who is credited with co-writing most of his books, does the writing more closely resemble his own solo books or those of his co-authors?
The numbers show that writers do have a unique pattern of writing, and that their word usage is both consistent and predictable. To demonstrate this, I didn’t rely on any advanced machine learning or overly complicated formulas. Instead, I looked at the work of Frederick Mosteller and David L. Wallace, who wrote on the topic in the 1960s and relied on methods so simple that they were actually cutting words out of paper to count them. They were interested in the Federalist Papers, a series of anonymously published essays arguing for the ratification of the Constitution. The authorship of some of those essays had been a mystery for over 150 years. Alexander Hamilton and James Madison each claimed later in life to be the author of the 12 “disputed essays,” but the numbers showed that only Madison’s claim held up.
Mosteller and Wallace treated words like independent and identically distributed random variables (think of a dice roll or a coin flip), which meant that if writers changed their style, either by natural evolution or because they were writing a different type of book, their formulas would not work. But as it turns out, writers are unwilling and unable to change. When I ran tens of thousands of tests on hundreds of books by classic and popular authors, the Mosteller and Wallace method was over 99 percent accurate at determining the author.
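To make the dice-roll idea concrete, here is a minimal Python sketch of the first step: reducing a text to how often it uses a few basic words. It is an illustration only, not the code behind the book or the app. The word list, the tokenizer, and the function names (`tokenize`, `rates_per_thousand`) are assumptions made for this example, and the list is far shorter than the 70 words Mosteller and Wallace used.

```python
import re
from collections import Counter

# Illustrative word list (an assumption for this sketch, not the original 70 words).
FUNCTION_WORDS = ["the", "and", "of", "to", "upon", "while"]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def rates_per_thousand(text, words=FUNCTION_WORDS):
    """Return how many times each word appears per 1,000 words of text."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text contains no words")
    counts = Counter(tokens)
    total = len(tokens)
    return {w: 1000 * counts[w] / total for w in words}


if __name__ == "__main__":
    sample = "The cat and the dog."
    print(rates_per_thousand(sample))
```

Because each word is treated like an independent draw, word order is thrown away entirely. All that is left of a book is a short table of rates, and those rates are what get compared between authors.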
To push the limits of the method, I downloaded all the novel-length stories of the fifty most prolific Twilight fan-fiction authors on FanFiction.net, each of whom had written more words in the Twilight universe than Stephenie Meyer herself. I thought that if you compared writers writing at about the same time as each other in the same universe about the same characters and in the emulated style of one author, the word frequencies might not have as much determinative value. Instead, the method was 99.7 percent accurate at picking out who the true fan-fiction author of each story was.
How does this work so well? The easiest way to demonstrate it is visually. Below, I’ve included the books of three authors: Michael Connelly, Louise Penny, and J.K. Rowling. All three are best-selling authors, and I limited Rowling’s works to the detective novels written under the pen name Robert Galbraith so that I would be looking solely at detective novels.
The plot below looks at the frequency of 70 basic words that Mosteller and Wallace identified in their original paper over 50 years ago. Click on different words to change the graph. As you can see, Rowling’s Robert Galbraith books (dark blue) mostly fall very close to her Harry Potter books. The variation between books by the same author is much smaller than variation between the different authors.
Even if the books weren’t marked with colors above, most of the time you could look at just two words and guess who wrote each book. Now imagine if you had the ability to look at dozens of words at once. With 70 words adding evidence to authorship, and a simple but clever formula to balance the informativeness of each word, it’s no wonder Mosteller and Wallace’s methods are so accurate.
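One way to picture how many words can each add a little evidence is the simplified sketch below. It models each word count as a Poisson draw and sums the log-likelihood differences between two candidate authors. This is a stand-in chosen for brevity, not Mosteller and Wallace's exact formula, and every rate and count in it is made up for illustration. The function names are likewise hypothetical.

```python
import math


def poisson_log_likelihood(count, rate_per_thousand, n_words):
    """Log-probability of seeing `count` uses of a word in `n_words` words."""
    # A small floor on the rate avoids log(0) for words an author never uses.
    expected = max(rate_per_thousand, 0.01) * n_words / 1000
    return count * math.log(expected) - expected - math.lgamma(count + 1)


def log_odds(counts, n_words, rates_a, rates_b):
    """Sum per-word evidence: positive favors author A, negative favors author B."""
    total = 0.0
    for word, count in counts.items():
        total += poisson_log_likelihood(count, rates_a[word], n_words)
        total -= poisson_log_likelihood(count, rates_b[word], n_words)
    return total


if __name__ == "__main__":
    # Made-up rates per 1,000 words for two hypothetical authors.
    rates_a = {"upon": 3.0, "the": 90.0}
    rates_b = {"upon": 0.2, "the": 94.0}
    # Made-up counts from a 2,000-word passage of unknown authorship.
    counts = {"upon": 6, "the": 180}
    score = log_odds(counts, 2000, rates_a, rates_b)
    print("favors A" if score > 0 else "favors B", round(score, 2))
```

In this toy example nearly all the evidence comes from “upon,” where the two hypothetical authors differ sharply, while “the” barely moves the score. That is the sense in which a formula can weigh words by how informative they are.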
I wanted to update their work for 2017, and also give everyone a chance to play around with Mosteller and Wallace’s own findings. I’ve built an app, called the MoWa Literary Fingerprint, which you can download today for your iPhone. Take a picture of a page from a book, and the app will convert the image to a text file and then run the numbers to guess who the author is.
Watch the video below for a full explanation.
The app isn’t as accurate as 99 percent, as it is just looking at a few hundred words at a time instead of tens of thousands. An unusual number of “and”s or “but”s on one page could throw off the methods. But if you use the head-to-head feature, it should still be accurate around 85 percent of the time. (Taking a clean picture will increase the accuracy.) You might be impressed at how few words can be a tip-off to authorship. Remember, the app is not using stored phrases or known passages, just a few word frequencies to make its prediction.
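A back-of-the-envelope calculation shows why a single page is so much noisier than a whole book. Under the same dice-roll assumption, the measured rate of a word in a sample of n words has a standard error of sqrt(p(1 - p)/n), where p is the writer's true rate. The sketch below assumes a rate of 3 percent for “and” and sample sizes of 300 and 50,000 words. Both are illustrative guesses, not figures from the app.

```python
import math


def rate_standard_error(p, n_words):
    """Standard error of a word's measured rate in a sample of n_words words."""
    return math.sqrt(p * (1 - p) / n_words)


if __name__ == "__main__":
    p_and = 0.03  # assumed true rate of "and" (illustrative only)
    for n in (300, 50_000):
        se = rate_standard_error(p_and, n)
        # Relative error: how large the noise is compared with the rate itself.
        print(f"{n} words: relative error about {se / p_and:.0%}")
```

With these assumed numbers, the noise on one page is roughly a third of the rate itself, while across 50,000 words it shrinks to a few percent.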
The Mosteller and Wallace methodology is interesting for many reasons related to identifying authors of text. But perhaps the most important conclusion you can draw is that it works because writers have a singular voice they will not or cannot change. Using the knowledge that writers have an immutable pace and style, we can find out so much else about our favorite writers. We can see which authors write the most complex sentences, who uses the most cliches, which adult novelist writes at the simplest grade level, who has the most predictable sentence structure, and who uses exclamation points the least and most.
The book is a celebration of the finer points of writing, from an analytical point of view, that is only possible because of Mosteller and Wallace’s original paper and findings. If you don’t believe in the stable style of writers, download the app today on your phone and give it a whirl on your personal bookshelf.