Blogging the Stanford Machine Learning Class

Could I Use the Stanford Machine Learning Class To Create a Robot Journalist?
Nov. 16 2011 3:42 PM

Could I start using this stuff to create an automated journalist?

Illustration by Paul Fleet/iStockphoto.

It occurred to me halfway through this week’s machine learning lectures that I could actually use the stuff we’re learning for something other than the homework assignments. I was reminded of a poster hanging in my seventh-grade math class titled “When will I ever need to know this stuff?” with a long list of professions and how they made use of trigonometry, algebra, fractions, and so forth—all of which was irrelevant to me at the time since I was headed to play for the Philadelphia Phillies.

Since I got stuck in journalism instead, I’ve been thinking about how machine learning might be of use to the field. We’re still on the subject of neural networks, which use a tremendous series of matrices and logarithms to simulate the way your brain learns. All of this would have been extremely useful when my friend Farhad Manjoo and I were trying to write a program called Robottke for Slate that would predict which Web pages popular blogger Jason Kottke would link to in a given day.

All the machine-learning algorithms we’ve covered in class so far have essentially worked the same way: First, you identify a set of real-world data that you’d like to “learn”—that is, simulate in a model. The operating example we’ve been using is housing prices, in which you have, say, 5,000 examples of houses with information like square footage, median income of the neighborhood, number of bathrooms, and so forth, as well as how much they sold for. There’s no limit to how complex you can get, up to hundreds or thousands of data points per house. This is known as your “training set”—the information you’ll feed to your algorithm so that it can slowly adjust its many parts—that is, “learn” the quirks of the housing market. At the end, you’ll be able to test its quality by running all these real-world examples through it and seeing how well it does at getting the right answer, or an answer close to the right one. If it does well, you hope you can trust it to predict prices for homes not yet on the market.
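That loop can be sketched in a few lines. What follows is a deliberately tiny, hypothetical version of the housing example: one feature instead of hundreds (square footage, scaled down to thousands so that plain gradient descent behaves), made-up prices, and a straight-line model rather than a neural network.

```python
# A toy "training set": (square feet / 1,000, sale price in $1,000s).
# Both the houses and the prices are invented for illustration.
training_set = [(1.0, 200), (1.5, 270), (2.0, 340), (2.5, 410), (3.0, 480)]

def train(data, lr=0.1, epochs=10_000):
    """Fit price = w * sqft + b by slowly adjusting w and b to shrink
    the mean squared error on the training set -- the 'learning' step."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in data) * 2 / n
        grad_b = sum(w * x + b - y for x, y in data) * 2 / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = train(training_set)
# Test the model on the training examples, then use it to price a
# hypothetical 2,200-square-foot house not in the training set.
print(w * 2.2 + b)
```

The same recipe scales up: more features per house means more weights to adjust, and a neural network stacks many such weighted sums in layers, but the train-then-check rhythm is identical.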


In the case of Robottke, we already had a very thorough “training set” of real blog posts by Kottke going back five years, with information like how often he would link to a given site, what keywords he likes to assign to his posts, and which bloggers and Twitter users he frequently turned to for ideas. Lacking any experience in machine learning, our algorithm was blissfully crude, assigning weights to these factors somewhat arbitrarily until the results looked reasonably accurate. (Essentially, we were doing the same thing a machine-learning algorithm does, just very badly.)
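Our blissfully crude approach can be caricatured in a few lines. Everything below is invented for illustration, the feature names and weights especially; the point is only that hand-picking the numbers is exactly the step a learning algorithm would do for you.

```python
# Hypothetical hand-tuned weights, adjusted "somewhat arbitrarily
# until the results looked reasonably accurate."
WEIGHTS = {"past_links_to_site": 3.0, "keyword_overlap": 2.0, "shared_source": 1.5}

def score(link_features):
    """Weighted sum of a candidate link's features."""
    return sum(WEIGHTS[name] * value for name, value in link_features.items())

# A made-up candidate link and an equally made-up cutoff.
candidate = {"past_links_to_site": 4, "keyword_overlap": 1, "shared_source": 0}
print(score(candidate) > 5.0)  # include it in the daily feed?
```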

Had I taken this class first, I might have been able to construct a neural network to take in all this information and truly algorithmize Kottke, to the point that, every day, I could feed Robottke thousands of candidate links, casting a wide net, and get back a “yes” or a “no” from the program about whether to include each one in Robottke’s daily feed. I emphasize “might” because it’s a lot easier to do this in the controlled environments that Professor Ng provides us for the programming assignments, complete with helpful hints about which programming functions to use and benchmark values to make sure we’re on the right track. But in theory, this isn’t necessary. One of the beauties of machine learning is that it’s easy to tell whether you did a good job, because when you feed your original training data back into the finished product—the data for which you already know the correct answer, since it came from real life—it should predict the answer correctly at least 95 percent of the time. (Curiously, if it predicts the right answer 100 percent of the time, you should worry—it may mean your algorithm is too neatly tailored to the precise training examples, and therefore unequipped to handle data it’s never seen before.)
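That last caveat is easy to demonstrate. The sketch below, on made-up data, uses a one-nearest-neighbor classifier, a model that simply memorizes its examples. It scores 100 percent on its own training set by construction, because every training example is its own closest match, and that perfect score tells you nothing about links it has never seen.

```python
def knn_predict(examples, x, k=1):
    """Classify x by majority vote among its k nearest training examples."""
    nearest = sorted(examples, key=lambda pair: abs(pair[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

# Invented data: feature is a page's word count, label is link or skip.
examples = [(120, "skip"), (480, "link"), (95, "skip"), (510, "link"), (300, "skip")]

# With k=1, each training example is its own nearest neighbor, so
# training accuracy is automatically perfect -- memorization, not learning.
train_acc = sum(knn_predict(examples, x) == y for x, y in examples) / len(examples)
print(train_acc)
```

This is why a suspiciously perfect score on the training data is a warning sign rather than a triumph.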

The possibilities abound. Newspaper and magazine websites are always trying to figure out how to make readers maximally likely to click on stories, and then on more stories. There is some empirical research going on in this field, like randomly serving readers one of two headlines to see which is more enticing, but a machine could do much better, and tell us something about what readers really want in the process.

Grades*: 5/5
Review Questions: 100/100 (programming exercise)
*Note: Both grades were docked 20 percent because I missed the deadline, for which I received an automated email reprimand. I’ll get back on track, Professor Ng!


