Blogging the Stanford Machine Learning Class

Could I Use the Stanford Machine Learning Class To Create a Robot Journalist?
What's to come?
Nov. 16, 2011, 3:42 PM

Could I start using this stuff to create an automated journalist?

Robot writer. Illustration by Paul Fleet/iStockphoto.

It occurred to me halfway through this week’s machine learning lectures that I could actually use the stuff we’re learning for something other than the homework assignments. I was reminded of a poster hanging in my seventh-grade math class titled “When will I ever need to know this stuff?” with a long list of professions and how they made use of trigonometry, algebra, fractions, and so forth—all of which was irrelevant to me at the time since I was headed to play for the Philadelphia Phillies.

Since I got stuck in journalism instead, I’ve been thinking about how machine learning might be of use to the field. We’re still on the subject of neural networks, which use a tremendous series of matrices and logistic functions to simulate the way your brain learns. All of this would have been extremely useful when my friend Farhad Manjoo and I were trying to write a program called Robottke for Slate that would predict which Web pages popular blogger Jason Kottke would link to on a given day.
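
(For the curious: Here’s a bare-bones sketch, in Python with NumPy rather than the Octave the course uses, of what “a series of matrices and logistic functions” actually looks like. The layer sizes and random weights are arbitrary, purely for illustration.)

```python
import numpy as np

def sigmoid(z):
    """The logistic function: squashes any number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

x = rng.normal(size=3)        # three input features for one example
W1 = rng.normal(size=(4, 3))  # weights: input layer -> 4 hidden units
W2 = rng.normal(size=(1, 4))  # weights: hidden layer -> 1 output unit

hidden = sigmoid(W1 @ x)      # a matrix multiply, then the logistic squash
output = sigmoid(W2 @ hidden) # and once more for the final answer

print("network output (between 0 and 1):", output)
```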

All the machine-learning algorithms we’ve covered in class so far have essentially worked the same way: First, you identify a set of real-world data that you’d like to “learn”—that is, simulate in a model. The operating example we’ve been using is housing prices, in which you have, say, 5,000 examples of houses with information like square footage, median income of the neighborhood, number of bathrooms, and so forth, as well as how much they sold for. There’s no limit to how complex you can get, up to hundreds or thousands of features per house. This is known as your “training set”—the information you’ll feed to your algorithm so that it can slowly adjust its many parts—that is, “learn” the quirks of the housing market. At the end, you can test its quality by running real-world examples through it (ideally ones you held back from training) and seeing how well it does at getting the right answer, or an answer close to the right one. If it does well, you hope you can trust it to predict prices for homes not yet on the market.
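
To make that concrete, here’s a minimal sketch of the housing example in Python with NumPy (the course itself uses Octave). The houses, features, and prices below are invented, and the learning method is the gradient descent from class: nudge the model’s parameters, over and over, in whatever direction shrinks its prediction error.

```python
import numpy as np

# A tiny, made-up training set. Each row is one house:
# [square footage, bedrooms, bathrooms].
X = np.array([
    [2104.0, 5, 2],
    [1416.0, 3, 2],
    [1534.0, 3, 1],
    [852.0,  2, 1],
], dtype=float)
y = np.array([460.0, 232.0, 315.0, 178.0])  # sale prices, in $1,000s

# Scale the features so gradient descent behaves, then add an
# intercept column of ones.
X = (X - X.mean(axis=0)) / X.std(axis=0)
X = np.hstack([np.ones((X.shape[0], 1)), X])

# Gradient descent: repeatedly adjust the parameters theta to shrink
# the average squared gap between predicted and actual prices.
theta = np.zeros(X.shape[1])
alpha = 0.1  # learning rate: how big a nudge to take each step
for _ in range(1000):
    predictions = X @ theta
    gradient = X.T @ (predictions - y) / len(y)
    theta -= alpha * gradient

print("learned parameters:", theta)
print("predicted prices:  ", X @ theta)
print("actual prices:     ", y)
```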

In the case of Robottke, we already had a very thorough “training set” of real blog posts by Kottke going back five years, with information like how often he would link to a given site, what keywords he likes to assign to his posts, and which bloggers and Twitter users he frequently turned to for ideas. Since we lacked any experience in machine learning, our algorithm was blissfully crude, assigning weights to these factors somewhat arbitrarily until the results looked reasonably accurate. (Essentially, we were doing the same thing a machine-learning algorithm does, just very badly.)

Had I taken this class first, I might have been able to construct a neural network that took in all the information we had and truly algorithmized Kottke, to the point that every day I could feed Robottke thousands of links he might conceivably be interested in, casting a wide net, and get back a “yes” or a “no” from the program about whether each one belonged in Robottke’s daily feed. I emphasize “might” because it’s a lot easier to do this in the controlled environments professor Ng provides us for the programming assignments, complete with helpful hints about which programming functions to use and benchmark values to make sure we’re on the right track. But in theory, that scaffolding isn’t necessary. One of the beauties of machine learning is that it’s easy to tell whether you did a good job: Hold back a slice of your real-world examples—data for which you already know the correct answer, since it came from real life—and your finished product should predict those answers correctly at least 95 percent of the time. (Curiously, if it predicts its original training examples correctly 100 percent of the time, you should worry—it may mean your algorithm is too neatly tailored to those precise examples, and therefore unequipped to handle data it’s never seen before.)
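
That last warning is easy to demonstrate. The sketch below builds the laziest possible classifier—one that just memorizes its training examples and answers by copying the nearest one—on invented data. It scores a perfect 100 percent on the data it has seen and considerably worse on the quarter we held back.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 fake examples with 5 features each; the true label depends only
# on the first feature, plus some noise.
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Hold back a quarter of the labeled data for testing.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

def predict(x):
    """Label a point by copying its single nearest training example."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(distances)]

train_acc = np.mean([predict(x) == t for x, t in zip(X_train, y_train)])
test_acc = np.mean([predict(x) == t for x, t in zip(X_test, y_test)])

print(f"accuracy on training data: {train_acc:.0%}")  # a perfect 100%
print(f"accuracy on held-out data: {test_acc:.0%}")   # noticeably lower
```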

The possibilities abound. Newspaper and magazine websites are always trying to figure out ways to make readers maximally likely to click on stories and then click on more stories. There is some empirical research going on in this field, like randomly serving readers one of two headlines to see which is more enticing, but a machine could do much better, and it could tell us something about what readers really want in the process.
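
(For a sense of what that headline experiment looks like in practice, here’s a rough simulation in Python. The headlines, traffic, and click rates are all hypothetical; the last line applies a standard two-proportion z-test to ask whether the observed gap is bigger than chance.)

```python
import math
import random

random.seed(42)

# Hypothetical true click probabilities for each headline.
true_rates = {"Headline A": 0.030, "Headline B": 0.036}
shown = {h: 0 for h in true_rates}
clicked = {h: 0 for h in true_rates}

# Serve 20,000 readers a headline chosen at random.
for _ in range(20_000):
    headline = random.choice(list(true_rates))
    shown[headline] += 1
    if random.random() < true_rates[headline]:
        clicked[headline] += 1

for h in true_rates:
    print(f"{h}: {clicked[h]}/{shown[h]} shown = "
          f"{clicked[h] / shown[h]:.2%} click rate")

# Two-proportion z-test: is the difference bigger than random noise?
p1 = clicked["Headline A"] / shown["Headline A"]
p2 = clicked["Headline B"] / shown["Headline B"]
pooled = sum(clicked.values()) / sum(shown.values())
se = math.sqrt(pooled * (1 - pooled)
               * (1 / shown["Headline A"] + 1 / shown["Headline B"]))
print(f"z = {(p2 - p1) / se:.2f}  (roughly, |z| > 1.96 means the gap is real)")
```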

Grades*: 5/5 (review questions); 100/100 (programming exercise)
*Note: Both grades docked 20 percent since I missed the deadline, for which I received an automated email reprimand. I’ll get back on track, professor Ng!
