Blogging the Stanford Machine Learning Class
Entry 5: Could I start using this stuff to create an automated journalist?
Illustration by Paul Fleet/iStockphoto.
It occurred to me halfway through this week’s machine learning lectures that I could actually use the stuff we’re learning for something other than the homework assignments. I was reminded of a poster hanging in my seventh-grade math class titled “When will I ever need to know this stuff?” with a long list of professions and how they made use of trigonometry, algebra, fractions, and so forth—all of which was irrelevant to me at the time since I was headed to play for the Philadelphia Phillies.
Since I got stuck in journalism instead, I’ve been thinking about how machine learning might be of use to the field. We’re still on the subject of neural networks, which use a tremendous series of matrices and logarithms to simulate the way your brain learns. All of this would have been extremely useful when my friend Farhad Manjoo and I were trying to write a program called Robottke for Slate that would predict which Web pages popular blogger Jason Kottke would link to in a given day.
All the machine-learning algorithms we’ve covered in class so far have essentially worked the same way: First, you identify a set of real-world data that you’d like to “learn,” that is, simulate in a model. The operating example we’ve been using is housing prices, in which you have, say, 5,000 examples of houses with information like square footage, median income of the neighborhood, number of bathrooms, and so forth, as well as how much they sold for. You can get as complex as you like, with hundreds or even thousands of data points per house. This is known as your “training set”: the information you’ll feed to your algorithm so that it can slowly adjust its many parts and, in doing so, “learn” the quirks of the housing market. At the end, you’ll be able to test its quality by running all these real-world examples through it and seeing how well it does at getting the right answer, or an answer close to the right one. If it does well, you hope you can trust it to predict prices for homes not yet on the market.
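For readers who want to see that loop in miniature, here is a rough Python sketch under some loud assumptions: the houses are invented by a made-up pricing rule (nothing here is real market data), there are only two features per house, and the model is plain linear regression trained by gradient descent rather than anything fancier. The function names (make_houses, train, and so on) are hypothetical, chosen only for this illustration.

```python
import random

def make_houses(n, seed):
    """Invent n houses as ((sqft in thousands, bathrooms), price in $1,000s)."""
    rng = random.Random(seed)
    houses = []
    for _ in range(n):
        sqft = rng.uniform(0.5, 4.0)
        baths = float(rng.randint(1, 4))
        # Made-up pricing rule plus noise; an assumption, not real market data.
        price = 50 + 120 * sqft + 15 * baths + rng.gauss(0, 10)
        houses.append(((sqft, baths), price))
    return houses

def predict(weights, features):
    """Linear model: a bias term plus one weight per feature."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def train(examples, steps=2000, rate=0.05):
    """Batch gradient descent on squared error: nudge every weight a little each step."""
    weights = [0.0] * (len(examples[0][0]) + 1)
    for _ in range(steps):
        grads = [0.0] * len(weights)
        for features, price in examples:
            error = predict(weights, features) - price
            grads[0] += error
            for i, x in enumerate(features):
                grads[i + 1] += error * x
        weights = [w - rate * g / len(examples) for w, g in zip(weights, grads)]
    return weights

def mean_abs_error(weights, examples):
    """Average distance between predicted and actual price, in $1,000s."""
    return sum(abs(predict(weights, f) - p) for f, p in examples) / len(examples)

training_set = make_houses(200, seed=1)   # houses whose sale price we know
unseen_houses = make_houses(200, seed=2)  # stand-ins for homes not yet on the market
weights = train(training_set)
print("error on training set ($1,000s):", round(mean_abs_error(weights, training_set), 1))
print("error on unseen houses ($1,000s):", round(mean_abs_error(weights, unseen_houses), 1))
```

The two printed numbers are the point: the first says how well the model reproduces the examples it learned from, and the second hints at whether it can be trusted on houses it has never seen.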
In the case of Robottke, we already had a very thorough “training set” of real blog posts by Kottke going back five years, with information like how often he linked to a given site, which keywords he liked to assign to his posts, and which bloggers and Twitter users he frequently turned to for ideas. Since we lacked any experience in machine learning, our algorithm was blissfully crude, assigning weights to these factors somewhat arbitrarily until the results looked reasonably accurate. (Essentially, we were doing the same thing a machine-learning algorithm does, just very badly.)
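To make “assigning weights somewhat arbitrarily” concrete, here is a minimal sketch of what a hand-tuned scorer of that general kind could look like. It is not the actual Robottke code: the factor names, the weights, and the threshold are all invented for illustration.

```python
# Hand-picked weights, chosen by eye rather than learned (all values invented).
HAND_WEIGHTS = {
    "site_link_rate": 0.5,      # hypothetical: how often the blogger links to this site
    "keyword_overlap": 0.3,     # hypothetical: share of his favorite keywords present
    "source_is_favorite": 0.2,  # hypothetical: 1.0 if a frequently cited source shared it
}
THRESHOLD = 0.5  # also a guess; tweak until the output "looks right"

def score(link):
    """Weighted sum of the factors for one candidate link."""
    return sum(weight * link[name] for name, weight in HAND_WEIGHTS.items())

def should_include(link):
    """Yes/no decision: does this link clear the hand-set bar?"""
    return score(link) >= THRESHOLD

candidate = {"site_link_rate": 0.9, "keyword_overlap": 0.4, "source_is_favorite": 0.0}
print(should_include(candidate))
```

A learning algorithm keeps the same weighted-sum shape but picks the numbers in HAND_WEIGHTS itself, by measuring its mistakes against the training set instead of relying on someone's eyeballing.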
Had I taken this class first, I might have been able to construct a neural network to take in all this information we had and truly algorithmize Kottke to the point that, every day, I could cast a wide net, feed Robottke thousands of links he might be interested in, and get back a “yes” or a “no” from the program about whether to include each one in Robottke’s daily feed. I emphasize “might” because it’s a lot easier to do this in the controlled environments that professor Ng provides us for the programming assignments, complete with helpful hints about which programming functions to use and benchmark values to make sure we’re on the right track. But in theory, this isn’t necessary. One of the beauties of machine learning is that it’s easy to tell whether you did a good job, because when you feed your original training data back into the finished product (the data for which you already know the correct answer, since it came from real life), it should predict the answer correctly at least 95 percent of the time. (Curiously, if it predicts the right answer 100 percent of the time, you should worry: it may mean your algorithm is too neatly tailored to the precise training examples, and therefore unequipped to handle data it’s never seen before.)
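The worry about a perfect score can be shown with a small Python sketch, again under stated assumptions: the links and their yes/no labels are invented, each link has just two made-up features, and a single sigmoid unit (logistic regression) stands in for a full neural network. It is compared with a “memorizer” that simply copies the answer of the nearest training example. All names here are hypothetical.

```python
import math
import random

def make_links(n, seed):
    """Invent n links as ((site_score, keyword_overlap), linked?) with noisy labels."""
    rng = random.Random(seed)
    links = []
    for _ in range(n):
        features = (rng.random(), rng.random())
        # Made-up rule: strong features usually mean "yes"; noise stands in for taste.
        label = 1 if sum(features) + rng.gauss(0, 0.3) > 1.0 else 0
        links.append((features, label))
    return links

def train_logistic(examples, steps=2000, rate=0.5):
    """Fit one sigmoid unit by gradient descent; returns a yes/no predictor."""
    weights = [0.0, 0.0, 0.0]  # bias, then one weight per feature
    for _ in range(steps):
        grads = [0.0, 0.0, 0.0]
        for (x1, x2), label in examples:
            z = weights[0] + weights[1] * x1 + weights[2] * x2
            error = 1.0 / (1.0 + math.exp(-z)) - label
            grads[0] += error
            grads[1] += error * x1
            grads[2] += error * x2
        weights = [w - rate * g / len(examples) for w, g in zip(weights, grads)]
    return lambda f: 1 if weights[0] + weights[1] * f[0] + weights[2] * f[1] > 0 else 0

def train_memorizer(examples):
    """Answer with the label of the closest training example (pure memorization)."""
    def predict(f):
        nearest = min(examples,
                      key=lambda ex: (ex[0][0] - f[0]) ** 2 + (ex[0][1] - f[1]) ** 2)
        return nearest[1]
    return predict

def accuracy(predict, examples):
    """Fraction of examples where the prediction matches the known answer."""
    return sum(predict(f) == label for f, label in examples) / len(examples)

training_set = make_links(200, seed=1)
unseen_links = make_links(500, seed=2)
for name, model in [("sigmoid unit", train_logistic(training_set)),
                    ("memorizer", train_memorizer(training_set))]:
    print(name, "| training:", accuracy(model, training_set),
          "| unseen:", accuracy(model, unseen_links))
```

The memorizer is guaranteed a perfect score on the training links, since every training link is its own nearest neighbor, yet that score says nothing about links it has never seen. That is why the training-set check is best paired with a batch of examples held back from training.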
The possibilities abound. Newspaper and magazine websites are always trying to figure out ways to make readers maximally likely to click on stories and then click on more stories. There is some empirical research going on in this field, like randomly serving readers one of two headlines to see which is more enticing, but a machine could do much better, and tell us something about what readers really want in the process.
Review Questions: 100/100 (programming exercise)
*Note: Both grades docked 20 percent since I missed the deadline, for which I received an automated email reprimand. I’ll get back on track, professor Ng!
Chris Wilson is a Slate contributor.