Open Access News

News from the open access movement


Wednesday, April 02, 2008

The power of more data

Anand Rajaraman, More data usually beats better algorithms, Datawocky, March 24, 2008. (Thanks to John Wilbanks, via Slashdot.)

I teach a class on Data Mining at Stanford. Students in my class are expected to do a project that does some non-trivial data mining. Many students opted to try their hand at the Netflix Challenge: to design a movie recommendations algorithm that does better than the one developed by Netflix. ...

Different student teams in my class adopted different approaches to the problem, using both published algorithms and novel ideas. Of these, the results from two of the teams illustrate a broader point. Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better?

Team B got much better results, close to the best results on the Netflix leaderboard!! I'm really happy for them, and they're going to tune their algorithm and take a crack at the grand prize. But the bigger point is, adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set. I'm often surprised that many people in the business, and even in academia, don't realize this. ...

The OA connection, from commenter "Plausible Accuracy" on Wilbanks' blog:
... This is a great example of how "mashups" ... can be used to sort of bootstrap the power of a dataset. In the case of the Stanford teams, the incorporation of data from an external source enabled them to improve their algorithm. In the case of Open Access science, the ability to better combine data from a variety of studies and fields will in turn lead to more discoveries.
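
To make the "mashup" idea concrete, here is a minimal Python sketch of how a very simple predictor can draw on a second, independent data source. It is not Team B's actual method, which the post does not describe. The ratings, the genre table, and the function names `user_mean` and `user_genre_mean` are all invented for illustration, and the genre table only stands in for external data such as IMDB genres.

```python
# Toy ratings as (user, movie, stars). Invented data, for illustration only.
RATINGS = [
    ("ann", "Alien", 5), ("ann", "Aliens", 5), ("ann", "Amelie", 2),
    ("bob", "Alien", 1), ("bob", "Amelie", 4), ("bob", "Chocolat", 5),
]

# Hypothetical external genre table, standing in for a second data source.
GENRES = {
    "Alien": "scifi", "Aliens": "scifi", "Blade Runner": "scifi",
    "Amelie": "romance", "Chocolat": "romance",
}


def user_mean(ratings, user):
    """Baseline: predict the user's average rating over all movies.

    Assumes the user has at least one rating.
    """
    stars = [s for u, _, s in ratings if u == user]
    return sum(stars) / len(stars)


def user_genre_mean(ratings, genres, user, movie):
    """Same simple idea, plus external data: average only over the
    user's ratings of movies that share the target movie's genre.
    Falls back to the baseline when the genre is unknown or the user
    has rated nothing in that genre.
    """
    target = genres.get(movie)
    if target is None:
        return user_mean(ratings, user)
    stars = [s for u, m, s in ratings
             if u == user and genres.get(m) == target]
    if not stars:
        return user_mean(ratings, user)
    return sum(stars) / len(stars)


if __name__ == "__main__":
    # Neither user has rated "Blade Runner"; only the genre table knows it.
    for user in ("ann", "bob"):
        print(user,
              user_mean(RATINGS, user),
              user_genre_mean(RATINGS, GENRES, user, "Blade Runner"))
```

On this toy data the baseline predicts 4.0 for ann, while the genre-aware version predicts 5.0 for ann and 1.0 for bob on the unseen science fiction title. The algorithm is no smarter; it simply has more to work with, which is the point of the quoted post.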