Of all of journalism’s recent evolutions, data-driven reporting is one of the most celebrated. But as much as we should toast data’s powers, we must acknowledge its cost: Assembling even a small dataset can require hours of tedious work, enough to deter the most disciplined journalists and their editors.
Fortunately, there’s an affordable -- and amazing -- tool that can make the impossible easy: Amazon’s Mechanical Turk (mTurk).
For those unfamiliar with Mechanical Turk, it’s an online marketplace, set up by the shopping site Amazon, where anyone can hire workers to complete short, simple tasks such as transcribing interviews, copying data from thousands of charts, and even sorting through satellite images in hopes of locating missing people. Amazon originally developed it as an in-house tool and commercialized it in 2005. The mTurk workforce now numbers more than 100,000 workers in 200 countries.
At the urging of Panos Ipeirotis, a professor at New York University’s Stern School of Business, we began experimenting with mTurk last spring to clean, de-duplicate and reformat data. We’ve since used the tool to collect or proof more than 28,000 data points, from the names of companies that received stimulus money to the categorization of answers to our home loan modification questionnaire. We’re impressed with the speed and accuracy of its results. For example, a project we estimated would take a full-time staffer almost three days to finish was completed on mTurk overnight for $37, with 99 percent accuracy.
Mechanical Turk has proven to be more than a shortcut. It has freed up staff time for more complicated work. We’ve also used it to retrieve data from government databases that prohibit scraping.
We’ve summed up our knowledge of the tool and lessons learned in this guide, “ProPublica’s Guide to Mechanical Turk.” A lot of credit also goes to Professor Ipeirotis, who has answered many of our questions and reviewed our initial mTurk projects.
Got questions? Send them our way or post them below. Using mTurk in your data-driven journalism projects, or have some mTurk expertise to share? Compare notes in the comments below.