Peggy Bustamante is a news app developer with Digital First Media’s Data Team, and was the P5 Resident at ProPublica in January. She spent her time working with ProPublica News App Editor Scott Klein on mapping ProPublica's tech setup. Scott wrote a blog post about ProPublica's setup, and Peggy wrote this post, about alternative scenarios and DFM's own approach.
The first thing you want to do when you join a news apps team is build cool projects.
But if you are a new outfit, as we are at Digital First Media’s Thunderdome data team, there is one big step before you can get to that happy place of creativity and news: you have to set up an environment to build those cool projects.
My visit to ProPublica as a P5 fellow last month coincided with the crucial moment when our team was deciding how that development environment would be configured. On my first day, Scott Klein and I discussed ProPublica’s setup and what some other news orgs had chosen. That conversation spawned a rather hearty discussion on NICAR-L. When the dust settled, we found some commonality among news devs and a variety of viable options.
Almost without exception, news apps teams are on cloud servers, not confined to building projects within the constraints of their news organization’s CMS. All agree that their work could not be done without a separate development environment.
While news app teams use a wide variety of languages, most use either Python/Django or Ruby on Rails. ProPublica and the New York Times use Rails. NPR and the L.A. Times use Python/Django. DFM’s Data Team is starting with PHP because our developers have extensive experience building projects with it. In the coming months, we will be transitioning to Python/Django. We also use JavaScript and JSON feeds to build a variety of data-driven projects, such as our NFL playoff predictor.
ProPublica’s server setup, where the discussion began, is, I imagine, fairly standard, and it is one that we as a team at DFM had been considering. It consists of a Varnish server in front of two development servers that are connected to database servers. ProPublica uses Amazon Web Services and Amazon’s Relational Database Service. They also have a second server for PostgreSQL because RDS does not support PostgreSQL. Lots more details on ProPublica's setup are over at Scott's post.
The Heroku Option
In the NICAR-L discussion, some recommended Heroku, especially Chase Davis, formerly of the Center for Investigative Reporting, and Ryan McNeil at Thomson Reuters. Chase Davis' blog post on Heroku is quite useful. Although originally only for Ruby on Rails, Heroku added Python support in late 2011.
One big advantage of Heroku is that there is no server configuration, as there is with AWS, and it’s easy to increase and decrease server capacity depending on a news app’s traffic.
On the downside, although deploying news apps is greatly simplified, Heroku can get expensive fairly quickly. It was generally agreed that Heroku is a great solution for smaller apps, but not for large-scale apps. A large part of the expense comes from using Heroku’s databases. CIR avoided that pitfall by using Heroku servers and putting databases on their Amazon EC2 box.
Heroku's costs can also be kept down by keeping the amount of data a news app uses to a minimum and putting in a little more effort with caching and configuration. That would allow an app with modest traffic to fit within the free account for a while.
But for news apps teams shooting for more complex apps with larger databases, another solution is...
Cooking Up Apps By “Baking”
The “baking” option, which is gaining ground, involves outputting or “baking” all the possible pages of a project to flat HTML files and then serving them from Amazon S3. The L.A. Times and NPR use this technique extensively. The advantages of this approach are that it is extremely inexpensive, and spikes in traffic aren’t a problem. And because there is no user input, security is much less of an issue.
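To make the idea concrete, here is a minimal sketch of what "baking" can look like in Python, using only the standard library. It is not how the L.A. Times or NPR do it; the `bake` function, the `PAGE` template and the record fields (`slug`, `title`, `body`) are hypothetical names chosen for illustration. The sketch only writes the flat files to a local folder; uploading that folder to S3 would be a separate step that is not shown here.

```python
import html
from pathlib import Path
from string import Template

# Hypothetical page template; a real project would use its framework's templates.
PAGE = Template(
    "<!doctype html>\n"
    "<html><head><title>$title</title></head>\n"
    "<body><h1>$title</h1><p>$body</p></body></html>\n"
)


def bake(records, out_dir):
    """Write one flat HTML file per record and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for rec in records:
        # Escape the data so stray markup in a record cannot break the page.
        page = PAGE.substitute(
            title=html.escape(rec["title"]),
            body=html.escape(rec["body"]),
        )
        path = out / f"{rec['slug']}.html"
        path.write_text(page, encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    # Made-up sample records, standing in for rows pulled from a database.
    sample = [
        {"slug": "team-a", "title": "Team A", "body": "Season summary goes here."},
        {"slug": "team-b", "title": "Team B", "body": "Season summary goes here."},
    ]
    for p in bake(sample, "build"):
        print("baked", p)
```

Once the files exist, the web server never touches the database again, which is where the cost and traffic benefits come from.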
Scott Klein, however, points out some of the disadvantages of “baking”:
Just to play devil's advocate, if your app doesn't have search or store user input, the whole thing will be in Varnish's cache (or your favorite framework's page cache) almost immediately anyway, so the resilience and performance benefits of baking out might be less than you think. And if you've got an app with lots of possible end points (say, millions of doctor payments or a national slippy map with 16 levels of zoom) the complexity tradeoff in baking out pages goes under water pretty fast. When you find an error in your map you don't want to wait three hours while your bake-out finishes to fix it.
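A quick back-of-the-envelope calculation shows why the slippy map case gets out of hand. The sketch below assumes the common tiling scheme in which zoom level z covers the world with 4^z tiles; Scott's comment does not say which scheme he had in mind, so treat the function name `total_tiles` and the numbers as illustrative.

```python
def total_tiles(zoom_levels):
    """Count tiles across zoom levels 0..zoom_levels-1, assuming 4**z tiles at level z."""
    return sum(4 ** z for z in range(zoom_levels))


if __name__ == "__main__":
    # 16 levels of zoom means levels 0 through 15.
    print(total_tiles(16))  # 1431655765
```

Under that assumption, a national map would cover only part of the world grid, but the total still grows fourfold with every added zoom level.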
The current rule seems to be whenever you can, bake it flat. Up to 80 percent of the projects at NPR are served this way.
What DFM’s Data Team Ended Up With
After much discussion and consideration, the DFM Data Team settled on our own version of all of the above.
As with other news apps teams, our development environment is separate from the CMS. We are also in “The Cloud,” albeit an internal cloud here at Digital First Media, which makes sense for us because the company has dozens of news entities that we serve.
Our configuration has load balancers in front of the database and production servers, which are fed by an identical staging server for testing and sharing works in progress, and a separate development server.
We have also decided to “bake” projects whenever possible, which will be even easier as we move to Python and can explore Ben Welsh’s delicious Django Bakery.
And now we can build cool projects.