Book: Agile Data Science 2.0
Author
Russell Jurney
Summary
Instructions for a technical setup to iteratively develop practical Data Science applications.
Takeaways
Many Data Science applications fail because of a missing feedback loop between the Data Scientists developing the solutions and the business stakeholders and users. To avoid this disconnect, Data Scientists need to share work in progress frequently. Software development methodologies like Scrum need to be adapted to account for the greater uncertainty of data exploration.
An Agile Data Science process needs to leave room for experimentation and variable goals. Instead of providing the ship date of a predetermined artifact, an Agile Data Science team should produce working software that describes the state of exploration (“What will we ship, when?” instead of “When will we ship?”).
Quotes
“A researcher who is eight persons away from customers is unlikely to solve relevant problems and more likely to solve arcane problems.”
“Several changes in particular make a return to agility possible: Choosing generalists over specialists. Preferring small teams over large teams. Using high-level tools and platforms: cloud computing, distributed systems, and platforms as a service (PaaS). Continuous and iterative sharing of intermediate work, even when that work may be incomplete.”
“One thing we require is that every level of the stack must be horizontally scalable. Adding another machine to a cluster is greatly preferable to upgrading expensive, proprietary hardware. If you have to rewrite your predictive model’s implementation in order to deploy it, you aren’t being very agile.”
“We will only explore a simple heuristic-based approach, because it turns out that in this case that is simply good enough. Don’t allow your curiosity to distract you into employing machine learning and statistical techniques whenever you can. Get curious about results, instead.”