The other day, after having secured a comfy corner chair in my favorite cafe, I had an exciting idea: What if you applied the agile programming methodology to the burgeoning field of data science?
What had occurred to me was that none of the data science projects I had been involved in over the past few years had started out with a solidly engineered, normalized, and referentially constrained database schema to store the data, together with an algorithm of equally thorough design to analyze them. Rather, there had always been a simple, functional prototype at the outset: a spreadsheet with macros from a domain expert, say, or a CSV file and a script from an IT expert. As the project evolved over many iterations, often with a growing team in which domain and IT experts joined forces, the data set and the analysis algorithm grew more sophisticated, and so did the tools and technologies applied in the process.
To conceptualize this workflow in terms of the agile software development framework, with its focus on working code and iterative change in cross-functional teams, seemed like a very appealing idea to me; in fact, I realized that, having worked in agile developer teams for years, I had intuitively been managing my data science projects the “agile way”.
It did not take long to find out that I was not the first to have this idea. On the contrary, “agile data” actually seems to be quite hip these days, with a dedicated website and a whole book devoted to agile data science. Fascinated, I embarked on a survey of the available web resources on this topic; I was particularly curious to learn which tools other practitioners of agile data science are using.
Not surprisingly, most of the agile data science folks seem to follow the established best practices of the broader data science field and subscribe to the Hadoop ecosystem. However, I was pleased to see that a number of agile data scientists have, like myself, put KNIME, an open-source, workflow-based data analytics platform, at the center of their data analysis environment (see this article for a good introduction).
In my experience, KNIME is uniquely positioned for large data science projects because it allows all members of the team, IT and domain experts alike, to work together within the same environment. The IT experts can provide access to structured data through the database connector or, preferably, a web service[1], and the domain experts can choose from a huge collection of data manipulation, analytics, and visualization nodes to generate the results they are looking for. There is also the option to delegate more complex or computationally expensive processing steps to R or Python, the two languages most widely used by data scientists. Finally, KNIME offers a number of ways to scale out and cope with large data volumes, including a (commercial) connector to Hadoop/HDFS file systems through Hive.
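To make the hand-off between IT and domain experts a little more concrete, here is a minimal sketch of what the web-service route might look like on the Python side, for example inside a KNIME Python scripting node or in a standalone script. The endpoint URL and the column names are purely illustrative assumptions, not taken from any real project.

```python
# Illustrative sketch only: a hypothetical REST endpoint exposed by the IT side,
# consumed on the analysis side with requests and pandas. The URL and the
# column names ("sample_group", "value") are made up for this example.
import pandas as pd
import requests

API_URL = "https://data.example.org/api/v1/measurements"  # hypothetical web service

def fetch_measurements(url: str = API_URL) -> pd.DataFrame:
    """Fetch JSON records from the web service and return them as a DataFrame."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early if the service reports an error
    return pd.DataFrame(response.json())

if __name__ == "__main__":
    df = fetch_measurements()
    # A typical domain-expert step: aggregate a numeric value per category.
    print(df.groupby("sample_group")["value"].mean())
```

Inside KNIME, these few lines could live in a Python scripting node, with the resulting table handed on to the downstream manipulation and visualization nodes; outside KNIME, the same script serves as a quick prototype.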
It may seem almost like a contradiction of the spirit of the agile method to argue that all members of an agile data science team should use one particular tool; after all, the Agile Manifesto tells us to value “Individuals and interactions over processes and tools”. However, KNIME is not so much a tool as an integration platform that gives you enormous freedom in how you get the job done while making it transparent to everybody on the team what happens to the data at every step in the workflow. It is this combination of flexibility and transparency that makes KNIME such a great platform for agile data science projects.
To conclude, it was very revealing to reframe my thinking about data science projects in terms of the agile programming paradigm, and I am eager to apply these insights to my next KNIME project.
- Also see this blog post on using KNIME and REST for Life Sciences Discovery Informatics applications.