Updated 2/13/2015: Revised the title because this post was getting too many hits for the wrong reasons. When I wrote it, I was thinking about a science of data content that I later called dedomenology (naturalist of the datum) to distinguish it from the sub-discipline of computer science called data science. Data science in the popular sense, as a discipline of computer science, benefits from standard scrum like any other computer science project. It is the interpretation of the information content of the data that doesn’t fit well with agile thinking, because the issues never really get resolved.
In my most recent assignment, our project was primarily about understanding the available data and how to improve its handling over the course of the project. To do this we committed to agile practices: regular sprint intervals using a scrum life cycle of a planning meeting, daily scrum meetings, a demonstration to stakeholders, and a retrospective.
The process did make progress across the iterations. The products were primarily documentation of data and plans for a future software development effort. The stakeholders agreed we were making reasonable progress. And yet I didn’t think the scrum model was appropriate at all.
For example, although new documentation was added each scrum cycle, each documentation product was incomplete and needed more research. It is conceivable that the missing information could be placed on the product backlog to be tackled in future sprint cycles. But the demonstration and planning stages added little value in refining that kind of backlog. We knew from the start what needed to be documented, and nothing in the demonstration and planning exposed new requirements. All that demonstration and planning did was inform the product stakeholders of what still needed to occur.
Keeping stakeholders informed of progress is not unique to sprints. A periodic review accomplishes the same goal. Without the overhead of scrum techniques, such a review could quickly compare progress against the already-known list of what was needed. This fits well with the older so-called waterfall approach.
Scrum practices added unnecessary overhead. The biggest burden of scrum came from the simple difficulty of scheduling the meetings so that everyone could attend, ideally in person.
In ideal scrum teams, the team is collocated in the same area, often seated in adjacent workspaces. Ideally the team is fully devoted to the sprint tasks and shielded from any distractions. In this scenario, scheduling the daily scrum should be a non-issue. This is a reasonably common scenario for software development teams.
Data science may involve some software activities, but it also involves a lot of research of the form of getting out of your seat and seeking out information. The very nature of that research requires more ad-hoc teaming arrangements, where certain specialists participate only part time and from different areas. In this context, merely scheduling a daily scrum becomes a challenge.
But more importantly, there was not really much to cover during the daily scrum. Research doesn’t have clearly defined daily milestones that can be tackled and set aside. Most of the meetings were repeated assertions that progress continued to be made. We did bring impediments, but the impediments were not news to anyone, and no one on the team was in a position to solve them. For example, one impediment was that a particular effort was waiting for the policy makers to publish a document that formally set a policy. That is interesting to discuss, perhaps once. Unlike software sprints, where a task may be placed back on the backlog until all resources are available, this one required some continuous level of participation in meetings concerning the progress of that external document.
That last example illustrates the problem of pretending to turn data science tasks into sprint tasks. There was a real task: get firm information about a particular policy. That task required continuous participation to stay aware of the progress of the external activity. This well-defined and easily understood task simply doesn’t fit in a sprint cycle. It will span many sprint cycles, and it may not align neatly with sprint boundaries. For example, the long-awaited document may finally be published in the middle of a sprint that did not anticipate this development during its planning phase. The sprint concept encouraged being blind to such unexpected opportunities until the next scheduled planning phase for a future sprint.
There is more flexibility in the management of sprints than I am implying. I am suggesting that the process would have run more naturally if we didn’t have to organize our activities into fixed time boxes of sprints. The old Gantt-chart project management approach showing asynchronous start and stop times for different efforts would work fine. We were in effect trying to cut the Gantt-chart vertically so as to fit into time-boxes that served as some sort of shipping containers. This can be done, but it didn’t seem to be productive.
Another problem of the sprint approach was imposing some concept of a finished product at the end of each sprint. The ideal is that at the end of a sprint, the product owner has something complete that the owner can use immediately. In the context of researching data issues, this amounted to composing a self-standing document of what was currently understood and then closing that document so that it could be used, even though everyone recognized it was highly unlikely to be useful.
Normally we would review a document that is a work in progress, with no pretense that it is in a form that could in some sense be sold by the product owner. Instead, we attempted to make a kind of appendix document that would eventually be combined with other appendixes into a final document somewhere. It was difficult to even organize a sprint time-box of progress into a final version of an appendix. There always remained some details that needed more research. The older pre-scrum concept of a progress report would have served the same purpose without the composition burden of making a document appear like a product to place in a catalog.
We ended up with multiple copies of the same document stored in a repository, in the same fashion as code repositories. The older version would be checked in at the end of its sprint, and a new copy would be checked out for modification in the new sprint. This uses software practices to manage documents, and there is merit in being able to go back and see earlier versions unmodified since that earlier time. However, software projects get an additional benefit: the ability to roll back to an earlier version of the design in case a later development needs to be abandoned. This type of scenario doesn’t really occur in research. There is little productivity to gain by being able to roll back to an earlier understanding of the document. We are documenting human understanding of a topic. We may have made some recent efforts that need to be abandoned, but we inevitably made some progress in better understanding the issues.
With software, reverting to an earlier version discards everything learned since then. It is an emulation of time travel back to a point that pretends something never really happened. This makes sense for software, but less sense for analysis or research. Editing the current document to incorporate recent lessons learned may be more efficient than finding an old version of the document, checking it out, and then revising its structure to somehow capture the new lesson.
The successive sprints began to accumulate documents that were artificially created to give some sense of a sprint product iteration. Again, the sprint ideal is that the product at the end of the sprint is a releasable product. The documents were meant to be publishable, not merely hidden away in a document-change-control repository. The documents had the appearance of being complete, and they would circulate. This caused more disruption as newer information completely changed some earlier conclusion. I think this circulation would not have occurred, or certainly would have been more discouraged, if there were no attempt to package a time-box of learning into a demonstrable product.
Working with large data projects implicitly means working with a great number of different groups who are responsible for different aspects of the data. Even for a particular type of data, there are different groups for defining the data, for measuring it, for collecting it, and for distributing it. The effort to fit this type of problem into time-boxes immediately confronts a major impediment: scheduling meetings to get all of the relevant people to be available at the same place and the same time to cover some issue. Big data projects don’t have the luxury of dedicating these participants to isolated scrum teams whose schedules can be easily controlled or predicted. Sometimes the scrum team can arrange an essential meeting to occur when they need it to occur. But more often, the essential meeting occurs unexpectedly in the context of what was committed for a sprint time-box.
Because of the unique opportunities of these events, we must make immediate changes in our plans to attend. We cannot insist that the meeting be postponed until a future sprint cycle. Attending suddenly announced meetings is contrary to the spirit of scrum, where the team works in isolation with a scrum master acting as a gatekeeper to guard against external distractions. In data science projects, progress occurs during events that would be considered distractions within a sprint. An undesired distraction in terms of sprint task activities is a highly desired milestone for a data science activity. Again, the older waterfall or Gantt-style project management approach would have made more sense.
We tried to revise the process to incorporate agile software development concepts with the hopes of enjoying some of the same benefits. We congratulated ourselves for our innovations of seeing different ways to organize research into sprints. Part of our mission was to adapt to newer techniques. We were adapting.
In terms of productivity, however, we experienced more burden than benefit. There is a place for agile techniques especially in the area of application of knowledge and available technologies. There also remains a place for older practices where the focus is on accumulating knowledge and technologies. In my mind the difference is between software science and data science. Agile for software, waterfall approaches for data.
I note that there is an argument for a hybrid model playfully named water-scrum-fall. In fact, we followed a hybrid approach by maintaining two separate backlog lists: a product backlog and a sprint backlog.
We had a product backlog of essential capabilities that would not be ready for a long time. We managed the product backlog using older project management techniques, as would be expected in the older waterfall approach.
We had a sprint backlog of specific, well-defined tasks that could reasonably be expected to be completed in a single sprint time-box. Once completed, such tasks would never have to be revisited in any future sprint. Each sprint task tied back to some product task, but the product tasks did not have an exhaustive list of sprint tasks. The sprint review period revisited the product tasks to identify new sprint tasks where we found areas lacking.
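The asymmetry of the two backlogs can be sketched in code. This is a hypothetical illustration, not a tool we used; all class, field, and task names here are invented for the example. The point it captures is that product tasks are long-horizon items with no exhaustive decomposition up front, while sprint tasks are short-lived and merely reference a product task; the review step regenerates the gaps rather than consulting a pre-built list.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ProductTask:
    """Long-horizon capability tracked waterfall-style; not finished in one sprint."""
    name: str
    complete: bool = False

@dataclass
class SprintTask:
    """Well-defined task expected to finish within a single sprint time-box."""
    name: str
    parent: ProductTask  # every sprint task ties back to some product task
    done: bool = False

# Product backlog: managed with older project-management techniques.
product_backlog: List[ProductTask] = [
    ProductTask("Document data collection pipeline"),
    ProductTask("Catalog policy constraints on data use"),
]

# Sprint planning pulls a few concrete tasks that reference product tasks.
sprint_backlog: List[SprintTask] = [
    SprintTask("Interview collection team", product_backlog[0]),
    SprintTask("Summarize draft policy memo", product_backlog[1]),
]

def sprint_review(sprint: List[SprintTask]) -> List[ProductTask]:
    """Return incomplete product tasks untouched by finished sprint work,
    i.e. the areas found lacking that need new sprint tasks."""
    covered = {task.parent.name for task in sprint if task.done}
    return [p for p in product_backlog if not p.complete and p.name not in covered]
```

Note that there is no list of sprint tasks stored on a `ProductTask`: the decomposition is never exhaustive, which is exactly what made communicating status against the product backlog confusing.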
This hybrid approach was workable but it caused a lot of unnecessary confusion in communicating objectives both to the scrum team for the current sprint and to the product owners for the overall product.
It would have made more sense to divide the project into a software project following a more pure scrum approach, and a data-science project following a more pure waterfall approach. First, we need to agree that there is a distinct difference between the two endeavors of software development and data science.