How I would define data science

One of the more memorable quotes from an acquaintance came from someone who was managing a project involving science (wetlands preservation modeling, if I recall correctly). His project was almost entirely about programming software simulations. He explained that he expected his staff to know how to figure out software just as he expected them to know how to drive a car or dress themselves in the morning. Software may take a little more preparation, but listing such a skill on a resume was about as informative as listing one’s wardrobe.

To be fair, this was decades ago, when FORTRAN still held considerable ground. We connected because of our shared interest in that language. Perhaps his thinking would be different today, given the incredible breadth of programming languages and associated libraries. Today, there appears to be great marketability in having a decade of experience in a particular library for a particular language. To me, it is like boasting of a decade of experience in dutifully showing up to work in appropriate attire.

The point, however, was not that software was easy. We agreed software was not easy. The software had to run quickly with limited resources. The software needed to earn confidence that it could correctly perform extensive multiple-factor simulations. The staff using this software put their personal reputations as scientists behind the results of the simulations when making presentations to the client.

I guess the biggest difference between then and now is that at that time the client expected actionable results. The client did not want to see the software, and for good reason: there really wasn’t anything to see. The user interface was a few command-line statements that identified the names of files holding various model and statistical data. Report-authoring tools were available to produce presentation-quality outputs, but each report involved substantial manual labor to set up. The client had as little interest in watching that activity as he had in the analysis software.

Today the focus is entirely on the software itself. The goal is for the user to run the software from start to finish. The modern project emphasis is on presenting the best interface for enabling that user. The recent agile software trends focus primarily on the software that everyone can see rather than the parts that only the software writers will see. There are semi-monthly or monthly milestones where the user-interface advances are demonstrated to the product stakeholders. To be sure, the demonstrations show functionality other than mere user input and presentation, but much more effort is placed in the user interface than in the functionality.

In the first example, I described scientific modeling projects where the focus was on mathematical or statistical models of nature. The emphasis was on getting those models correctly programmed and ensuring they were used appropriately for the available data and the intended results. In effect, the staff would get no credit for the software itself beyond the fact that it had to work in order for them to complete the job.

I look at data science in a similar way. I think of data as a subject of scientific study, very much like the above-mentioned environmental sciences. Data is something that can be studied and scrutinized. In earlier posts, I proposed a way to put data into different categories in terms of what we can trust about the data and what we need to be cautious about. I distinguish observation data from model-generated data. I distinguish relevant data from irrelevant data.

To me, data science is this effort of getting to know the data: where it comes from, how it got to me, what it is telling me, what could go wrong, and how it relates to other data. I suggest that this type of attention is required throughout the project life-cycle. Obviously, data science is needed during the design, development, and test phases. I have been emphasizing the less obvious need for continued significant investment in data science during the operational phase.

In some sense, data is like nature itself.   Returning to the analogy to the modeling of the environment of a body of water, there is always the possibility of the unexpected introduction of a new factor or consideration.  Perhaps a landslide introduces new minerals or pollutants.    This can happen at any time and may not have been anticipated in the original design.   Again, the analyst (who happens to write software) puts his reputation on the line that his presented results are relevant and up to date with reality.

I don’t see this emphasis as much in today’s projects. I guess one reason is that we have separated the discipline of software from that of analysis. Analysts are software users who have no need to know software. Mysterious to me is the recent reputation that software has become so delicate that it needs highly trained hands to do correctly. The agile process places the software developers on one side, emerging from their sprints to demonstrate to the analysts the new functionality that they have meticulously captured in fresh software. The analysts in turn scrutinize the ease of use of the interfaces in terms of their jobs.

Sometimes I wonder if the actual utility of the software falls between the cracks. Is the correctness of the data the responsibility of the analysts or of the developers? Sometimes I wonder if it is even really discussed in those terms. Software developers demonstrate something new, and analysts see that they can use that new feature.

My use of the term data science may place me in the minority. More popularly, data science emphasizes software, as if the challenge were entirely in learning the software. It also emphasizes years of experience with specific software. Because the software is designed to be learned and used quickly, I guess the experience is taken as an indication that the data scientist has actually engaged in the science of getting to know the data. In my opinion, it is just as likely that the person has used the software for years in pretty much the same way as he did the first day he learned it.

Being able to scrutinize data is a science and practice of its own. Data science is a distinct skill from using tools. It is impossible to scrutinize data at that level without being able to use those tools. But using those tools doesn’t say anything about the user’s skills in diligent data science.

In short, I share the old-fashioned FORTRAN project philosophy. I treat the ability to use tools as equivalent to the ability to get dressed for work each morning. The real question is whether one can do his job to the point of putting his personal reputation on the line that the presented results are unambiguously representative of the real world, with enough confidence that a decision maker can act on this information. That’s not a sprint product-demo activity.

