Data as tomatoes

Tomatoes are a very versatile fruit.  When one is first ripe on the vine, we can pick it and eat it whole, slice it for sandwiches, or cut it up to add to salads.  A fresh tomato at the peak of ripeness is a treat.

When we pick out loose tomatoes in a produce section, we intend to use them the same way we would use one fresh from the vine.  But it is a little different.  For a tomato to be in the produce section means that it was picked before it was fully ripe and ripened during the time it took to reach the store.  It is still usable as a fresh tomato but has a slightly different quality.

A fresh or loose tomato has a very short shelf life.  There is only a short period in which you can use it for the purposes that call for a fresh tomato.  If we don’t use the tomato in that time, we seek out a fresher tomato instead of using the old one.

A tomato may be preserved.  It can be canned.  Canned tomatoes have a much longer shelf life.  They have their own uses and in some recipes may be preferable to fresh tomatoes.  On the other hand, a canned tomato is not a substitute for a fresh one.  Personally, at least, I would never consider slicing a canned tomato for a sandwich.  I may have tried it once, but it is certainly not an experience I want to repeat.  Preserving the tomato changes it into a different kind of commodity, useful in its own right but not a real replacement for what it once was.

When I was growing up, we had a large garden that produced large numbers of tomatoes.  Really too many to eat, and even too many to can.  You might say we had a big tomato crisis: too many tomatoes.  We turned to big tomato solutions.

We boiled down large numbers of tomatoes into sauces.  When canned, tomato sauces have a very long shelf life.  Tomato sauces are very useful for a variety of recipes and superior to fresh tomatoes in that, straight from the can, you don’t have to spend a lot of time cooking the tomatoes down to the desired consistency.  You are free to argue that cooking down fresh tomatoes could result in a better flavor, but canned sauces are certainly faster to use.

A small container of sauce can represent what was originally a large number of tomatoes.  The tomatoes are in some sense preserved for a longer period, and the product has its own uses.  One use we don’t expect from sauce is to reconstitute one of the original tomatoes.  Even if it were possible, I can’t imagine the result coming out anything like a canned tomato, let alone a fresh one.  We don’t bother because the sauce is valuable in its own right.

Back to the big tomato problem and our ultimate solution.  We boiled down the tomatoes into ketchup.  I’m too lazy right now to look up the reduction ratio involved, but to this day I remember how disappointed we were at not filling all the jars we had hoped to fill after boiling down a huge pile of tomatoes.  Ketchup is a unique and valued treat.  It represents a large harvest of tomatoes, but its use is primarily for what only ketchup can do.

We never went to the next step to make tomato paste.

I mention this as an analogy for how I view data.  Although data doesn’t physically transform over time, its value and use do depend on its age.  Fresh data is used in some transaction or process.  The data may persist after that event, but fresher data is needed for a future transaction.  This is like my fresh tomato: enjoy its freshness in the short time that it exists as fresh.

Data can be archived, sometimes for a long time.  Archived data is not a replacement for fresh data, but it is used for other purposes.  I compare it to the canned tomatoes.  A canned tomato does preserve some of the integrity of the tomato, so there is at least some promise of using it just like a fresh tomato, but it takes some imagination to be satisfied with the substitution.  Likewise, we use archived data for purposes for which we would not use fresh data.

Usually data is reduced into the equivalent of a sauce, or mapped and reduced into an equivalent of ketchup.   At this point the data is useful for entirely different purposes than the original data.  Usually the reduction retains some statistical property of the original data, but this is analogous to the tomato sauce retaining the pulp part of all of the original tomatoes.   It serves a different purpose entirely from what the fresh data was used for.
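To make the sauce point concrete, here is a minimal sketch in Python.  The transaction amounts and the `reduce_to_summary` helper are made up for illustration; they only show that a reduction keeps a few statistics while losing the individual records.

```python
from statistics import mean

# Hypothetical "fresh" records: individual transaction amounts.
transactions = [12.50, 3.75, 40.00, 7.25, 19.00]


def reduce_to_summary(values):
    """Boil individual records down to a few aggregate statistics."""
    return {
        "count": len(values),
        "total": sum(values),
        "mean": mean(values),
    }


summary = reduce_to_summary(transactions)
print(summary)

# A completely different set of records reduces to the same summary,
# so the summary alone cannot tell us what the originals were.
other = [16.50, 16.50, 16.50, 16.50, 16.50]
print(reduce_to_summary(other) == summary)
```

Two very different piles of tomatoes give the same jar of sauce.  The summary is useful for asking about totals and averages, but no amount of effort turns it back into any one of the original transactions.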

In some ways it is unfortunate that data doesn’t have a physical manifestation, so that we could see it in different containers and see physical hints about its age.  It is very tempting to see archived data as a substitute for fresh data, especially when the archived data is more convenient.

In the analogy of the tomato, it is easier to reach for the rotten tomato on the window sill than it is to walk out and pick a fresh one.  But the circumstances motivate us to go to the extra effort.  Data doesn’t show its rot.
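One way to give data a visible sign of its age is to keep a timestamp with each value and check it before use.  The sketch below is only an illustration under my own assumptions: the `Reading` type, the `is_fresh` helper, the one-hour limit, and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Reading:
    value: float
    recorded_at: datetime  # when the observation was actually made


def is_fresh(reading, max_age, now=None):
    """Return True if the reading is no older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - reading.recorded_at <= max_age


# An arbitrary fixed "current time" so the example is repeatable.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
recent = Reading(21.5, now - timedelta(minutes=5))
stale = Reading(21.5, now - timedelta(days=30))

# The two values are identical; only the timestamp reveals the rot.
print(is_fresh(recent, timedelta(hours=1), now=now))
print(is_fresh(stale, timedelta(hours=1), now=now))
```

The value alone looks the same whether it is five minutes or thirty days old.  How old is too old depends entirely on the purpose, which is why the limit is a parameter and not a property of the data.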

When it comes to data, the freshness of the data relates to its relevance to individual transactions, and in particular individual actions.  Even data representing messages from individuals is relevant only as an indicator of the individual’s thinking at the time the message is transmitted.  This is obvious in the case of an oral presentation or debate: the actual content is specific to that event.  A recording of the oral argument may have some use as entertainment or education at a later time, but that recording has lost its relevance for characterizing the speaker’s views or the speaker’s mind.  A lot of electronic communication, including voice and text messages, is actually oral communication transmitted by a different method.

We should reject the idea of using an old recording as a substitute for talking to the person directly.  A recording has a different kind of purpose that may be useful, but the one thing it has lost is its freshness.  When you want to know about a person, there is no substitute for getting to know them right now.

I guess one of the areas where this annoys me the most is the use of recorded oral-equivalent data to characterize the views of a political candidate.  Lately, this takes the form of discovering old e-mails, or (increasingly) old tweets.  These are equivalent to oral messages that only have meaning about a particular conversation, not about a person.  While it is fair to use carefully composed works such as books, published papers, or official documents, it is far less fair to use more ephemeral messages from the distant past to characterize a person.  It is not possible to reconstruct the context of that information and relate that context to the current situation.  Unfortunately, it is exceedingly tempting to do just that.  It is an especially popular tool for political attacks, often with undeserved success.

Politics is an analogy to replace my tomato analogy.  My bigger concern is with similar abuses elsewhere.

Big data is very useful, especially when it is mapped and reduced.  But all too often, it is used instead as a lazy alternative to obtaining fresh data.

Addendum: I couldn’t resist going back to my tomato analogy.  Over-aged tomatoes are sometimes used to throw at people you don’t like.  Now we throw over-aged data.


