In my last post, I observed that the emphasis on velocity in big data is related to the project of exploiting big data as a feedback loop for interacting with the current world. In this sense, the big data project is an operational system that uses data in real time. Unlike more classical operational systems, which specify requirements for sensors in order to meet their objectives, big data uses available data from pre-existing sources, often reusing data that was meant for another purpose.
Modern technology makes possible the rapid retrieval and processing of these remotely relevant data sources for use in a different operational project. My observation was that this operational objective becomes more difficult if we demand that the data sources be trustworthy witnesses of facts in the real world. Velocity and Veracity are competitors.
The classic approach to developing operational systems resolved the conflict between velocity and veracity by specifying requirements that data sources had to satisfy in order to meet the objectives of the operational system. Once the requirements were identified, the data sources were designed or selected to meet those requirements. The design process might involve some iteration to redesign the operational system around available technologies, but this would be accompanied by compensatory changes in the design or in the objectives. When the final system became operational, the required accuracy of the data sources was assured to be compatible with the system's objectives.
In practice, this may involve some self-checking, automatic diagnostics, and logging to assure that the quality of the sensors continues to meet their requirements. With the classical approach to an operational system, if a sensor begins to fall short of its requirements, there is an action to correct the sensor.
In contrast, the big data project uses remote sensors that the operational system does not directly control. These remote data sources often have no established service-level agreement guaranteeing that the data will continue to meet the needs of the big-data-based operational system. The remote data sources can change behavior or content without prior notice. The data sources may even change to the extent of no longer being relevant at all.
The variety of big data refers to the diverse types of data sources. Typically these data sources were originally designed or justified for unrelated projects, and they can have different definitions of their measurements.
When I started writing this post, I was thinking about a recent experience using Google Street View to see how my property appeared. By chance, the picture was taken in the fall immediately after the leaves had fallen from the trees, and more importantly, it was taken during the few months my house was being renovated. Prominently shown in front of the house was the contractor's large trailer. That's how my house appeared in Street View for some five years.
The definition of a street view was the view at one specific time. The picture has since been updated, but even that picture is two years old. Street view means something different to Google than it does to someone walking through the neighborhood. Yet Street View data is available for big data projects.
In another example, I queried my property in the county's GIS system, which had satellite or aerial photos that were out of date. Coincidentally, the GIS system had an automatic building-recognition algorithm that for a couple of years prominently displayed a large shed in the back yard that never existed.
These are two examples that attempt to present the same information: a photographic view of my property. The views are not compatible because they are taken at different times and both of them are out of date with respect to the current reality.
In these cases, the information includes time stamps that inform the user that the pictures were not taken at the same time and that both are out of date.
In the project of my last job, I spent a lot of effort confronting the problem of matching data from multiple sources that I had no direct control over. In that project, the time stamps of data from multiple sources were a recurring challenge.
One of the time-stamp problems was the definition of which event actually received the time stamp. One of our measurements concerned the accumulation of a counter over a period of time. For that measurement, we were confronted with the following challenges:
- One sensor would provide the time of the start of the interval, and the other would provide the end of the interval.
- One sensor used absolute time intervals (every 5-minute point of the hour) and the other used relative time intervals (5 minutes after the last measurement) that would drift over the months or years the system would operate.
- One sensor used the local clock of the central server collecting the observations and the other used the local clock of each entity being observed.
- Over the history of the project, there were multiple cases where these different clocks were not synchronized with each other or with the true time. There were also multiple upgrades where replacement software or systems would handle time stamps differently than their predecessors or their peers.
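The drift between absolute and relative intervals can be illustrated with a short sketch. The numbers here are hypothetical (a nominal 5-minute cadence with an assumed 2-second per-cycle overhead), not the actual behavior of our sensors:

```python
from datetime import datetime, timedelta

# Hypothetical sketch: two sensors aiming for 5-minute intervals,
# one anchored to the wall clock, one relative to its last measurement.
start = datetime(2024, 1, 1, 0, 0, 0)
overhead = timedelta(seconds=2)  # assumed per-cycle processing delay

# Absolute schedule: every 5-minute point of the hour.
absolute_ticks = [start + i * timedelta(minutes=5) for i in range(12)]

# Relative schedule: "5 minutes after the last measurement", so the
# per-cycle overhead accumulates and the schedule drifts.
relative_ticks = [start]
for _ in range(11):
    relative_ticks.append(relative_ticks[-1] + timedelta(minutes=5) + overhead)

drift = relative_ticks[-1] - absolute_ticks[-1]
print(drift)  # -> 0:00:22 (22 seconds of drift after only 11 cycles)
```

Over the months or years the system runs, this small per-cycle offset compounds until the two sensors' "same" 5-minute windows no longer overlap at all.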
For my project, it was important to compare these two measurements. Because we had no control over these sensors, we had to adapt to changes after they became apparent. We needed to be alert to the problem of different and changing definitions of the basic concept of the current time, and we needed to quickly implement changes to adapt to these unexpected changes.
In the above examples of the property photographs, the pictures were years apart, so this problem is not apparent. Even if by chance the two photographs had the same time stamp, we would still have the problem of how time was defined by Google or the county's source. For property photographs, a difference of a few minutes or even a day or so is probably irrelevant.
Consider the problem where the two sources behaved like the sources in my project. Imagine, for example, that the Google photograph records the moment the photograph was taken while the county's photograph records the time it received a package of all photographs taken by its provider. These two times may match, but the information would be months apart. A lot can happen to a property in a month.
I also recall a similar scenario when we would get bulk data feeds from an intermediate aggregating source. Behind that intermediate source were automated systems that assured us that the time stamp of the delivered package was a fixed offset from the time of the actual observations. We used the time stamp of the package to calculate a time stamp for the observations in the package. In technical terms, the aggregating process would run at a particular time to archive all of the files in a particular directory and deliver that archive with the time that the archive was created. One day we started to notice that consecutive archives were identical except for the time stamp. Relying only on the time stamps, we would have to conclude that nothing changed in the interim, so that all observations matched exactly with the previous observations. In fact, what happened was that the back-end system changed the directory where new data was being stored, but the archiving process continued to archive the old directory of stale data. This is another example of the historical time stamp not providing the meaning that we expected. The time was literally the time of the archive, not the time of the observation.
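One defensive check against this failure mode is to fingerprint each archive's contents and flag a delivery that is byte-identical to the previous one despite a newer time stamp. This is a minimal sketch of the idea, not our project's actual code; the function names and record format are my own assumptions:

```python
import hashlib

def content_fingerprint(records):
    """Hash the observation payload itself, ignoring delivery time stamps."""
    h = hashlib.sha256()
    for rec in sorted(records):
        h.update(rec.encode("utf-8"))
    return h.hexdigest()

def is_stale(prev_records, new_records):
    # An identical payload under a fresh time stamp suggests the archiver
    # is re-packaging old data (e.g., still reading a retired directory).
    return content_fingerprint(prev_records) == content_fingerprint(new_records)

# Hypothetical counter readings: two "different" deliveries, same content.
monday = ["meter-7,1041", "meter-8,977"]
tuesday = ["meter-7,1041", "meter-8,977"]
print(is_stale(monday, tuesday))  # True -> alert; don't trust the time stamp
```

A check like this catches the stale-directory failure on the first repeated delivery instead of relying on an analyst to eventually notice that nothing has changed.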
We had assumed that we could always infer the time of the observations from the time stamp of the archive that contained them. We requested, without success, that they include a time stamp in the individual observations; the best we could negotiate was a promise that this would never happen again.
In the property photograph example above, the county may receive a fresh package from its surveillance provider where that package contains copies of the old photographs. This error may not be noticed until some analyst recognizes that the new pictures are identical to the old ones. In other words, as in my example, the error would be caught too late.
The time stamp is an interesting problem because it is so basic and intuitive that we often take it for granted. The observation is the measurement that happens to carry a time stamp. We often overlook the fact that the time stamp is itself an observation.
We want time to behave like it does in the real world. In the real world, time is an absolute dimension that supposedly can be traced back continuously to the time of the big bang with consistent nanosecond (or finer) resolution. We want observations to treat time as an independent variable that we can all agree on.
In diligently designed operational real-time systems, there is usually an attempt to impose a requirement that all participating sensors use time consistently. This is also how we relate to each other in current events. For example, we can arrange a meeting at a specific time and be fairly certain people will come together at that appointed time. In the present moment, absolute time is not ambiguous (at least at the scale of human experience).
For historical data, however, time is much more ambiguous. We see a time stamp that uses the correct notation to represent time exactly as we would use it in the present moment. The problem is the question of exactly what this time means. This is especially true for time stamps from a variety of sources that we have no control over.
The variety aspect of big data often refers to using remote data sources that have no prior commitment to support the big data project. In this sense, big data shares with historical data the problem of an ambiguous definition of time. In my experience, remote sensors can change their definition of the meaning of the time stamp without notice. For solutions that require veracity of data, there is a need for labor-intensive monitoring of the data and quick adaptation of algorithms to changes in the time measurements. This labor-in-the-loop requirement for veracity can limit the velocity needed for real-time operation.
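Part of that monitoring can be automated cheaply. One illustrative sketch (my own construction, with hypothetical cadence and tolerance values) flags gaps between consecutive time stamps that stray from the expected cadence, a simple signal that a source's handling of time may have changed:

```python
from datetime import datetime, timedelta

def check_cadence(timestamps, expected=timedelta(minutes=5),
                  tolerance=timedelta(seconds=30)):
    """Return gaps between consecutive time stamps that stray from the
    expected cadence -- a cheap signal that a source's definition of
    time may have changed (clock reset, upgrade, new offset)."""
    anomalies = []
    for earlier, later in zip(timestamps, timestamps[1:]):
        gap = later - earlier
        if abs(gap - expected) > tolerance:
            anomalies.append((earlier, later, gap))
    return anomalies

# Simulated feed: steady 5-minute ticks, then a gap after an "upgrade".
ticks = [datetime(2024, 1, 1, 0, 0) + i * timedelta(minutes=5) for i in range(4)]
ticks.append(ticks[-1] + timedelta(minutes=65))

print(check_cadence(ticks))  # one anomaly: the 65-minute gap
```

A check like this does not remove the human from the loop, but it shortens the time between a sensor silently changing its behavior and someone noticing.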
The best way around this is to impose a single definition of time stamps for all sensors, where that definition is consistent with the particular project's needs.
The problem for big data is that the sensors are used by multiple independent big data projects, each with its own requirements that might call for different definitions of time. The above discussion covers just one possible problem, but it is a problem involving the most basic piece of information: the time stamp. For practical reasons, the big data project must accept the data from its remote sensors as provided and then adapt accordingly. If the big data system is used in an operational sense to make automatic decisions that interact with the real world, then there is a risk of a period of time when it may be wrong because the data is no longer valid.
As mentioned in an earlier post, sometimes there is an unwritten acceptance that occasionally the data will disappoint the big data project. It is the price they choose to pay for velocity and variety as an alternative to paying for veracity.