In my view, the goal of data analytics is to learn something new from the data. Nearly everyone would say much the same thing, but I have a particular interpretation of the process of learning new things.
I believe the human brain is given too much credit for its reasoning capacity. Much human achievement comes from a simple innovation: the ability to record our language symbolically in a durable form, so that information can reliably pass between generations. The recorded information becomes another source of perception, one unavailable to other life forms. Without this perception, human intelligence would be far less distinguishable from that of other creatures. Certainly, there would be differences in certain areas, but those differences work both ways: in some tests other creatures outperform humans, while humans outperform them in others. Without written language, human development could not have achieved what it has. Certainly, there have been successful human societies without written language, but they rarely sustain themselves for more than a couple of generations because of the lack of information transfer from the now-dead to the living.
Humans have the perception of communication from the no-longer living. I hesitate to say that no other animal has this ability, but the durable record of symbolic language is especially efficient and also very conveyable. Humans can perceive messages sent from individuals related by nothing but a shared interest in a topic.
In my thinking, our intelligence is possible because the information presented to all of our senses is already intelligible to the meager human brain. The information we glean from our senses is similar to the information we obtain from reading written material. The reason we can make sense of our senses is that something else has already prepared the sensory information to be intelligible. The source of intelligence is not vast networks but a deliberate sequence of intelligibility operations occurring in supply-chain fashion.
The current enthusiasm for big data technologies comes from many success stories of use of the data. Often the source of the success is attributed to the 3Vs of volume, velocity, and variety, but in a specific construction: high velocity of large volumes of data in many varieties. In this interpretation the emphasis is on velocity, and the burden is on technology to push more volume and variety toward higher velocities. We condense this concept into a notion of real-time data. The technologies make all of the most immediately available data available to decision-making.
I agree there is value in rapid access to the most immediate information. Immediate access to current observations is essential for control systems and operations management. Also, decision makers often demand that their analysts include the most recent observations in conclusions about historic data. The immediate data is important for controlling operational systems and for identifying current opportunities for exploitation.
My concern is that the focus on immediate observations distracts our attention from the real source of intelligence, which comes from historic data. I describe this by analogy with the difference between what we call hard and soft sciences. The immediate observations of a real world that remains available for our interaction are more solid than the historic evidence of a world that no longer exists. Despite the fact that historic data was once “hard” operational data, its increasing disconnect from current reality makes its information more subject to debate and re-interpretation. Many (if not most) valuable human discoveries come from debate over the now-ambiguous historic data rather than from the unambiguous current observations. Our fascination with high-velocity data can overlook the bigger opportunity for intelligence discovery through debate of historic data.
Operational data will continue to increase in importance for modern systems. The reputation of these immediately available data sources needs protection through strategies such as bringing historic data to the data source rather than shipping the operational data to data warehouses (or data lakes). The emphasis of operational systems is on optimizing current opportunities. The ability to exploit those opportunities depends on defending the operational data and enriching that data with shared historical data. Data warehouses or data lakes facilitate a secondary market of shared historical data, but the operational advantage comes from closely held and protected operational data. Velocity provides advantage by giving decision makers exclusive access to the most recent observations. Velocity is the source of the type of intelligence that involves uncovering secrets: information already known by others.
While there is a lot to be learned from fast analysis of operational or recently obtained data, this area is receiving a more than sufficient amount of attention and investment. I think the current emphasis on operational data is distorting investments away from the more valuable project of learning from historical (no longer operational) data. Currently, we are investing in the challenge of instrumenting operational data streams for analysis at operational timescales. This is a big challenge and requires a lot of investment in labor and technology. I suspect most of this investment will be wasted. Even now, organizations are bragging about their capability to analyze data in operational time while at the same time struggling to defend that investment with unambiguous examples of return traced to this capability.
There is limited opportunity for return on investment from analysis of very recent operational data. This return is analogous to feedback loops in control systems: it can introduce more control over the operational system. The common examples of fed-back signals involve acting on signals in the operational data to exercise some targeted action, such as quick marketing offers or automated stock trades. These can offer some return, but that return is naturally limited by the short time period available to take advantage of the opportunity. The window of opportunity is naturally short because new operational data will present the distraction of new opportunities to pursue. Also, competitors and customers are employing their own operational data intelligence, so they will quickly close any advantage gap that may appear. In many cases, I believe the potential return from exploiting operational data will not justify the investment. This is unfortunate because that investment distracts the organization from historical data, which offers more durable knowledge discovery.
Discovering new truths previously unknown to anyone usually comes from the study of historical data, data that is no longer attached to current events. Historical data has lost its momentum: it has zero velocity. Disconnected from current events, previously operational data becomes ambiguous. Conflicts, contradictions, or gaps emerge as more historical data accumulates. Historical data ripens to reveal its secrets over time. Ignored, this data will eventually disappear.
In discussions of big data technologies, the terms velocity and real-time have multiple meanings. The first meaning is making sense of operational data in the immediate moments following the observation. At least, that is the first meaning that comes to my mind. However, the same technology that delivers speed can also apply to obsolete observations, or historical data. Accumulated over time, the volume of historical data becomes very large. Fast algorithms working on large historical data sets permit the kind of intelligence discovery that I have been describing as unique to historical data.
The speed of the algorithms permits interactive analysis. Interactive analysis is sometimes described as real-time from the perspective of the analyst. The analyst experiences an interactive session that repeats multiple cycles: forming some question, posing some ad-hoc query, and then obtaining a result that raises new questions. This interactivity is sometimes considered a form of real-time analysis.
The same technologies that enable real-time analytics of freshly arriving data also enable interactive querying of historical data. I am more interested in the analysis of this historical data. I welcome the data technologies that allow me to rapidly explore historical data through multiple iterations of asking questions and obtaining answers.
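The question-query-answer cycle can be sketched with any fast query engine; the following minimal sketch uses Python's built-in sqlite3 module on a tiny in-memory table. The table name, columns, and values are all hypothetical, chosen only to illustrate one cycle of asking a broad question, reading the result, and posing a refined follow-up query.

```python
import sqlite3

# A hypothetical historical store held in memory for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2023-01-01", "north", 120.0),
        ("2023-01-01", "south", 80.0),
        ("2023-01-02", "north", 95.0),
        ("2023-01-02", "south", 160.0),
    ],
)

# Cycle 1: a broad question -- which region sold more overall?
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# The answer raises a follow-up question, which becomes the next
# ad-hoc query: how did the south's sales develop day by day?
by_day = conn.execute(
    "SELECT day, amount FROM sales WHERE region = 'south' ORDER BY day"
).fetchall()
```

When each query returns at conversational speed, the analyst can keep refining the question without losing the thread, which is the interactivity described above.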
When I am most productive studying data, or using that data to communicate with others, I imagine an analogy of an intense conversation or argument between two intelligent and informed individuals. In an ideal such conversation, both are answering each other's points with new information that specifically addresses the previous points and adds a little more to advance the argument. The new information also presents new openings that the receiving person immediately counters with more data or an argument against the presented data. Contrast this with a sporadic conversation between two people who don't know a mutual language and have to look up answers or consult notes for each reply.
I believe that the best human learning occurs through conversation and argument. The best conversation is between two individuals who are well prepared to answer each other in real time, without stuttering or taking a break to research some point. They have all the information they need at the time they need it. The rapid back and forth can be very challenging, but it can lead to the discovery of something new that neither party knew before. That discovery may be a mutually satisfying compromise, or it may be an unexpected third point of view that is at least more interesting than either of the two initial positions.
Knowledge discovery from data involves using data in place of language in an argument. Instead of rhetorical arguments, we access data to answer the opposing claims. For human encounters, such as a presentation from an analyst to a diligent decision maker, we welcome data tools that can retrieve data in real time during the encounter. The decision maker should be able to pull up questionable data to challenge the presenter, and the presenter should be able to respond by pulling up relevant explanatory or rebuttal data.
We can describe such tools as real-time data queries, but the real time is in terms of an argument rather than in terms of the age of the data. In this sense, we can have real-time query capabilities on very old data. A better term for this capability is interactive queries. When we can interact with the data at the speed of a conversation, we can imagine the data queries executing and returning in real time without interrupting the flow of the conversation with disruptively lengthy delays.
Again, most of the time when I see promotion of real-time data analysis, it is in the context of working with the most current data. The query results include the contributions of the most recent observations. However, most of the time, the query will also include far more historical data than recent data. It is not difficult to monitor data as it arrives and present it as some query response. It is more difficult to integrate that result with historical data, especially when the query needs to change in response to questions or other challenges. Given that historical data (accumulated over long time periods) will almost always dwarf the immediate observation data, real-time analytics really means fast query of the historical data. In order to query in real time, the tools must necessarily be able to query historical data quickly.
Real-time analytics requires the ability to quickly query historical data. This capability can be valuable even when there is no need for any recent information. For example, we can use these tools to discuss something that occurred yesterday, or in some other period in the distant past. The presence of immediately observed data is optional for real-time analytics. Indeed, sometimes the immediately observed data may be irrelevant to the question, as in the case of discussing what occurred yesterday.
When interactive queries are used on historical data, we are used to the fact that such queries take time to run. When the tools respond too quickly to questions that we expect to require some research, we may suspect that the answers are not valid: answers that arrive so quickly may be canned and may not be relevant. Conversely, we are conditioned to believe that answers that take longer to retrieve are more credible in terms of being retrieved from an actual data store. We may still question the accuracy and relevance of the data, but we trust that it came from an actual store and was not merely prepared ahead of time. In this case, the most recent observations can provide validation that the system is honestly querying the actual data store when it returns results quickly. The real-time observations included in the result assure the recipient that the data is real data that matches the query. This use turns the definition backwards: the real-time query capability essential for interactive discussions of historic data uses real time (immediately recent observations) as evidence that the queries are actually pulling data from the data store. The immediately recent data may then be discarded as irrelevant to the discussion, but it serves a purpose: verifying that the answers come from data in response to a query constructed to be relevant to the current discussion.
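This liveness check can be made concrete: a query can return, alongside the historical answer, the timestamp of the newest observation it touched. The following sketch assumes a hypothetical event table with a Unix-timestamp column; the analytical answer comes from the historical bulk, while the freshness of the newest observation serves only as evidence that the result came from the live store.

```python
import sqlite3
import time

# Hypothetical event store; 'observed_at' is a Unix timestamp.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (observed_at REAL, value REAL)")
now = time.time()
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (now - 86400 * 30, 10.0),  # a month-old observation
        (now - 86400, 20.0),       # a day-old observation
        (now - 5.0, 30.0),         # an observation seconds old
    ],
)

# One query returns both the historical answer (the sum) and the
# age marker (the newest observed_at the query actually touched).
total, latest = conn.execute(
    "SELECT SUM(value), MAX(observed_at) FROM events"
).fetchone()

# If the newest observation is recent, the fast answer plausibly came
# from the live store rather than from a canned, precomputed result.
is_live = (time.time() - latest) < 60.0
```

The recent observation contributes little to the sum itself; its role here is purely the credibility check described above.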
Interactive reporting is indistinguishable from real-time reporting. Both involve the same high-speed query engine to pull up relevant historical data. The value, however, comes mostly from the interactive nature applied to historic data. While there are some instances where the value lies specifically in the most recent observations (operational feedback-loop systems such as program trading), most of the time the core value comes from the far vaster historical data. We use the most recent observations as a credibility check: the retrieval of the recent information assures us that the query operated on the actual data at the time the query was run.
This leads us to use the term real-time analytics when we really want interactive analytics on historical data in order to discover new knowledge.