In my discussions on data science, I’ve explored the intersection of my experience with big data trends. In particular, I focused on my experience in assigning trust to data sources, or more accurately, my assigning of various levels of suspicion to data. I don’t think I ever encountered a data source that I trusted.
But this is mostly a retrospective reinterpretation of what I was doing. I did not set out to do big data. I just happened to cover the same ground.
My motivation for entering the data sciences came from my background in simulation and modeling, which mostly involves generating data for proposed scenarios. These simulations included both discrete-time (time-step) simulations and discrete-event (discontinuous-time) simulations, with the latter being my most recent work.
In such simulations there is a starting point in time with an initial set of conditions. From that point, the simulation calculates a new set of conditions at the next time interval. This new time and set of conditions can be seen as a new starting point and initial conditions for computing what happens in the next interval. The simulation proceeds in a repeating fashion, using the last results to compute the next results. The simulation may augment the computed conditions with predetermined conditions assumed to be available at particular times, but the bulk of the conditions are computed from earlier iterations.
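The repeating structure described above can be sketched in a few lines of Python. The state variables and the update rule here are hypothetical placeholders, not anything from my actual models:

```python
# Minimal sketch of a discrete-time (time-step) simulation loop.
# The state variables and update rule are hypothetical placeholders.

def step(state):
    """Compute the next set of conditions from the current ones."""
    return {
        "passengers_in_terminal": state["passengers_in_terminal"]
                                  + state["arrival_rate"]
                                  - state["departure_rate"],
        "arrival_rate": state["arrival_rate"],
        "departure_rate": state["departure_rate"],
    }

def simulate(initial_state, n_steps):
    state = dict(initial_state)
    history = [state]
    for _ in range(n_steps):
        state = step(state)  # the last results feed the next computation
        history.append(state)
    return history

history = simulate({"passengers_in_terminal": 100,
                    "arrival_rate": 30,
                    "departure_rate": 25}, 10)
print(history[-1]["passengers_in_terminal"])  # → 150
```

Each iteration treats the previous result as the new initial conditions, which is exactly why a long simulated day requires many sequential steps.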
An easy-to-visualize example is a simulation of a day at an airport terminal. The simulation models the arriving and departing aircraft. It models the flow of people embarking and disembarking the aircraft. It models how these people navigate through the concourses, security checkpoints, seating areas, baggage claims, etc. It models the non-passengers who arrive to meet or send off passengers, and the movements of these non-passengers.
I was intrigued by a situation that can happen in the middle of the simulated day. What happens when a flight is delayed by chance? In discrete-event simulations, we often assign a probability that a flight will be delayed, and the delay will occur randomly at some point in the day. All of the other events, in terms of arriving and departing passengers, are also random. As a result, the entire state of the airport at the time of the single flight delay is a random condition.
The simulation proceeds to predict what happens when that one flight is randomly delayed given the random state of the terminal at the time of that delay. A single run may be informative, but we usually demand multiple runs to gather sufficient scenarios of different terminal conditions at the time of the delay. There are two problems. One is that each run of the simulation takes a long time. The second is that since everything is random, the flight delays may happen at different parts of the day, at different gates, or may occur in multiples or not at all.
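The second problem can be seen directly with a toy Monte Carlo sketch. The flight list and delay probability below are assumptions for illustration; the point is only that the set of delayed flights differs from run to run:

```python
import random

# Toy illustration of why many runs are needed: each simulated day
# randomly decides which flights are delayed, so the delayed flights
# differ per run. The schedule and probability are hypothetical.

FLIGHTS = [("AA101", 8.0), ("AA102", 9.5), ("AA103", 12.0), ("AA104", 17.25)]
P_DELAY = 0.15  # assumed chance that any given flight is delayed

def run_once(rng):
    """One simulated day: return the flights that happened to be delayed."""
    return [(flight, sched + rng.uniform(0.5, 2.0))  # random delay length
            for flight, sched in FLIGHTS if rng.random() < P_DELAY]

rng = random.Random(42)
runs = [run_once(rng) for _ in range(1000)]
delay_counts = [len(r) for r in runs]
print(min(delay_counts), max(delay_counts))  # some days none, some days several
```

Across the 1000 simulated days, many days have no delay at all while others have multiple, which is why isolating the effect of one particular delay requires either many runs or forcing the delay explicitly.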
There are ways to adjust the simulations to narrow the possibilities. For example, we may randomize all of the other variables but set a probability of 1 that a certain flight will be delayed at a certain gate and announced at a certain time. This is still time consuming. It also doesn’t tell us much, because what we really want to know is what generally to expect for a delay at every possible time of the day and at every possible gate. The impracticality of comprehensively simulating every possible permutation is what leads us back to the original plan: randomize the flight delays and concentrate on a statistical sample of the airport operating as a whole.
Such simulation practices were conceived at a time when computing resources were expensive and data storage technologies in particular were severely limited. To be practical, the simulations needed to work with reduced data in the form of statistical probability distribution functions. These probability distribution functions are input data to the simulation. This input data required a lot of prior investigation to justify the choice of the distribution function (its shape) and to quantify its parameters. The simulation used random-number generators to select values from these distribution functions.
As computing resources improved, these simulations took advantage of faster processing, more memory, and parallel processing techniques to run this basic approach at faster rates. Simulation technology focused on the mechanics of the simulation process itself.
I started to pay attention to the more overlooked problem of preparing the input probability distribution functions. Often the input statistical data provides most of the information for the value added by the simulation. Although we take much pride in developing algorithms that reproduce physics, the value of a simulation is primarily determined by the quality of the input data. The input data is reduced data. The data reduction takes actual observations that would overwhelm a simulation and translates them into a much more condensed form that we convince ourselves will recreate comparable observations through the use of appropriate random-number generators. We expect that the simulated observations will be much like the actual observations, except that they will not have to be stored.
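A small sketch of this reduce-and-regenerate cycle, using a hypothetical list of observed inter-arrival times and an assumed exponential shape for the distribution:

```python
import random

# Sketch of data reduction and regeneration: raw observations are
# reduced to one parameter (the mean), and a random-number generator
# then draws synthetic values from the chosen distribution shape
# (exponential here, an assumed choice for inter-arrival times).

raw_interarrivals = [1.2, 0.4, 2.1, 0.9, 1.5, 0.7, 3.0, 1.1]  # hypothetical data
mean = sum(raw_interarrivals) / len(raw_interarrivals)         # the "reduced" data

rng = random.Random(0)
synthetic = [rng.expovariate(1.0 / mean) for _ in range(10000)]
print(round(sum(synthetic) / len(synthetic), 2))  # close to the observed mean
```

The raw observations can then be discarded; the simulation keeps only the distribution shape and its parameters, trusting the generator to recreate comparable observations on demand.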
But today, we are able to store and rapidly retrieve the actual historical observations. I wondered why not rethink the simulation problem (conjecturing what might happen if something were different) to use actual historical observations instead of statistical models.
The idea I had was to consider a period of time relevant to a particular flight delay. I could construct a table that maps the flight-relevant passengers. I could build a query that joins with the observations of people in the terminal to identify these passengers. I could then use more queries to estimate the resulting congestion that might occur at that gate.
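A minimal sketch of the mapping-table-and-join idea, using SQLite through Python. All table names, column names, and values are hypothetical stand-ins; real observation tables would be far larger:

```python
import sqlite3

# Sketch of the mapping table joined against terminal observations.
# All table/column names and data are hypothetical.

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE observations (person_id INT, obs_time REAL, location TEXT);
CREATE TABLE flight_passengers (person_id INT, gate TEXT);
-- hypothetical observed people in the terminal
INSERT INTO observations VALUES (1, 13.5, 'B12'), (2, 13.6, 'B12'),
                                (3, 13.7, 'C03'), (4, 13.8, 'B12');
-- mapping table: passengers relevant to the delayed flight at gate B12
INSERT INTO flight_passengers VALUES (1, 'B12'), (2, 'B12'), (4, 'B12');
""")

# Join the mapping table with observations to estimate congestion at
# the gate during the window relevant to the delay (13.0-14.0, assumed).
row = con.execute("""
    SELECT COUNT(*) FROM observations o
    JOIN flight_passengers fp ON o.person_id = fp.person_id
    WHERE o.location = fp.gate AND o.obs_time BETWEEN 13.0 AND 14.0
""").fetchone()
print(row[0])  # → 3
```

The join identifies the flight-relevant passengers among everyone observed in the terminal, and the count stands in for the congestion estimate.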
This new set of queries can then be run for every single day over the entire year. This provides a very realistic sample of multiple conditions at the scheduled flight time. Some of these conditions may be seasonal or calendar-related. Some may be due to other conditions within the terminal.
What I found was that this query can run very fast with modern database query engines. At least it ran far faster than the old simulation modeling approaches.
The queries ran so fast that I could make a simple change to that initial mapping table: add a column for a flight number. This new table maps all flight-related passengers to all of the flights possible for this terminal. The new query answers the question of what happens when any particular flight is delayed, for every possible flight on every observed day.
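Continuing the earlier hypothetical schema, the extension amounts to one new column and a grouped query. Again, all names and values are illustrative assumptions:

```python
import sqlite3

# Sketch of the extended mapping table: a flight_no column lets one
# query answer the what-if question for every flight on every observed
# day. Table/column names and data are hypothetical.

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE observations (person_id INT, obs_day TEXT, location TEXT);
CREATE TABLE flight_passengers (flight_no TEXT, person_id INT, gate TEXT);
INSERT INTO observations VALUES
    (1, '2024-01-01', 'B12'), (2, '2024-01-01', 'B12'),
    (1, '2024-01-02', 'B12'), (3, '2024-01-02', 'C03');
INSERT INTO flight_passengers VALUES
    ('AA101', 1, 'B12'), ('AA101', 2, 'B12'), ('AA202', 3, 'C03');
""")

# One query: affected-passenger counts per flight per observed day.
rows = con.execute("""
    SELECT fp.flight_no, o.obs_day, COUNT(*) AS affected
    FROM observations o
    JOIN flight_passengers fp
      ON o.person_id = fp.person_id AND o.location = fp.gate
    GROUP BY fp.flight_no, o.obs_day
    ORDER BY fp.flight_no, o.obs_day
""").fetchall()
for r in rows:
    print(r)
```

One pass over the observations produces a result row for every (flight, day) combination, which is what would otherwise demand a separate simulation run per scenario.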
I found that this query also ran fast on modern database engines with their query optimizers.
As I became proficient at this, I was able to come up with more complicated scenarios and present results that would be useful for decision making.
I called this approach a set-theory-based simulation, as opposed to a procedural-language-based simulation. I found it to be very competitive for certain types of questions we would normally expect from a slower procedural (discrete-time or discrete-event) simulation.
However, I should note the following important points.
First, there are truly monstrous queries that can come close to overwhelming the database resources. I would run these queries on an isolated database engine with all of its resources devoted to them. This is acceptable because it is the same kind of investment as devoting dedicated servers to run simulations. Using database engines this way is sometimes called OLAP (online analytical processing), although that term has since evolved to refer to other technologies. I’m specifically limiting myself to the same SQL that is available for online processing. I’m just running the queries on a separate instance with resources dedicated to these analytic queries.
Second, this approach specifically uses actual observations. These observations occurred in the absence of the proposed new condition. In reality, we would expect the occurrence of a flight delay to change the observations. I had some ideas about how to use set-theory approaches on observed data to predict these cascaded events, but I concede that these considerations are the strength of procedural simulation modeling.
Third, in truth my work had nothing to do with crowds in airports. I used this as something that I hope is easier to describe. I was introduced to the air terminal crowd problem early in my experience with discrete event simulations when I prepared a demonstration of using the technology for modeling crowds in an airport terminal. At the time (early 1990s), I found the scale of the problem to easily overwhelm discrete event simulation technology. Needless to say, I didn’t make a great impression and that project did not go anywhere for me. This occurred long before I started thinking about databases. That example will always stick in my mind as being a fascinating problem.
The set-theory approach doesn’t replace procedural simulations. Instead, the two can complement each other. For example, a procedural simulation may identify a particularly stressful combination of conditions. We can then map those conditions into a database table that relates them to the passengers affected, and then run the query against all of the observed data in the historical database.
Once the procedural simulation identifies a stressful combination of events, we don’t need to re-run the simulation to recreate those conditions. We can fix those conditions into our mapping table and then rerun that query as frequently as we like, at far faster speeds than the simulation can run. For example, this query can run daily with each new day of observed data.
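A sketch of what pinning those conditions and rerunning them daily might look like, again with hypothetical table names, thresholds, and data:

```python
import sqlite3

# Sketch of rerunning a fixed stress scenario against each new day of
# observations. The scenario table pins the conditions the procedural
# simulation identified; all names and values are hypothetical.

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE observations (obs_day TEXT, location TEXT, headcount INT);
CREATE TABLE stress_scenario (location TEXT, threshold INT);
INSERT INTO stress_scenario VALUES ('B12', 50);  -- the fixed conditions
INSERT INTO observations VALUES
    ('2024-01-01', 'B12', 40), ('2024-01-02', 'B12', 65);
""")

def check_day(day):
    """Rerun the fixed-scenario query for one observed day."""
    return con.execute("""
        SELECT o.location FROM observations o
        JOIN stress_scenario s
          ON o.location = s.location AND o.headcount > s.threshold
        WHERE o.obs_day = ?
    """, (day,)).fetchall()

print(check_day('2024-01-01'))  # → []  (no congestion beyond threshold)
print(check_day('2024-01-02'))  # → [('B12',)]
```

The scenario table never changes; only the observations grow, so the same cheap query can flag each new day that would have tripped the stressful condition.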
Even though there is a qualitative difference between the procedural-based simulation and the set-theory-based simulation, it turns out that many of the decision-making questions are sufficiently satisfied by the results of the set-theory approach. For example, the question may be whether to remodel a part of the concourse in such a way as to expand the central hallway and decrease the gate seating area. This approach can identify whether any event in the past would have caused unacceptable congestion in that area of the concourse. This may be precisely the information that is needed for the decision.
One of the problems faced by decision makers is that at some point they have to make decisions. Inevitably (it seems) on that fateful day some very notable event would have occurred just the previous day and they’d want to know whether the decision they are about to make would have made that event better or worse. The set theoretic approach can deliver that kind of answer in a timely manner.
That assumes that there has been an investment in building up the resources and skills to perform set-theory based simulations.