In my prior job, I worked with large amounts of data that had a lot of properties. To do my tasks, I wrote SQL queries, some of them complex, to get the information I wanted. The project started with modest goals, but the schedules were sometimes as short as a few hours, not only to come up with a new algorithm but also to apply it to the data and present the results. I didn’t pay much attention to what category of career I was participating in. It was just what needed to be done at the time.
It was only later that I learned that I was not alone. Lots of people were doing the same thing with different data. Unlike me, they were using more modern tools than the SQL I was using. In effect, I was reinventing in isolation what a larger community had been developing. I was far behind.
These fields that I was not paying much attention to are called Big Data and Business Intelligence. The two terms mean different things but there is enough overlap that they are sometimes confused. My work involved doing things in both.
As a latecomer to the party, it was fun to find a big party going on: lots of people doing exciting things kind of like what I was doing. But it was also kind of disappointing, because I was using the concepts in a different way.
The technologies are broadly applicable and can be useful everywhere. But it is disappointing that the community shows a preference for certain types of problems.
For example, Big Data necessarily has to be big. Big has a particular meaning of bigger than what you have been doing. The focus is on more data entries, more properties for these entries. Each year, the bar for Big keeps being raised in terms of some minimum number of petabytes or some minimum number of dimensions. Below that minimum, you have mundane, routine little data.
There is a good reason to make the distinction. Little data as defined above is not technologically challenging. Off-the-shelf hardware and software are readily available to handle these jobs easily. Plenty of qualified candidates are available who can make it work.
Maybe it is because I am old enough to remember thinking of data in terms of what could be stored on a programmable calculator, but I distinguish little data from big data based on the qualities of the data, irrespective of the cost of the technologies.
I have done what I call “Big Data” on a calculator. Obviously, this is ridiculous. I didn’t give it the name big data at the time. I just looked around later and found that what I had done matched the kinds of things that fall under the term Big Data.
Also, it seems a lot of the focus of Big Data is on tracking each specific instance of data. In particular, Big Data is about the ability to quickly and effectively retrieve a very specific original (not massaged) data point in order to inspect it. This is the area of identifying and locating individual assets (including people). This can mean looking either for good guys (such as specific contacts for a sales call) or for bad guys (such as rule breakers).
There is a difference between retrieving data and accounting for it. I accounted for the same data, but in aggregates. I mapped the original data into useful categories to support policy decision making. The aggregates I worked with combined large numbers of original data points. If the original data was big, then my data was a whole lot smaller. It just turned out that I ended up with a lot of aggregated data points: not as many as the original data points, but still a lot.
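The difference between retrieving and accounting can be shown with a minimal Python sketch. Everything in it is invented for illustration: the records, the field layout, and the shift boundaries are assumptions, not the data or categories described in this post.

```python
from collections import Counter

# Hypothetical raw records: (device_id, hour_of_day, status).
RAW_EVENTS = [
    ("dev-01", 9, "ok"),
    ("dev-02", 9, "error"),
    ("dev-01", 14, "ok"),
    ("dev-03", 15, "error"),
    ("dev-02", 22, "ok"),
    ("dev-03", 23, "ok"),
]


def retrieve(events, device_id):
    """Retrieval: return the original, unmodified records for one device."""
    return [event for event in events if event[0] == device_id]


def shift_of(hour):
    """Map an hour (0-23) to a coarse category; the boundaries are illustrative."""
    if 8 <= hour < 16:
        return "day"
    if 16 <= hour < 24:
        return "evening"
    return "night"


def aggregate(events):
    """Accounting: count events per (shift, status), dropping individual identity."""
    return Counter((shift_of(hour), status) for _device, hour, status in events)


summary = aggregate(RAW_EVENTS)
for category, count in sorted(summary.items()):
    print(category, count)
```

Retrieval hands back the individual records, while the aggregate has fewer entries than the raw data and no longer identifies any single device. With real data the number of categories can still be large.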
My definition of big data is not the same as the much more popular definition. I need a different term.
Then there is Business Intelligence, and in particular multi-dimensional data. This can involve huge data sets that qualify as big data, but the emphasis is on managing the number of properties (or dimensions) of the data points. The bigness is in the complexity of the relationships among these properties and in finding efficient ways to summarize them into results.
The market for Business Intelligence is … business. The bulk of the emphasis is on results that are somehow expressed in a unit of currency. Reduce data to a dollar sign. What characterizes areas where we are experiencing losses? What areas show potential for more revenue or profit? The term is so popular and overused that products can be sold as BI by simply scaling some non-monetary measure by a factor of dollars-per-measure: show the results in terms of dollars and you have BI.
That was not what I was doing. I like to think that what I was doing had an impact on the bottom line, but I have never been asked to quantify exactly how much. On the other hand, my services were popular, so someone who cares about return on investment must have appreciated them.
I didn’t calculate in terms of money. In fact, if I noticed that a question was starting to drift toward money, I said this was someone else’s job. Let someone else have fun working that problem. I had plenty to keep myself busy.
Also, I think the calculation of money is really tricky. My favorite example is return on investment. Many proposals and vendors will brag about a return on the investment. But the problem is that a particular decision may involve combining multiple proposals and vendors into a single solution. In aggregate, the total cost may be smaller than the total return. But which particular proposal or vendor was responsible for which particular part of the return? Usually, it comes down to a calculation in which each single part was responsible for the entire return. This is kind of true, because the solution wouldn’t have worked without that piece. Not everyone can claim the entire return for their cost, but it seems everyone does. I am happy to leave it to others to argue about how much money is involved one way or the other.
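The arithmetic behind that complaint can be sketched in a few lines of Python. The vendor names and dollar figures are made up for illustration, and ROI is computed here as (return - cost) / cost, which is one common convention and an assumption of this sketch.

```python
# Hypothetical costs for three pieces of one combined solution, and the
# return produced by the solution as a whole.
COSTS = {"vendor_a": 40.0, "vendor_b": 25.0, "vendor_c": 35.0}
TOTAL_RETURN = 150.0


def combined_roi(costs, total_return):
    """ROI of the whole decision: the entire return against the entire cost."""
    total_cost = sum(costs.values())
    return (total_return - total_cost) / total_cost


def claimed_roi(costs, total_return):
    """ROI each vendor reports when it credits itself with the entire return."""
    return {name: (total_return - cost) / cost for name, cost in costs.items()}


whole = combined_roi(COSTS, TOTAL_RETURN)
claims = claimed_roi(COSTS, TOTAL_RETURN)

# Each vendor counts the same return once, so the claims add up to far more
# than the return that actually exists.
total_claimed_return = TOTAL_RETURN * len(COSTS)

print(f"combined ROI: {whole:.2f}")
for name, roi in claims.items():
    print(f"{name} claims ROI: {roi:.2f}")
print(f"return claimed in total: {total_claimed_return:.0f} vs actual {TOTAL_RETURN:.0f}")
```

Every individual claim looks far better than the ROI of the combined decision, and together the claims count the same return three times. The sketch does not try to split the return among the vendors; that is exactly the argument left to others.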
I used business intelligence techniques for something that was not about sales, marketing, inventories, or salaries. I used big data techniques on data that was manageable with off-the-shelf, mid-range hardware and software.
A different term for business intelligence is multidimensional data. Dimensions are essentially columns of a database record. Multiple-column data has been around since the first days of databases.
My definitions for big data and business intelligence (at least in terms of what I saw myself doing) are recursive:
- Multidimensional data is data that can be summarized into fewer dimensions and the result is still multidimensional. Non-multidimensional data is something that can be fully represented in some type of chart or graph.
- Big data is data that can be summarized into a data set that is big data. Data ceases to be big if it can be fully captured in a document or a chart that a human can be expected to comprehend.
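The recursive idea in these two definitions can be illustrated with a small Python sketch that repeatedly rolls a data set up into fewer dimensions. The dimension names, the `count` measure, and the sample rows are hypothetical, and summing is just one possible way to summarize.

```python
from collections import defaultdict


def roll_up(rows, keep):
    """Summarize rows by summing 'count' over every dimension not in `keep`."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[dim] for dim in keep)
        totals[key] += row["count"]
    # The result has the same shape as the input, so it can be rolled up again.
    return [dict(zip(keep, key), count=value) for key, value in totals.items()]


# Hypothetical four-dimensional data: site, device_type, month, status.
ROWS = [
    {"site": "east", "device_type": "a", "month": "jan", "status": "ok", "count": 3},
    {"site": "east", "device_type": "a", "month": "jan", "status": "error", "count": 2},
    {"site": "east", "device_type": "b", "month": "feb", "status": "ok", "count": 4},
    {"site": "west", "device_type": "a", "month": "jan", "status": "ok", "count": 1},
    {"site": "west", "device_type": "b", "month": "feb", "status": "error", "count": 5},
    {"site": "west", "device_type": "b", "month": "mar", "status": "ok", "count": 6},
]

three_d = roll_up(ROWS, ("site", "device_type", "month"))
two_d = roll_up(three_d, ("site", "device_type"))
one_d = roll_up(two_d, ("site",))

for level in (three_d, two_d, one_d):
    print(len(level), "rows, total count", sum(row["count"] for row in level))
```

Each roll-up produces another table of dimensions plus a measure, which is why the summary of multidimensional data can itself be multidimensional. In this toy example the final level is small enough to put in a single chart, which is the point where, by the definitions above, the data stops being multidimensional or big.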
But my work was both at the same time: there is no reason to describe it as two concepts. There was a unifying understanding of what I did. My clients knew what I could do; they simply referred to it as the work that they could assign to me.
The title of this post is what I think better describes the discipline: serendipitous data. This is existing data that was collected for one purpose, usually a short-lived one serving some operational need. The data is then archived, and the archive can become very large. Serendipitous data is the discovery of new information in this archive that was not specifically designed or intended by the developers of the original data. Serendipity is a good way to describe this process because it is often surprising to find that this information is available through relationships among aggregated categories of seemingly unrelated data.
Perhaps there is another term, but Serendipitous Data describes what I find so compelling about this field I serendipitously found myself in.