Dark data of coincidences, followed by consequences

In earlier posts, I described the appeal of big data analysis for answering questions that lack any direct observations. The idea is that enriching observations with additional details can expose patterns that provide information we would otherwise miss.

It is like a business marketing study of big data that finds a high share of customers sharing some combination of seemingly unrelated traits. For instance, a product such as shoe polish is found to be very popular among people who enjoy orange juice and corn flakes for breakfast and own at least two bicycles. This would suggest a new advertising campaign.

These are relatively modern ambitions of big data and data mining technologies. But today I was reminded of a much earlier idea popularized by the Minnesota Multiphasic Personality Inventory (MMPI). This is a test made up of a large number of true/false questions about everyday kinds of experiences. The individual questions relate to preferences that anyone may have and are not obviously connected with anything unusual or concerning in terms of mental health. Empirical analysis of the results from individuals with diagnosed personality or mental health traits showed patterns of preferences that are shared among people with certain types of characteristics.

I emphasize my usual caveat that I know very little about psychology and even less about MMPI, but I do see a similarity between the goals of MMPI and those of broader programs of Big Data and Analytics. The goal is to find patterns in unrelated observations that correlate with, or suggest a relationship to, something we otherwise would not be able to observe. Another similarity is that for decision making, correlation is enough. There is no reason to show cause and effect in either direction. It is sufficient to note that the clustering coincides with a particular trait.

In my past posts I tried to itemize different types of dark data, which I define as made-up data usually generated from models. I use the term dark data to indicate that I think it should be used carefully and with suspicion that it may be misleading. However, I recognize it has value. I only note that it is not as valuable as direct, well-documented, and controlled observations. In any event, my itemization left out the type of dark data that is exploited by MMPI and is the goal of many Big Data analysis efforts. That missing type is the empirical clustering of properties with something we want to know.

Another earlier description of this is the idea of the whole being greater than the sum of its parts. New information can be generated by combining observations of unrelated information. This information is found empirically and is tested empirically. As in the case of MMPI, it is a cyclical process of empirically identifying patterns, testing the results, and then updating the mapping of patterns to the sought information.
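To make the idea of empirical identification concrete, here is a minimal sketch of one simple way to do it: keep only the true/false questions that a group with a known trait answers differently from a comparison group, then score new answer sheets against that key. Everything here is my own assumption for illustration (the function names, the 0.25 threshold, and the made-up answers); I am not claiming this is how MMPI or any product actually builds its scales.

```python
def endorsement_rate(responses, item):
    """Fraction of respondents who answered True to the given item."""
    return sum(1 for r in responses if r[item]) / len(responses)


def build_key(criterion_group, comparison_group, min_gap=0.25):
    """Keep items whose True rate differs between the groups by at least min_gap.

    Returns {item_index: keyed_answer}, where keyed_answer is the answer
    that is more common in the criterion group. min_gap is an arbitrary
    illustrative threshold.
    """
    n_items = len(criterion_group[0])
    key = {}
    for item in range(n_items):
        gap = (endorsement_rate(criterion_group, item)
               - endorsement_rate(comparison_group, item))
        if abs(gap) >= min_gap:
            key[item] = gap > 0
    return key


def score(answers, key):
    """Count how many keyed items this person answered in the keyed direction."""
    return sum(1 for item, keyed in key.items() if answers[item] == keyed)


# Made-up answers to four true/false questions (not real test data).
criterion = [
    [True, False, True, True],
    [True, False, False, True],
    [True, True, True, True],
    [True, False, True, False],
]
comparison = [
    [False, False, True, True],
    [False, True, False, True],
    [True, True, True, False],
    [False, True, True, True],
]

key = build_key(criterion, comparison)
print(key)
print(score([True, False, False, False], key))
```

Note that the questions themselves never need to look related to the trait. Only the difference in answer rates matters, which is exactly why I treat the result as something to use with suspicion.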

This is a self-contained approach where the same data both suggests a hypothesis and then confirms that hypothesis.

I prefer to separate the two concepts of hypothesis discovery and hypothesis testing. Big data analytics can prompt us to propose new hypotheses. But I prefer to turn those over to science to test with carefully designed and controlled collection of observations. I am aware that MMPI is supported by a high degree of scientific rigor, but I fear the same doesn’t apply to most modern Big Data efforts.
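A small simulation shows why I want the two steps kept apart. The sketch below is my own illustration, with assumed sizes (300 questions, 50 people for discovery, 2000 fresh people for testing) and purely random answers and outcomes, so there is nothing real to find. Picking the question that correlates best with the outcome still produces an impressive-looking number on the data that suggested it, and that number falls apart on fresh observations.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 if either list is constant."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def make_people(rng, n_people, n_items):
    """Each person is (random 0/1 answers, random 0/1 outcome): pure noise."""
    return [
        ([rng.randint(0, 1) for _ in range(n_items)], rng.randint(0, 1))
        for _ in range(n_people)
    ]


def item_correlation(people, item):
    """Correlation between answers to one item and the outcome."""
    answers = [person[0][item] for person in people]
    outcomes = [person[1] for person in people]
    return pearson(answers, outcomes)


rng = random.Random(0)  # fixed seed so the run is repeatable
N_ITEMS = 300
discovery = make_people(rng, 50, N_ITEMS)   # data used to suggest a hypothesis
holdout = make_people(rng, 2000, N_ITEMS)   # fresh data used to test it

# Hypothesis discovery: pick the item that looks most related to the outcome.
best_item = max(range(N_ITEMS),
                key=lambda i: abs(item_correlation(discovery, i)))

discovery_r = item_correlation(discovery, best_item)
holdout_r = item_correlation(holdout, best_item)
print(f"item {best_item}: discovery r = {discovery_r:.2f}, holdout r = {holdout_r:.2f}")
```

The more dimensions we search, the stronger the best coincidence will look, which is the reason to hand the discovered hypothesis to a separate, controlled collection of observations.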

All of this thinking was triggered when I was watching a YouTube video of a young presenter raising some concerns about the Common Core education program. I enjoyed watching the video primarily because I admired the young presenter's performance. However, at around the 22:00 mark, he presented something that was new to me. At this point he introduced a technology product that applies big data techniques to standardized test results to associate them with certain careers. This sounds a lot like using MMPI techniques, but for matching people with careers instead of with personality traits.

Even more telling is that they come up with these matches by studying the answers to individual questions, not just overall scores in skill areas. Answering a particular way for a particular set of questions clusters well with certain career choices. Perhaps the results are crude initially, but the goal is that they will improve over the coming decades as we gather more data on career outcomes for people answering questions in a particular pattern of getting the answer right or wrong.

Now that he pointed this out, it seems obvious. Big data approaches are so exciting that we are going to try to apply them everywhere. And if answering a bunch of random questions can suggest certain personality traits, then answering a particular set of skill questions can suggest career paths.

The thing that is disturbing is that MMPI works because it is copyrighted and strictly standardized globally. This strictness has the benefit of tightly controlling the collection of observations about particular questions. The ever increasing pool of results enhances the empirical analysis because the exam questions are tightly controlled.

I don’t think this was an incentive for producing Common Core, but it clearly enables a big data product that can make this association. As the presenter mentions, the product promises a future where the exam results can tell a person what career would suit him best.

There is a lot of enthusiasm about the possibilities opened up by combing through large data sets with a large number of dimensions (such as individual exam questions). In the past, big data solutions worked with as-is data: data that was collected for unrelated purposes and from various sources with no standards for consistency. An example is combining marketing data with sales data and human resources data. The data has no common, consistent definition of concepts, even for something as basic as a mailing address. The challenge for the data scientist is to work out ways to make the data comparable.

The big data goals would be much easier to reach if standards could instead be pushed out closer to the observation sensors, so that every source of data follows a consistent data dictionary, for instance. In most scenarios, this is nearly impossible.
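As a rough sketch of what a shared data dictionary buys us, the example below maps the field names that two sources use for a mailing address onto one set of canonical names and one formatting rule. The dictionary contents, the field names, and the sample records are all hypothetical, invented only to illustrate the point.

```python
# Hypothetical shared data dictionary: canonical field -> names seen in sources.
DATA_DICTIONARY = {
    "postal_code": {"zip", "zipcode", "postal_code", "post_code"},
    "street": {"addr1", "street", "address_line_1"},
    "city": {"city", "town"},
}

# Reverse lookup from any source-specific name to its canonical field.
ALIAS_TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in DATA_DICTIONARY.items()
    for alias in aliases
}


def normalize(record):
    """Rename known fields to canonical names and apply one formatting rule.

    Fields that are not in the dictionary are dropped here; a real pipeline
    would more likely log or quarantine them.
    """
    out = {}
    for name, value in record.items():
        canonical = ALIAS_TO_CANONICAL.get(name.strip().lower())
        if canonical is None:
            continue
        # Collapse repeated whitespace and upper-case so values compare equal.
        out[canonical] = " ".join(str(value).split()).upper()
    return out


# Made-up records for the same address from two different systems.
marketing = {"ZIP": "20001 ", "Addr1": "12  Main st", "Town": "Springfield"}
sales = {"postal_code": "20001", "street": "12 main St",
         "city": "springfield", "order_id": 7}

print(normalize(marketing))
print(normalize(marketing) == normalize(sales))
```

When the standard is applied at the source, this translation step disappears. When it is not, the data scientist has to maintain a mapping like this for every source and every concept.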

But with Common Core and the way it is rolling out, it is a real possibility. There is a centralized standard for managing individual exam questions at particular grade levels. There is the possibility of broad consistency over space and over time.

I have a lot of concerns about Common Core, especially about how it adversely affects the possibility of individual fellowship between student and teacher. But at least I had some comfort that perhaps this is a temporary experiment, and that it can evolve over time to become some kind of hybrid approach. In particular, I hoped it would allow for some flexibility in terms of how and when to test.

Now I fear that such flexibility will be impractical. Just like the MMPI, the testing will by necessity have to be very rigid, and any changes must be made carefully and centrally. The reason is that the testing is not just to provide a momentary assessment of educational progress. The testing is essential for managing the entire education process of individual students and steering them into appropriate careers.

Big data offers an opportunity to find patterns of seemingly unrelated factors that coincide well with information we would like to learn (such as career choice). The benefits (real or imagined) will inevitably have the consequence of enforcing standardized methods for data recording so that records can be compared across large populations, from multiple sources, and over long time periods.

Common Core may be just the beginning of regulating other activities so that their records can be exploited by big data techniques.
