Big Data predictive analytics: A Horoscope for our times

Big data, or what I would prefer to think of as crowd data, and its associated technologies and algorithms bring the possibility of showing proof of patterns that occur, especially when we impose categories on the crowds.

The motivation for this post was my response to the increasingly common practice of presenting data analytics of human populations as scientifically endorsed because they involve lots of data and appropriate statistical analysis. In particular, I objected to the notion of assigning an entire population of individuals to a label popularly referred to as the Millennial generation, roughly defined as people who were born in the 1980s and 1990s but sometimes even later.

Even though the term is used loosely, a particular scientific study that compares data about different generations may define a precise range of birth years.   One study may define it as exactly 1981 to 1998, but different studies may define it differently.

From a perspective of someone with experience summarizing queried data, I appreciate the convenience of creating ranges as categories in order to compare aggregates of different categories.    With crowd data (big data) it is terribly inconvenient to look at each data point individually.   Categories make analysis much more practical.

The challenge is choosing what to use for categories. I am no authority on the study of demographics, but it seems a relatively recent (within my lifetime) phenomenon to popularize the concept of a fixed generation in which to place individuals. My impression is that the notion of generations became popularized with the 1970s craze about the unusual demographics of what was called the baby boom, now shortened to the boomers.

Again, it is only my impression, but it seems that once the idea of a boom generation became popular, there had to be other generations as well, and so arose the concept of dividing the population into generations (roughly 18-20 years each) since the start of the country. This gave the convenience of labeling them in sequence. Generation X was the tenth generation since the start of the USA. I recall that for a brief time it seemed we would just keep that numbering sequence, but then they decided that generation Y was next. I have no idea who belonged to generation A, but I’m sure someone somewhere has located that generation.

Most data analysis projects rely heavily on the reduction of continuous measures into discrete categories defined by a range of numbers. The problem with these categories is that they need names in order to show them on charts or pictures. In my own experience, I attempted to take a literal approach and name a category (using this example) something like the 1981-1998 bin of the birth-year category. That was a very honest presentation of the arbitrariness of the designation and the lack of justification for why the dates were chosen. In terms of grabbing people’s attention, the attempt failed.
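The literal naming approach described above can be sketched in a few lines of Python. This is only an illustration, not the code from any actual project: the function name `literal_label` and the list `EDGES` are hypothetical, and the bin edges are assumptions apart from the 1946-1964 and 1981-1998 ranges mentioned in this post.

```python
from bisect import bisect_right
from collections import Counter

# Lower edge of each birth-year bin. The 1946-1964 and 1981-1998 ranges come
# from the post; treating 1965-1980 as one bin in between is an assumption.
EDGES = [1946, 1965, 1981, 1999]


def literal_label(birth_year):
    """Name a bin by its literal year range instead of a colorful label."""
    i = bisect_right(EDGES, birth_year)
    if i == 0:
        return f"before {EDGES[0]}"
    if i == len(EDGES):
        return f"{EDGES[-1]} and later"
    return f"{EDGES[i - 1]}-{EDGES[i] - 1}"


def count_by_bin(birth_years):
    """Aggregate a crowd of birth years into per-bin counts."""
    return Counter(literal_label(y) for y in birth_years)


if __name__ == "__main__":
    # Made-up sample values, for demonstration only.
    print(count_by_bin([1950, 1962, 1985, 1990, 1998, 1999]))
```

Changing a single number in `EDGES` moves people from one bin to another, which is exactly the arbitrariness that a literal label keeps visible and a colorful label hides.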

An audience presented with the concept of categories will demand descriptive labels for each category. That same audience will demand some meaning behind those labels. That very same category described as 1981-1998 becomes more satisfyingly labeled Millennial. As a label, Millennial stands in contrast to the older generations (X, Y, Boomer, silent, Bust, greatest, etc). The label also has great potential for meaning, capturing the mystical calendar event of the year 2000 and the real-world events that defined this generation’s youth, such as the rapid and widespread availability of Internet and mobile communications.

So now we have defined the same category with a colorful label that has a meaning. The change to dates starting with a 2 instead of a 1, or the change of having ready access to highly powerful personal electronic devices enabling instant worldwide communications, must have left an imprint on this generation. The label and its meaning must have had fundamental consequences for the character, fortunes, and fates of this generation that set it apart from all of the other generations.

The meaningfully labeled generations can be compared and contrasted with each other. The baby boomers enjoyed a growing world economy with US dominance while the Millennials suffer a decline in US dominance and economy. In contrast, the Millennials enjoyed immersion in personal electronics, computing, and communication devices during their formative years, allowing them to leverage these tools of the new economy more effectively.

There are plentiful examples of contrasts of Silent, Boomer, X, Y, Millennial, or what’s next.   Many of them are based on some anecdotal evidence.  Some of that anecdotal evidence is supported by crowd data and given scientific credibility by the acceptable application of statistical tests.

So, here we are. I’m a boomer. I live in Arlington, Virginia, and have lived here since the 1980s. However, I live near the metro line corridor that encourages building lots of apartment buildings that are very attractive to younger generations. I live surrounded by much younger people.

As far as I should be concerned, they are peers of mine who happen to be much younger.    However, this is not how things turn out in social encounters or even business encounters.

It reminds me of the 1960s-era trend (at least as depicted in media) of the first question asked of a stranger being “what is your sign”. I’m Libra, you’re Capricorn, oh, we are not going to get along.

Astrology lingers in our culture today but it doesn’t come up often in conversation.  When it does, often one of the speakers would not even know his sign.

But we recognize generations.

I think there is some credibility to the possibility of there being consequences of being born at different times of the year. The crucial formative first few months will leave different lasting impressions if those months were characterized by lengthening daylight, warmer weather, more time outside, and vibrant nature, in contrast to months characterized by shortening daylight, declining nature, and more time indoors. We accept the notion that each successive month of life for the first 18 or so months leaves critical formative impressions on the developing personality. It could matter a lot if the winter celebrations of happy music, bright decorations, and frequent parties or feasts occurred during a child’s 2nd month instead of his 10th month.

I am inclined to believe it might be useful to make categories of personalities based on birth months. Again, once we introduce the concept of categories, our audience will demand colorful labels for the categories, and meaning for the labels chosen.

Libra is a much more satisfying label for the birth-month of somewhere around October. Libra conjures an image that sets it apart from the other labels. We can ask around our acquaintances and notice that those within the Libra category appear to share some qualities that we can associate with the image. Libras are balanced, or they are balancers (taking measure), or they are frustrated by imbalance, or they strive to add counterweights to find the right balance. We can observe some kind of balance metaphor in the data we assigned to this category.

It is not hard to imagine that Astrology was the original crowd data predictive analytic project. Just as in predictive analytics, they needed categories and based them on months that are most consistently defined by positions of the zodiac. July is summer in the Northern Hemisphere but winter in the Southern Hemisphere, yet both hemispheres can observe the same zodiac constellations. They then collected observations of their surrounding crowds and began to derive predictions specific to the labels.

Of course, there are plenty of flaws in their approach. Even if they were very diligent, they were limited to data they could observe locally. It was not possible to collect observations on the global scale needed to build a zodiac-based model. By the time of the modern age, the predictive power of their theory is readily dismissible because their built-in allowances are impossible to falsify.

The motivations and practices of the developers of Astrology were similar to the motivations and practices of crowd data predictive analytics.   Astrology starts with a reasonable observation that birth months can imprint on a personality.  Astrology sought out evidence to define or refine the characteristics of each category.   Astrology sought to make predictions about currently encountered people.

1960s: “Hi, I’m a Libra. Oh, you’re a Capricorn. We will never understand each other as well as we can understand those with the same sign.”

2010s: “Hi, I’m a baby Boomer.   Oh, you are a Millennial.   We will never understand each other as well as we can understand our own generation.”

Today we have access to petabytes of global crowd data.  We have trusted statistical algorithms to apply to arbitrary categories.   Categories are just the right size when they divide a population into about a dozen different labels.   After assigning data to their categories, algorithms will discover differences and those differences can have predictive powers (or merely be spurious).
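The warning about spurious differences can be made concrete with a small simulation. The sketch below is an assumed, simplified setup, not an analysis of real data: it assigns made-up individuals to a dozen arbitrary categories and gives each a trait score drawn independently of the category, so any difference between categories is noise by construction. The names `category_means` and `spread` are hypothetical.

```python
import random
import statistics


def category_means(n_people=1200, n_categories=12, seed=0):
    """Mean trait score per arbitrary category, on purely random data."""
    rng = random.Random(seed)
    groups = {c: [] for c in range(n_categories)}
    for _ in range(n_people):
        category = rng.randrange(n_categories)  # arbitrary label
        trait = rng.gauss(0.0, 1.0)             # drawn independently of the label
        groups[category].append(trait)
    # Skip a category in the unlikely case that nobody landed in it.
    return {c: statistics.fmean(v) for c, v in groups.items() if v}


def spread(means):
    """Gap between the highest and lowest category mean."""
    return max(means.values()) - min(means.values())


if __name__ == "__main__":
    means = category_means()
    print(f"gap between highest and lowest category: {spread(means):.3f}")
```

Even though the trait has nothing to do with the label, some category always ends up highest and another lowest. Comparing just those two without correcting for the dozen comparisons available is one way an arbitrary category can appear to have predictive power.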

One of the themes that frequently recur in my posts is my assertion that data is always historical.   Categorizing data is a useful trick to simplify the interpretation of large amounts of historical data.   Where there are many options for defining boundaries between categories, the eventual selection of a dozen or so categories helps tremendously in gaining an understanding of what happened in the past.   The project can lead to new discoveries.   Summarizing categories of data lies at the heart of what I call hypothesis discovery: the identification of new hypotheses that demand future testing.

Another theme I return to is that a discovered hypothesis is one step removed from the data and at least two steps removed from a decision. A discovered hypothesis resides in the same domain as the earliest Astrological personality assessments. It suggests a theory that needs testing that diligently scrutinizes its predictive powers.

The studies of collective characteristics of different generations may provide some useful observations of how different times are influencing different generations.   I raise an objection to the notion that these discovered observations be predictive.

Perhaps it is futile on my part, but I want to impose a one-way mapping of datum to category. Assigning a category to a datum (such as an individual) is a very useful tool for analyzing historical data, especially crowd data. I object to the practice of identifying the datum by its category.
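One way to picture the one-way mapping is a record type that carries no category field at all, with the category computed only at the moment of summarizing. The following Python sketch is a hypothetical illustration: `Person`, `birth_year_bin`, `summarize`, and the fixed 20-year bins are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """An individual record. Deliberately has no category or label field."""
    name: str
    birth_year: int


def birth_year_bin(person):
    """Map a person to a bin (datum to category), assuming 20-year bins."""
    start = person.birth_year - person.birth_year % 20
    return f"{start}-{start + 19}"


def summarize(people):
    """Aggregate historical records by bin; the records are left unchanged."""
    return Counter(birth_year_bin(p) for p in people)
```

The category exists only inside the summary. Nothing in the record says which bin the person belongs to, so there is no label to read back as an identity.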

In the supposed introductions above, the fact that I’m a Libra is a refinement of my identity.  Instead of being some individual named Ken Neumeister, I’m to be known as Ken Neumeister of the Libra family of the Boomer tribe.

It is reasonable to object to having my present person identified with a particular arbitrary label with the inevitable result of predicting who I am or will become based on that label.   I respect the autonomy and free will of individuals who should be observed and evaluated on their own merits.   I see no reason why people shouldn’t demand that respect from everyone.

Even if there is some predictive power to the categories, I may object to being assigned to that category.

Personally, I have never considered myself part of the baby boomer generation, for a variety of reasons. My birth year does fall between 1946 and 1964, but I don’t belong to the stereotypical notions of that generation. For example, I have never appreciated music identified by the names of the bands that had exclusive rights to play it. I recognize the songs as being popular when I was young, but they were songs meant for someone else’s enjoyment.

Finally, I raise an objection as a data scientist.   Informing an individual of his predictive label presupposes what is expected from that individual.    At least for humans, the individual will recognize the meaning of this label and will react to that meaning.   The individual may adapt to conform to the label, or the individual may rebel against the implications of the label.

Categorical labels offer their strongest benefits when the labels are conceived and assigned after the fact, when applied to historical data to interpret what happened in the past. Using categories to predict future behavior will have the inevitable effect of biasing the behavior to conform to or reject that category. For the social animals that humans are, the tendency is to conform to the expectations of the group. I see this happening today with the influences of the Millennial label on young people.

The prediction becomes self-fulfilling.  This self-fulfilling evidence reinforces the validity of the predictive approaches and that encourages its broader application.    Predictive analytics become unfalsifiable when the individuals being measured are aware of their categories and the expectation of shared traits within that category.

Eventually, the Age of Data reincarnates the Age of Aquarius.
