Occam’s Razor in the age of big data

The concept of Occam’s razor is to prefer simpler explanations for information, or to demand that more complicated explanations have more explanatory power. In the original formulation, simplicity meant a smaller number of contributing factors. The number of factors is even built into the name: the use of a razor is to remove excess factors.

Our modern definition of the scientific method implies some use of, or reference to, this preference for explanations with the fewest possible contributing factors. This is not always explicitly stated, but we do place on more complicated explanations the additional burden of making more or stronger predictions, or of better explaining historical data.

There is good justification for selecting the simplest explanation.   To advance knowledge, we need to find explanations that are easy to put into practice and easy to communicate to others.     We want to apply this knowledge rapidly and have it spread widely.   Simpler explanations facilitate that project.

Unfortunately, we still equate simplicity with how it was experienced when the concept was first developed many centuries ago. At that earlier time, there were fewer technologies available for collecting, cataloging, and interpreting data. New observations were difficult and costly to obtain, and very expensive to distribute. The concepts also had to be expressed largely in human language, either verbally or through manuscripts. With those constraints, there was a strong economic case for preferring succinct explanations.

Modern technologies have greatly changed this economic case. We have access to inexpensive technologies that can rapidly collect large quantities of multiple-factor observations. Information technologies allow this information to be stored and transmitted inexpensively and quickly to a wide audience. Information technologies also greatly expand the analytic capacities of individuals and make this capacity available to a wide population. In short, we can handle much more complexity today than we could in centuries past. A lot of what we consider simple would have been impossibly difficult in earlier times.

One thing that has become simpler is handling more factors in our explanations. This is the basis of multidimensional data warehouses, which allow us to query large subsets of data across a large number of dimensions associated with a particular measure.
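To make the idea of querying a measure across several dimensions concrete, here is a minimal sketch in Python. The rows, the dimension names (region, decade, income), and the roll_up function are all hypothetical and invented for illustration; a real data warehouse would use its own query language and far more data.

```python
from collections import defaultdict

# Hypothetical fact rows: three dimensions and one measure ("count").
ROWS = [
    {"region": "east", "decade": 1990, "income": "low", "count": 12},
    {"region": "east", "decade": 1990, "income": "high", "count": 8},
    {"region": "east", "decade": 2000, "income": "low", "count": 5},
    {"region": "west", "decade": 1990, "income": "low", "count": 7},
    {"region": "west", "decade": 2000, "income": "high", "count": 9},
]

def roll_up(rows, dimensions, measure="count"):
    """Sum the measure for every combination of the requested dimensions."""
    totals = defaultdict(int)
    for row in rows:
        # The key is the tuple of this row's values for the chosen dimensions.
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)

# A one-factor view and a two-factor view are the same call
# with a longer list of dimensions.
by_region = roll_up(ROWS, ["region"])
by_region_decade = roll_up(ROWS, ["region", "decade"])
```

The analyst's effort is the same whether the list names one dimension or several, which is the sense in which adding factors no longer makes the work harder.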

While there is still a preference for simpler explanations, or for demanding more from more complicated explanations, we should consider carefully what we mean by simpler and more complicated. With modern technologies, there may be multiple equally simple explanations involving widely different numbers of factors. It may be as easy and inexpensive to work with a 10-factor explanation as it is to work with a 4-factor one.

Different explanations should not be distinguished in terms of Occam’s razor if they are nearly equally easy for analysts to learn, comprehend, apply, and communicate. Given modern technologies, this permits a wide range of different explanations that should not be dismissed based on some outdated notion of what is simpler (such as a smaller number of factors).

Likewise, we can relax our demand that additional considerations add substantial explanatory value. The economy of modern technology permits us to enjoy even the smaller advantages of more comprehensive explanations.

One of the appeals of big data projects is that the technology can absorb huge, complex data sets and make that data accessible to a wide audience. Unfortunately, older ideals of elegant solutions can get in the way of fully exploiting these new capabilities. Frequently we still seek out simple explanations that our ancestors would appreciate. Implicitly or explicitly, we are invoking Occam’s razor, perhaps to our disadvantage.

For example, in an earlier post I described the easily measured body-mass index (BMI). There have been a number of explanations in which particular health conditions are predictable from this single factor. Such single-factor explanations become very popular even as some point out that BMI can be misleading or that other factors may be important. Although BMI is a single factor, it is computed from two measures (height and weight). There are readily available online calculators that ask for these two numbers and return a computed assessment of health risks. Such technology could easily ask for more factors, and often it does by allowing entries for sex and age. The point is that many users turn to a calculator to assess their health, and they could enter a wide range of factors as easily as they enter their height and weight. The calculators could ask for income range, employment status, lifestyle, etc. Even if the list is so long that different people have to leave some items unanswered, there is enough additional information to match the individual to certain health risks. There is little difference in simplicity between such longer assessments and BMI. BMI remains popular because it provides single-factor explanations that could be appreciated by our ancestors: it can be discussed easily and quickly in human language.
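As a rough sketch of the point about calculators: BMI is weight in kilograms divided by the square of height in metres. The function names and the optional fields (age, income_range) below are hypothetical illustrations, not the interface of any real calculator, and no health-risk model is implied.

```python
def bmi(weight_kg, height_m):
    """Body-mass index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2

def assessment_inputs(weight_kg, height_m, **extra_factors):
    """Hypothetical form: two required entries plus any number of optional ones.

    Factors the user leaves unanswered (None) are simply dropped, so a longer
    form is no harder to process than the two-number one.
    """
    record = {"bmi": round(bmi(weight_kg, height_m), 1)}
    record.update({k: v for k, v in extra_factors.items() if v is not None})
    return record

# The same call handles two entries or many.
two_factor = assessment_inputs(70, 1.75)
many_factor = assessment_inputs(70, 1.75, age=40, income_range=None)
```

From the user's side, the longer form means typing a few more values. The computation behind it does not need to be recited in words to be reused.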

We are no longer constrained to communicate ideas through human language (print advertisements, news synopses, etc.). We can enter a wealth of factors into a form and have a computational or query engine compute an assessment based on very complex algorithms and extensive data. We can reuse the knowledge without having to recite it from memory in human language.

Another example is my post on historic trends for baby names, where one site shows the trends as simple decade-by-decade curves for specific baby names. I noted that the interactive chart with this simple information was very interesting. The referenced website had a modest goal of providing this type of information for parents or expecting parents. In that post, I suggested how this same data may shed some light on the social dynamics as a whole. I made some conjectures based just on the shape of the frequency curves over time. My point today is that the data was collapsed to a simple relationship when a similarly simple tool could be available to explore a more complete set of data. That more complete data may include such information as the geographic region (state and county), the parents’ income bracket (or the county’s median income), the overall birthrate within that region, etc. This much richer data is increasingly available, and tools exist to allow just as easy access to it. Such data could show, for example, that a particular name’s popularity tracks the relative birth rates of a particular region where that name was more common. Or it may start to support my initial conjecture that names reflect the hopes or concerns the parents are feeling in their own lives. We have access to simple and effective query tools that allow us to navigate such richer data sets, and thus there is no need for simplification for expediency’s sake. Our simplifying the data invokes an older tradition of keeping it simple.

We may also be unnecessarily invoking Occam’s razor in the design of experiments. An earlier post described a breeding experiment involving mice where one had an instinct to build escape tunnels for its den and the other did not. Judging just from the simplified video presentation, the experiment appeared to focus on a single observation (burrow building) based on one factor (genetics). The execution of the experiment appears to be good enough to impress earlier scientists. The hypothesis assumes a simple genetic explanation (consistent with Occam’s razor) and the design of the experiment is focused narrowly on that relationship. This is consistent with experiments that were performed 100 years ago. Even though the experiment was testing a single simple hypothesis, modern technology allows us to collect a lot of additional observations that might provide more information about what is going on. Observations of all other behaviors could show that a much wider variety of behaviors is affected by the same genes. Perhaps the gene also changes a preference for a certain type of food, or changes the time spent grooming or sleeping. At the least, this would weaken the case that the gene specifically instructs the mouse to build an escape tunnel. We should take advantage of our affordable technology to collect, store, query, and share these additional observations.

We may be unnecessarily invoking Occam’s razor in our use of data for enforcing laws or preventing crimes. In my last post, I introduced a couple of recent news examples where innocent people were unnecessarily subjected to search or arrest. In these cases, there is an excessively simple model involving a single factor: it suggests that a purchase of a certain quantity and type of product from a garden supply store can indicate reasonable cause for suspicion of cultivating marijuana plants. There is a suggestion that modern law enforcement is gaining increasing confidence in its data models, allowing it to prosecute more aggressively based on limited information. I do not know what explains this confidence, but I can imagine that multiple factors together can suggest such activities with high confidence, and that the confidence is not reduced too much when the factors are cut down to just one or two. In other words, it appears to be an application of Occam’s razor in the sense of removing factors from an earlier multiple-factor explanation with strong predictive power. The reduction in factors retains enough residual predictive power to provide confidence for more aggressive action.

In the above examples, I do not think that Occam’s razor was explicitly invoked.  More likely, we are using practices we inherited from earlier generations who based their practices in part on the principles of keeping things easy to understand with more limited information technologies.     Today we still strive for explanations we can discuss verbally in short articles, lectures, or videos.    We are denying ourselves the full benefit of our information technologies by requiring our explanations to be translated into human language.

Modern technologies allow for simple and widespread access to very rich data. What we lack is a culture of communicating ideas through these technologies. As I described in this post, we need to build a new culture that appreciates the new medium of big data stores and that has the skills to query that data to explore its multidimensional depth. Culturally, we are holding ourselves back by demanding unnecessarily simple explanations based on a tradition that was developed at a time that needed simpler explanations.

