Data Science and Health Insurance

My thinking about the Affordable Care Act (ACA or Obamacare) is entangled with my thinking about data science.

As background to my thinking, the ACA seems to be more about insurance than about the delivery of health care. The fundamental premise is that health care would otherwise be inaccessible without insurance coverage. This lack of access may come from a provider refusing to deliver services to a patient without a health insurance plan, or from a patient declining to seek services in order to avoid out-of-pocket costs. The latter is emphasized by the ACA’s requirement to provide zero-cost coverage for certain routine preventive health services. The ACA removes certain disincentives for a patient to see a doctor or for a doctor to see a patient.

The affordability is from the perspective of the patient at the time health care services are delivered: the out-of-pocket expenses in a bill will be affordable. To make this happen, the same individual has to pay more in premiums, and it is possible that the combined annual cost of premiums plus out-of-pocket expenses can exceed what is affordable. The affordability is not from the perspective of a healthy person having to pay the premiums.

The legislation is primarily about insurance rather than health care itself. If one is in need of health care and is able to find available health services, then the cost of those services will be affordable because of the insurance. As we have seen, the plans do restrict what is covered, particularly through narrow provider networks and limited drug formularies. Even assuming that the required services are covered, there is still no guarantee that those services will be available. If they are available, they will be covered with low out-of-pocket expenses, but they may not be available. An extreme example is organ transplants, where there simply may not be a suitable donor organ available when the patient needs it. However, most examples are more routine, such as the simple fact that all of the available doctors are already fully booked.

I focus more on the insurance aspect than on the health care provider aspect. Insurance itself is about balancing expenses against cash from premiums. The legislation requires that this balance be very tight to avoid excessive profits in any particular year. There are also constraints on how fast premiums can rise over the years. The assumption is that when averaged over the entire population, the averages will not change much over the years. This assumption about very large populations may not hold because of the geographic limits on where a particular policy can be offered. The legislation has mechanisms for cost transfers from less costly regions to more costly regions, but that is of limited value when each region has to set premiums very close to its own regional expenses.

Beneath all of the above somewhat obvious issues of financing access to health care, there is a common foundation of data.   Data is required to make all of this work.    Data is at the foundation of virtually everything about this act.   In my opinion this foundation is very weak and is not able to support the weight of all that is expected to stand on top of it.

One clear example of this was the surprise at how difficult and costly it was to deliver a seemingly simple web site for the most visible part of this system: the initial enrollment into an insurance plan.

As I tried to follow the news on this topic, I was disappointed that so much of the focus was on the software, as if the problem were primarily about code. Certainly there are plenty of things to criticize in code once it becomes available. This is true for just about any software project. Code is easy to read, and every conceivable approach has a competing approach with different advantages and disadvantages. There is no shortage of people who can criticize code, and they’ve been doing a thorough job of it.

My disappointment was more in the lack of coverage of the problem with the data. I see a fundamental difference between data and software designed to handle data. Most of the time we approach a project like this as a purely software project. In that view, data is simply a software object, and software code is built around data objects. My observation is that a software data object is not the same thing as data. A data object is a package that holds data and defines the operations that can be performed on that data. Those software objects are useless until the package is actually filled with specific data.
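The distinction can be sketched in a few lines of Python. Everything here is a made-up illustration: the `Applicant` class, its fields, and the sample record are hypothetical and are not taken from any real enrollment system. The class is the package; the dictionary stands in for data arriving from the world in a shape the package did not anticipate.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    """A hypothetical data object: slots for data plus operations on them."""
    name: str
    annual_income: float

    def monthly_income(self) -> float:
        # The operation assumes annual_income is a single number.
        return self.annual_income / 12


# The package filled with data that fits it.
filled = Applicant(name="Pat", annual_income=36000.0)
print(filled.monthly_income())

# Data that exists outside the package: income known only as a rough range.
raw_record = {"name": "Pat", "annual_income": "between 20k and 40k"}

fits_package = True
try:
    Applicant(**raw_record).monthly_income()
except TypeError:
    # The code did not break; the data simply does not fit the package.
    fits_package = False

print(fits_package)
```

The class behaves exactly as written in both cases. What differs is whether the data happens to match the assumptions built into it.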

I’m reminded of a very early argument with a hardware engineer about estimating failures in terms of mean time to failure and mean time to repair. The argument was whether software could ever fail in the way hardware can fail, in the sense of wearing out. Hardware may work well for a long time but may eventually wear out for various reasons. Software that works initially will continue to work forever. I think this is true. Software will always do the same thing. It never wears out.

From my simulation experience, there was a time when we could write software to model a phone. The software modeled a phone that did only a few things (ring, give a dial tone, deliver voice signals), and in order to use it a human had to go to the fixed location where the phone was installed. If that simulation was built in the 1980s, I have no doubt it would still work today. It didn’t wear out. The problem is that modern phones don’t work that way anymore. The software still works, but the data has changed.

When we combined data with software and called it a data object then it became possible for software to essentially wear out or at least become obsolete.   The data no longer can fit in the package previously designed to carry it.

The problem comes up with how to deal with that kind of failure. I’m reminded again of that earlier argument with the hardware engineer (by the way, at the time I was on his side of the argument). When we combined data and software into a software object, we essentially bought into the notion that the combination made the project a hardware project. We apply hardware concepts to our thinking about software objects.

If a software object fails, we essentially approach the problem in the same way that an auto mechanic will try to figure out why a car engine won’t start: there is something mechanically wrong with the engine that when fixed or replaced the engine will run fine again.  Likewise, we dismantle the software, find the broken object and try to fix it or replace it.

In fact, the object didn’t suddenly break.   Assuming that the object wasn’t broken from the start, the object still works.   The problem is that something about the data changed.    Again we are encumbered by that hardware analogy that was inherent from the very start of object-oriented software design: we think something must have failed and needs to be repaired or replaced.

Perhaps a better analogy would be complaining that the engine doesn’t run when the car is submerged in a lake. That’s not something a mechanic is going to solve. Either get the car back on dry land, or acquire some other vehicle designed to operate under water. That’s what is happening when software objects stop working. They are no longer suitable for the environment.

Nothing broke or failed. Data was just being data. Data exists outside of the software that is intended to manipulate it. Software objects are our current best understanding of that data, but that understanding is inherently limited. We never fully understand all the possibilities of even current data. We certainly don’t know what the data will be like in the future.

I assert that data inherently has an ability to surprise us. That may be part of the definition of data: data is that which has the potential to surprise us. Software itself will not surprise us. Software may surprise us initially, for example when a simulation using random numbers produces an unexpected result. But that software will always come to the same result when the simulation starts with the same random number seed.
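The point about seeds is easy to check. Below is a minimal sketch using Python's standard `random` module; the function `simulate_total_cost` and its parameters are invented for illustration and do not model any real claims process.

```python
import random


def simulate_total_cost(seed: int, n_people: int = 1000) -> float:
    """Toy simulation: sum of randomly drawn costs for n_people."""
    rng = random.Random(seed)  # private generator, fully determined by the seed
    # Draw each person's cost from an exponential distribution (mean 500).
    return sum(rng.expovariate(1 / 500.0) for _ in range(n_people))


first = simulate_total_cost(seed=42)
second = simulate_total_cost(seed=42)
other = simulate_total_cost(seed=43)

print(first == second)  # same seed, same result
print(first == other)   # a different seed gives a different draw
```

Whatever surprise the first run holds, the second run with the same seed cannot add to it.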

Data surprises us in a completely different way.

It was easy to imagine health insurance covering a risk pool wide enough that it included a majority of healthy people whose premiums would cover the expenses of the unhealthy. Generally, most people are mostly healthy. What surprised us is that applying this algorithm to all of the possible markets resulted in such a wide distribution of the premiums needed to support the expenses. Some pools may inherently be more unhealthy than others. This is a discovery about the data, not about the software object.
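A small worked example shows how the same pooling rule can yield very different premiums. The pool names, member counts, and per-person costs below are invented numbers chosen only to make the arithmetic visible; they are not estimates for any actual market.

```python
def break_even_premium(n_healthy: int, n_unhealthy: int,
                       cost_healthy: int, cost_unhealthy: int) -> float:
    """Annual premium per member at which premiums exactly cover expenses."""
    total_expenses = n_healthy * cost_healthy + n_unhealthy * cost_unhealthy
    return total_expenses / (n_healthy + n_unhealthy)


# Hypothetical pools of 10,000 members each: (healthy count, unhealthy count).
pools = {
    "pool_a": (9500, 500),
    "pool_b": (9000, 1000),
    "pool_c": (8000, 2000),
}

# Assumed annual cost per member: 1,000 if healthy, 20,000 if unhealthy.
premiums = {
    name: break_even_premium(healthy, unhealthy, 1000, 20000)
    for name, (healthy, unhealthy) in pools.items()
}

for name, premium in premiums.items():
    print(name, premium)
```

Under these assumptions the break-even premium is 1,950 in the first pool and 4,800 in the third, even though the function applied to each pool is identical. The spread comes entirely from the data.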

It was easy to imagine that eligibility for subsidies could be based on annual income. What surprised us is that so many people don’t know or can’t predict their annual income. Perhaps what surprises us even more is the fact that a person may be mistaken about his annual income: people can lose their jobs or get an unexpected bonus. The whole impression of affordability depended heavily on the certainty of a person’s knowledge of his annual income. That software object is not broken; it is simply not appropriate for the real world.

There were certainly some real software errors contributing to the problems with the health insurance marketplace websites. However, many of the problems I read about looked more like data issues. One example was people being unable to establish their identity or the identity of their dependents. In some cases there were errors in the underlying systems that check identity. But in other cases there was real ambiguity of identity. A person may have a different understanding of a certain fact than what was recorded in official records. This alone was surprising, but even more surprising is that so many people have this problem.

The Affordable Care Act implementation necessarily ties together a huge number of different data repositories, and each of them is a mess. That mess was hidden until we attempted to tie together two pieces of information: for example, what a person thinks a fact is and how that fact was actually recorded. Previously the discrepancy was not detected because the data was used in a consistent way without needing the opinion of the individual.
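The kind of mismatch described here can be illustrated with a short sketch. The field names and both records are hypothetical; a real identity check would involve far more than comparing two dictionaries.

```python
def find_mismatches(self_reported: dict, official: dict) -> dict:
    """Return fields where the person's answer differs from the official record.

    Each mismatch maps the field name to (self-reported value, recorded value).
    A field missing on one side shows up with None for that side.
    """
    all_fields = self_reported.keys() | official.keys()
    return {
        field: (self_reported.get(field), official.get(field))
        for field in all_fields
        if self_reported.get(field) != official.get(field)
    }


# What the applicant believes to be true (hypothetical).
applicant_answers = {
    "last_name": "Smith-Jones",
    "birth_date": "1980-03-01",
    "zip_code": "20001",
}

# What an official repository recorded (hypothetical).
recorded = {
    "last_name": "Smith",
    "birth_date": "1980-03-01",
    "zip_code": "20001",
}

mismatches = find_mismatches(applicant_answers, recorded)
print(mismatches)
```

Neither record is wrong on its own terms. The disagreement only exists once the two are compared, which is exactly the step that earlier uses of the data never required.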

From an object-oriented software perspective, the argument is that this is evidence of insufficient testing. It is true that this kind of problem could have been discovered in advance. The problem was lurking in historical data. It could have been found with sufficient (though hugely expensive) testing.

In contrast, a data-science perspective takes the argument that we should not have been surprised at being surprised. We should expect to be surprised. We should have built the overall architecture with the certainty that data will always surprise us. It is folly to assume we can build and test software to the point of eliminating surprises.
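One possible reading of an architecture that expects surprises is a pipeline that sets unexpected records aside for study instead of treating them as failures. The sketch below is only an assumption about what that could look like; `process_records`, `parse_income`, and the sample records are all hypothetical names and data.

```python
def parse_income(record: dict) -> float:
    """Hypothetical handler: expects a numeric 'annual_income' field."""
    return float(record["annual_income"])


def process_records(records: list, handler) -> tuple:
    """Apply handler to each record, keeping surprises instead of crashing."""
    accepted, surprises = [], []
    for record in records:
        try:
            accepted.append(handler(record))
        except (KeyError, TypeError, ValueError) as exc:
            # The record does not fit our current understanding of the data.
            # Keep it, with the reason, so the surprise can be examined later.
            surprises.append((record, repr(exc)))
    return accepted, surprises


records = [
    {"annual_income": "42000"},
    {"annual_income": "unknown"},  # value the handler did not anticipate
    {},                            # field missing entirely
]

accepted, surprises = process_records(records, parse_income)
print(accepted)
print(len(surprises))
```

The design choice is that the list of surprises is a first-class output, on equal footing with the accepted results, because it is where new knowledge about the data shows up.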

The above example that seemed fixable by adequate testing concerns existing historical data. It is possible that we could thoroughly test all records in the historical data and prove that the software handles all special cases properly. That software will still fail with future data. This data is about humans, whose even instinctual behaviors are hard to predict. Humans also tend to change behaviors based on current incentives: humans deliberately change in ways that will invalidate software about them. Humans participate in a society that is rapidly changing in terms of how we define identity, income, and even health care expectations.

The initial (and current) implementation of the Affordable Care Act is based on our confidence in software objects. The assumption is that eventually we will build solid software objects that will never fail. That project is doomed to fail. The right approach is to focus on the data and respect data’s inherent ability to surprise us. The challenge is the data, not the software.


2 thoughts on “Data Science and Health Insurance”

  1. list showing the problems of the various exchanges. The problems are described in terms of costs and effectiveness of the programs, again as if the problem was incompetent application of technology. I believe the actual problem is lack of understanding of the data that the technology must deal with.
