Risk of predictive analytics taking data too far

A recurring theme of many earlier posts is that the benefit of historical data is to discover new hypotheses with enough credibility to justify investment in further investigation. The necessary investment for a discovered hypothesis is to create new experiments that test it, where those experiments carefully collect new data controlled specifically for the hypothesis. A tested hypothesis then becomes trusted for prediction or control, using yet newer data specifically controlled to match the needs of the hypothesis.

In contrast to this three-step process, in which each step uses new data collected to meet the needs of the hypothesis, a predictive analytics project performs all three steps on the same data, as I described in this post. Predictive analytics can expose an organization to new forms of legal or civil liability. This risk can come from recycling data to take shortcuts around accepted practice. Even organizations with good practices for handling and protecting data can encounter new risks when they use that same data in predictive analytics projects.

The sales pitch for predictive analytics suggests there is no such added risk in using this technology. The argument is that if an organization deploys predictive analytics following the same good practices it uses for earlier stages of data science, and continues to use its existing processes for approving policy changes, then predictive analytics by itself adds no risk. In this view, predictive analytics is just an additional view into already secure data for already secure policy making.

In my view, predictive analytics introduces new information. That information may be derived from existing data, but it is new nonetheless. Predictive analytics invents three levels of new information, as described above: the new hypothesis, the test-derived trust in that hypothesis, and the predictive or prescriptive information that comes from using that hypothesis.

The only possible value added by predictive analytics is the new information it introduces for policy making. The risk is that this predictive information can expose existing decision-making policies to charges of fraud or of negligent failure to perform due diligence.

One of the risks of any new technology is future legal scrutiny that can build a good case for fraud. Even when the accused make a good case that any error was unintentional, they may still be exposed to a charge of negligence for failing to perform the due diligence needed to recognize obvious warning signs.

Are there any warning signs that should make us cautious about predictive analytics introducing fraudulent data into our decision making? My earlier post linked above about spurious correlations suggests that there are plenty of opportunities to find patterns that have no basis for trust. It is fair to propose a new hypothesis based on unexpected correlations in historical data. It becomes riskier when statistical testing of this new hypothesis reuses the same data instead of seeking new data specifically controlled for the hypothesis. Further risk arises when the predictive or prescriptive recommendations of this hypothesis are applied to the same data, without specifically controlling the data to be applicable to the hypothesis.
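The danger of testing a hypothesis on the same data that suggested it can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the data are pure random noise, and the sizes (30 rows, 200 candidate features, 20 trials) and all function names are hypothetical choices for illustration, not taken from any real project. The discovery step picks the feature that correlates best with the target. Measuring that correlation again on the discovery data makes it look strong, while measuring it on freshly generated data shows it is not.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def noise(rng, n):
    """Return n independent standard-normal values."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def one_trial(rng, n_rows=30, n_features=200):
    # Historical data: the target and every feature are unrelated noise,
    # so any correlation found here is spurious by construction.
    target = noise(rng, n_rows)
    features = [noise(rng, n_rows) for _ in range(n_features)]

    # Discovery: choose the feature most correlated with the target.
    best = max(range(n_features),
               key=lambda j: abs(pearson(features[j], target)))

    # Shortcut: "test" the hypothesis on the same data that suggested it.
    reused = abs(pearson(features[best], target))

    # Accepted practice: collect new data for the same feature and target.
    # Both are noise, so new observations are simply fresh draws.
    fresh = abs(pearson(noise(rng, n_rows), noise(rng, n_rows)))
    return reused, fresh


def run(trials=20, seed=0):
    """Average the reused-data and fresh-data correlations over trials."""
    rng = random.Random(seed)
    results = [one_trial(rng) for _ in range(trials)]
    mean_reused = sum(r for r, _ in results) / trials
    mean_fresh = sum(f for _, f in results) / trials
    return mean_reused, mean_fresh


if __name__ == "__main__":
    reused, fresh = run()
    print(f"mean |r| when testing on the discovery data: {reused:.2f}")
    print(f"mean |r| when testing on fresh data:         {fresh:.2f}")
```

Because the selection step searches 200 candidates, the winning correlation on the discovery data is inflated by the search itself. Only the fresh-data measurement reflects the fact that no real relationship exists.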

The consideration of using predictive analytics, during both the planning and operational phases, needs to include the risk that the process could become subject to legal or civil claims. In most cases this assessment may conclude that the risk is acceptable. However, there is still a need to scrutinize the process carefully. In particular, it is dangerous to skip this risk assessment altogether on the assumption that advanced analytics presents no additional risks.

There is an acceptance that analysis of data always carries the risk of being wrong. The very nature of most statistical tests involves some confidence level, such as one where there is a 1 in 20 chance of being wrong. If such tests are wrong, there is some defense in saying we were informed of the risk in the first place. That defense fails, however, if the confidence was incompetently computed, or if some systematic process ignored evidence that would reduce that confidence.
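One way a stated 1 in 20 chance can be miscomputed is by ignoring how many patterns were tested before one was reported. The sketch below assumes the tests are independent and each has a 0.05 chance of a false positive. Both assumptions are simplifications, and the function names are hypothetical. It computes the chance that at least one of k tests passes by luck alone, and the stricter per-test threshold (the Bonferroni correction) that would keep the overall chance near 1 in 20.

```python
def chance_of_any_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    passes by chance when no real effect exists."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha=0.05):
    """Per-test significance threshold that keeps the overall
    false-positive chance at or below alpha."""
    return alpha / num_tests


if __name__ == "__main__":
    for k in (1, 20, 100):
        overall = chance_of_any_false_positive(k)
        per_test = bonferroni_threshold(k)
        print(f"{k:4d} tests: overall chance {overall:.3f}, "
              f"corrected per-test threshold {per_test:.5f}")
```

With a single test the overall chance is the familiar 0.05. With 20 tests it rises to roughly 0.64, so reporting the one pattern that passed as if it carried a 1 in 20 risk would understate the real chance of error.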

Predictive analytics consists of algorithms designed specifically to perform tasks on data so voluminous and complex that no human could perform them. The algorithms automate the iteration of multiple approaches to find patterns, test and compare patterns, and select promising patterns along with the specific types of data that optimize each pattern.

Predictive analytics automates what humans previously did. The procedures humans previously performed had the risk of introducing fraud or negligence. It is very possible that the automation of this approach can do the same thing. It is possible to automate fraud and negligence. There is risk inherent in the assertion that the algorithms are performing operations too complex for humans to interpret, yet the organization accepts these algorithms as a suitable replacement for human analysts.

What is the basis of trusting anything from predictive analytics if there is no practical opportunity for humans to take personal responsibility for the results?   If the algorithms specifically handle data too complex for at least the available staff to understand, then how can that staff take responsibility for the correctness of the results?

At the end of the process, a person must make a case to a decision maker to recommend action. The decision maker needs assurance that the recommendation represents best practices. With predictive analytics, the only possible option is to rest the entire assurance on trust in the predictive analytics software. No one on staff is in a position to question, let alone validate, the recommendations. The process was fully automated from hypothesis discovery to hypothesis testing to prescriptive or predictive answers, all using the same data. The only thing left for humans to do is to make up stories about how the recommendations might make sense.

The above decision making assumes that no humans are involved at all. In many recent examples, the predictions or prescriptions from the algorithms are immediately implemented. The example I heard was of an online advertisement optimization system automatically updating its targeting multiple times a day based on the vast amount of data it collects. This is a safe example because there is little risk involved. If the algorithms get the prediction or prescription wrong, there is only a loss of revenue. But even here, there is some risk of claims by advertisement subscribers complaining that they are not getting fair service, for example when compared with their peer subscribers. I am not aware of anyone making a case like this, but I wouldn’t be surprised if one came up. For example, a disadvantaged business may find from independent testing that its paid service is not getting as many impressions as a comparable peer.

Optimization of advertisement targeting is one of the big success stories of predictive analytics.   Another success is in the area of customer management and in particular customer retention.

Many promotions claim that companies must employ predictive analytics in order to maintain their competitive edge, if not to improve it. These promotions present the advertisement targeting optimization scenario above as an ideal to approach as closely as practical.

In my earlier post about a demonstration of big data for managing a transportation system, I talked about what I thought was missing in the presentation. One of those omissions was the inevitable exploitation of the long-term retention of operational data for optimization. The promise of efficiency from such extensive instrumentation will mostly be realized by this exploitation of historical data. Predictive analytics provides the tools for that exploitation. Abstractly, there is little difference between operating a train system and operating an online advertisement targeting system. Predictive analytics can automatically change schedules, train-car allocations, elevator maintenance schedules, etc., all based on recently discovered patterns.

My earlier post on data systems optimizing retail worker scheduling provides an example where some form of predictive analytics is currently affecting people’s work lives in terms of how much money they can earn and how much disruption to their schedule they must tolerate. In that post, I discussed the need for the employee to regain some control by having access to the same data that is affecting his work life. As predictive analytics becomes more prevalent, it will affect more people’s lives, and that will raise the demand for more scrutiny of the processes that produce these changes. That scrutiny will include challenging the processes in the courts or in public opinion.

The point of this post is to raise the caution that there is a need for a serious discussion of the legal and civil liability risks involved with predictive analytics. When considering whether to introduce or expand the application of predictive analytics, questions about the risks of fraud, negligence, or other legal issues deserve careful consideration. We should not accept quick dismissals of this concern as irrelevant. We should carefully and seriously address any and all suspicions that predictive analytics can increase the chances of legal or civil challenges. This is a very legitimate topic to discuss when considering the role of predictive analytics in any particular application.
