One way to summarize a recurring theme in many posts on this blog is to question how well we can trust the data. My initially naïve definition of a data scientist centered on the skill of challenging the trustworthiness of data. I later gave this definition my invented title of dedomenologist, because the popular term data science appears much more focused on the technical achievements possible if we take an optimistic view that data is trustworthy, or at least mostly trustworthy.
I concede that there is value in the popular terminology of data science. We need to explore how far we can go with the assumption that the underlying data is trustworthy. This effort is primarily technical in nature, either in developing faster or more specialized algorithms or in building easier-to-use platforms that make these tools available to a wide population of users. Underlying these technologies is an assumption (mostly) that the data quality issue is solved elsewhere.
Data tools do address quality to some extent by checking for common mistakes or errors that can occur with data. This level of data quality accepts the basic trustworthiness or acceptability of a particular data type or source; data quality becomes a matter of cleaning up the errors or mistakes that occasionally occur. Although narrow in scope, this definition of data quality is very challenging because the nature of errors varies by data type and will change over time as technologies change.
In earlier posts, I expanded the definition of data quality based on varying levels of trust in data types themselves. I presented a taxonomy of trust in data using my own terminology for the sake of presenting the argument.
- At the top of trustworthiness is bright data: data that is well documented and controlled. This type of data is very rare.
- Most data types present some doubts about documentation and control. My taxonomy labels this as dim data. Most data is dim, but there are varying levels of dimness. We trust dim data but we have to verify it. This is close to what I described earlier as a common definition of data quality efforts.
- Many times, observations are either missing or impractical to obtain. In their place, we substitute data generated from computer models based on accepted theories. Like dim data, we trust this model-generated data to varying degrees. However, I set this data apart from dim data by calling it dark data. Unlike dim data, dark data presents no new observation about the real world. It only derives from other data what a particular observation must be if our understanding of the real world is correct. This type of data introduces the possibility of doubting the validity of the understanding implemented in the models.
- A cousin to dark data is what I called forbidden data. We reject forbidden data because the models predict that such an observation is unlikely. Forbidden data includes what is rejected in our quality controls of dim data. Like dark data, we can challenge the trustworthiness of the models. The data may be incorrectly rejected by invalid models.
- I then presented other data sources that are present but may be irrelevant. I gave them names such as accessory data (bright data that has no relevance) or unlit data (data that has never been scrutinized). The problem with these data is that although they are irrelevant, they are present in the data store. Their presence makes them available to machine-learning algorithms, which can discover spurious patterns in irrelevant data.
My earlier posts were an attempt to draw attention to this broader view of how data can mislead us. My view of data is to equate it to evidence. As evidence, the credibility of data can be subjected to the type of scrutiny encountered in courts of law. I argue that there are many more ways that doubt can be raised beyond the simple fact that sometimes a mistake or error may appear.
I see the topic of data governance presented as a solution to data quality. My lay understanding of governance is that it is an agreement imposed between parties in a data project to establish their obligations and duties in delivering and handling data. The construction of this agreement gives all parties the opportunity to scrutinize the full breadth of data quality issues before committing to it. Although I didn’t mention data governance explicitly in earlier posts, I did argue that this kind of up-front agreement is insufficient. Many of the problems, especially those concerning model-generated or model-rejected data, will emerge gradually over time as we learn more about the world (or as the world surprises our previous understanding). We need a way to continually fine-tune this governance agreement.
Another justification for data governance is to address the problem of the parties taking advantage of each other for private gain. A data provider may decide to save money by providing less frequent updates. A data consumer may decide to make money by reselling the provider’s data. Governance is also a contract to prevent such deliberate abuses.
Part of the motivation for data governance is the recognition that trust between people cannot be assumed. We need to establish a trust relationship with people before we can assume trust in their participation. Data governance is a contract that formalizes such an agreement of mutual trust.
All of the above addresses the trust issues involving the exchange of privately held information. In my experience with computer network management, this broadly describes the trust we have in collecting machine-generated data at a centralized network management system. The data is machine generated and closely held by that machine. We agree to allow this data to be shared based on established agreements about who can access it and for what purpose they will use it. The overall system is closed between the participating parties. As long as everyone involved in the contract honors their commitments, the data quality can be assured.
With big data projects, the data consists of observations that originate outside the bounds of data governance. Many big data projects focus on observations of individuals or groups of individuals. These individuals have not entered into any data governance agreement, and yet they are essential to the project. As one of the justifications for data governance admits, we cannot trust people until or unless we establish a trust relationship.
People will deceive.
To me, the particular theft crime of embezzlement stands apart from other forms of theft. This kind of crime does cause financial harm to others, but the embezzler accomplishes the act without anyone noticing. Often, their actions are ones they are authorized to do, or at least not physically prevented from doing. The crime lies in the intent behind these otherwise permitted actions.
Personally, I’m an extreme optimist who sees most people as trustworthy despite my experiences of evidence to the contrary. Those who deceive are a minority. As such an optimist, I have a low tolerance for inconvenience imposed on the majority of trustworthy people in an attempt to prevent the minority with criminal intentions from carrying out their plans. However, I’m sufficiently a pessimist to concede that there are people who have unfair intentions, that they have impressive capacities to carry out their plans, and that often their plans can be very effective in meeting their unfair goals.
My complaint about the popular hype around data is that its promotion appears to disregard the possibility that the subjects of observation may have unfair intentions and the cleverness to carry out their plans.
I recall a recent story of academic publishing groups having to retract previously published papers after someone showed that the papers were gibberish generated by a computer program using the words and word patterns found in the particular discipline. I don’t think this story got as much attention as it deserves. Someone was able to successfully publish nonsense papers at conferences that, according to the article, claimed all manuscripts were “reviewed for merits and contents”.
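The article doesn’t say how the generator worked, but the general trick of imitating the words and word patterns of a discipline is easy to demonstrate. Here is a minimal sketch, assuming a simple word-level Markov chain over a toy corpus (the real generator was surely more elaborate):

```python
import random

def build_chain(corpus):
    """Map each word to the list of words observed to follow it."""
    chain = {}
    words = corpus.split()
    for current, following in zip(words, words[1:]):
        chain.setdefault(current, []).append(following)
    return chain

def generate(chain, start, length=12, seed=0):
    """Walk the chain to emit a superficially fluent word sequence."""
    rng = random.Random(seed)
    word, out = start, [start]
    for _ in range(length - 1):
        choices = chain.get(word)
        if not choices:  # dead end: no observed successor
            break
        word = rng.choice(choices)
        out.append(word)
    return " ".join(out)

# Toy "discipline-specific" corpus (invented for illustration).
corpus = ("the algorithm converges on the data the model predicts "
          "the data validates the model the algorithm learns the model")
chain = build_chain(corpus)
print(generate(chain, "the"))
```

Every word and transition in the output was observed in real text, which is exactly why a skimming reviewer sees the expected vocabulary and patterns while the whole carries no meaning.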
I attempted to raise this concern in an earlier post where I described big data predictive analytics projects as analogous to the nascent computer and networking technologies of the 1980s, before we realized that some people do not play nice. The early 1980s euphoria was tempered by decades of pain as we repeatedly discovered the resourcefulness and effectiveness of people intent on taking personal advantage at the expense of the majority. We seem to never learn that some people don’t play nice, and those people can be very clever.
In the above manuscript scenario, I imagine there really were human reviewers of the papers, but they were so accustomed to the trustworthiness of their population that they skimmed the articles for little more than spelling or grammatical errors (errors effortlessly avoided by the computer that generated the text). The humans were effectively acting as algorithms, seeing the expected patterns needed to classify a paper as a valid addition to a conference.
As an aside, I appreciate that there is a need for some leniency in conference proceedings, especially international ones involving many languages and cultures. The goal of a conference is to invite a diversity of qualified participants to interact in person, and this goal requires extending some benefit of the doubt to the submitting parties. The problem is that sometimes that benefit of the doubt can be abused. Perhaps I do have a complaint about the response, which definitively withdrew the papers and probably imposed more rigorous review processes that will unfortunately exclude some potentially groundbreaking work because it falls too far outside the expected. In my opinion, a better response would have been to leave the papers alone and post a disclaimer acknowledging that some gibberish papers may get through the system. In fact, this could be an opportunity to turn the occasional mistake into a feature by offering prizes to participants who can spot the gibberish submissions.
My motivation for this post, however, came from a recent local news story alerting us that a new group of bank ATM skimmers is operating in our area. The article went on to advise caution when using ATMs, to be sure there is nothing suspicious-looking attached nearby. The image that came to my mind was of some crude metal box duct-taped next to the card reader. If I see something like that, I’ll be sure to find a different ATM. But then I searched for images of real skimmers. I was shocked at how well they are disguised. These are complete veneers that fit the ATM and perfectly replicate the appearance of the machine. The only signs of suspicion would be subtle seams at the distant edges. I don’t doubt I would be deceived by many of the designs.
While shocked and alarmed, I was also envious of the skills and resourcefulness of the perpetrators in building, and hiding, what must be an extensive supply chain to manufacture such exquisite replica veneers. Apparently there is big money in skimming cards and then copying the captured data to blank cards. With that at stake, there will be resourceful attempts to take advantage of it.
This is an outrageous crime, and police aggressively attempt to track down the perpetrators. However, I wonder exactly when the crime occurs. Historically, bank fraud occurs when bank funds are actually transferred. That act, if it occurs at all, happens long after the act of skimming. The production of replica veneers of ATM interfaces is not obviously a crime. Neither is carrying such a veneer around. Attaching the veneer to the private property of the ATM may be a crime, but it is a minor one of trespass, defacing property, or vandalism. The perpetrators get this far because there isn’t really anything to stop them. There are (or were) no warning signs forbidding taping something to an ATM face. As far as I know, the ATM face has no sensors that would lock up the machine if it sensed unauthorized attachments, and such sensors would be difficult to build for ATMs exposed to the weather.
The act of skimming ATM cards appears to be beyond the ability of the banking IT system to prevent. The IT system may be completely secure up to the edge of the ATM interface, but at that point the security protection ends. From there, we have to rely on the assumed trustworthiness of the public not to attach skimmers to the face of the ATM.
I look at these examples as warnings about trust in data that go beyond the topics discussed at the beginning of this post. We have to worry about whether the observations fed to the sensors are real observations. In the bank skimmer case, I can imagine an alternative attack in which the veneer rapidly enters three wrong PINs to lock the owner’s card, denying the owner both possession of the card and the ability to spend cash. Done repeatedly, this would exceed the ATM’s capacity to hold rejected cards, with some beneficial outcome for the perpetrator. The card lockout algorithm assumes the PIN was entered by the possessor of the card, but this is not necessarily true.
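The flawed assumption is easy to make concrete. A toy sketch of the typical three-strikes lockout rule (class and method names are my own invention, not any bank’s actual logic):

```python
MAX_ATTEMPTS = 3  # typical three-strikes rule

class CardSession:
    """Toy model of an ATM's per-card PIN lockout logic.

    The rule counts wrong attempts per card and implicitly assumes
    the person typing is the cardholder; a malicious veneer that
    auto-enters wrong PINs satisfies the rule just as well.
    """
    def __init__(self, correct_pin):
        self.correct_pin = correct_pin
        self.failed = 0
        self.locked = False

    def try_pin(self, entered_pin):
        if self.locked:
            return "card retained"
        if entered_pin == self.correct_pin:
            self.failed = 0
            return "access granted"
        self.failed += 1
        if self.failed >= MAX_ATTEMPTS:
            self.locked = True  # ATM swallows the card
            return "card retained"
        return "try again"

# A skimmer veneer, not the owner, triggers the lockout:
session = CardSession(correct_pin="1234")
for _ in range(3):
    session.try_pin("0000")   # attacker-injected wrong PINs
print(session.locked)  # the legitimate owner is now denied the card
```

Nothing in the rule distinguishes who pressed the keys, which is exactly the gap the hypothetical attack exploits.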
I read of increasingly sophisticated display technologies, such as the recent Google Glass and talk of a display built into a contact lens. These very promising wearable display technologies could also be used in malicious ways. Someone could build a replica lens-cover veneer for a camera that is actually a display, interposing computer-generated imagery to deceive a surveillance camera or trick its image-processing algorithms. In an earlier post I included a link to a presentation with a demonstration of an algorithm counting people in an image of a train platform. What if the imaging camera had a stealthily applied lens-cover display that occasionally showed a recording of an earlier image of a crowded or empty platform? In that demo, there was security everywhere between the sensor and the control room. However, the image itself is prone to deception.
In another earlier post I mentioned the case of automobile license plate readers. Someone could temporarily apply a vinyl covering over the license plate for a single trip to replace or obscure the plate information, applying the skin just before crossing the line of sight of a fixed reader on private property and removing it before entering public streets. Alternatively, someone could use a small device to interfere with a motion detector, causing the image to be taken at the wrong time for a good shot of the plate, or overwhelming the system with empty images when there was no traffic.
I do not need to know why someone would do such things. I only know that it seems readily possible. The same technology booms that are enabling the Internet of Things, with small, cheap devices of dense computational and memory capacity, can also enable deceptive objects. The deceivers have the advantage of deploying new concepts faster than the established systems can deploy countermeasures. Established systems carry the burden of adhering to data governance and its associated time-consuming development approval processes. The deceivers are not so burdened.
In hindsight it seems inevitable that ATM skimmers would appear. The ancient civilizations could have predicted it. Humanity always produces a sizable population of very effective deceivers. Yet, we seem to never really learn this lesson.
The current hype around big data and predictive analytics seems to assume deception either doesn’t exist or will be impossible. Here I am again optimistic about humans, but this time in the negative sense of being confident in the brilliance of human deceivers to find and exploit ways to fool established systems.
One of the frequent examples for exploiting big data analytics is the improvement of health care delivery. There are already early reports of great improvements when data is analyzed. My reaction is to point out that those studies were done quietly. The deceivers did not have an opportunity to act. Eventually it will be common knowledge that our data will be part of large data stores that machine-learning algorithms will use to come up with optimal recommendations. It will also be readily accessible knowledge that these algorithms work by identifying clusters or categories of individuals using more dimensions than a human mind can visualize. I don’t doubt people will find a way to game the algorithms, just like the computer-generated gibberish papers that got past the publication review committees.
A recent news story comes to mind about a young girl who needed a lung transplant but was too young to qualify for the more readily available adult lungs. This case was resolved controversially, but ultimately legitimately, with a court order. There was no deception involved. However, the case illustrates that organ transplantation is highly contested because demand is much larger (and often desperate) than the supply of suitable organs.
I can imagine a future scenario in which some adult needs a liver transplant. He would prefer to get one sooner so he can recover sooner, but it turns out that the algorithms place him in a large group that is considered lower priority. This group is told to expect a transplant in 1-2 years instead of 1-2 months.
Higher in priority are machine-learning-generated groups that have more urgent need but, more specifically, have smaller populations. It is preferable to allocate to a group with a small population so that nearly all of its members can get a transplant, thus avoiding the risk of discrimination lawsuits. The disappointed patient above may observe that the priority calculation includes group membership size. He reasons that if he can change his circumstances to place himself into a group smaller than his current one, he could benefit from secondary consideration precisely because the group size is so small.
He figures out (somehow) that one of the miscellaneous dimensions of his group includes recent use of electronic cigarettes. This is a data point available to the machine-learning algorithm even though (let’s assume) everyone agrees it is harmless. The point is that many people in his current group are former tobacco smokers who have since stopped; given their present health conditions, this group is especially reluctant to take up e-cigarettes. He decides to start the habit and informs his doctor (perhaps it is confirmed with a test for nicotine in the body). The next month, the health providers rerun their machine-learning programs, which reassign this patient to a smaller group whose needs are no more urgent than the larger group’s. However, due to a slight surplus of donor organs, the algorithm picks this group after the highest priorities are served, because the group is just small enough for the remaining number of organs available. The patient gets his transplant earlier.
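The mechanics of this hypothetical scenario can be sketched in a few lines. Everything here is invented for illustration (the group names, sizes, and the allocation rule are my assumptions, not any real transplant policy): the allocator serves urgent groups first, then spends any surplus on the smallest lower-priority group it can serve in full.

```python
def allocate(groups, organs):
    """Toy allocation over (name, size, urgent) tuples: serve urgent
    groups first, then use any surplus on the smallest non-urgent
    group that fits entirely (to avoid partially served groups)."""
    served, remaining = [], organs
    # Pass 1: urgent groups, most urgent first.
    for name, size, urgent in sorted(groups, key=lambda g: -g[2]):
        if urgent and size <= remaining:
            served.append(name)
            remaining -= size
    # Pass 2: spend the surplus on small non-urgent groups.
    for name, size, urgent in sorted(groups, key=lambda g: g[1]):
        if not urgent and name not in served and size <= remaining:
            served.append(name)
            remaining -= size
    return served

# Invented numbers: the patient starts in the large "ex-smokers"
# group; taking up e-cigarettes moves him to the tiny "vapers" group.
groups = [("critical", 40, 1), ("ex-smokers", 500, 0), ("vapers", 8, 0)]
print(allocate(groups, 50))  # the surplus of 10 covers only "vapers"
```

The "ex-smokers" group never fits the surplus, so a member who can relabel himself into a small enough group jumps the queue without his medical urgency changing at all.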
In this fictional scenario, I cannot fault the patient. He was doing the same thing the health providers were doing: exploiting big data analytics for private benefit. However, if this were a real case and the scheme were discovered, I suspect we would consider it criminal.
This post ran long to make a simple point. Data quality and governance only control the data within the system. When we extend data collection to measuring people, we need to be aware that people can be very resourceful deceivers once they figure out the new algorithms being used. For big data promoters, the evidence of past successes may be misleading because the deceivers have not yet begun to game the system. The deceivers will come later.