Datum Governance: Distinguishing bots from the real world

Immediately after publishing my last post, I was informed that it had 4 page views.   It may be no surprise to a casual reader of my site that I don’t have a large user base waiting anxiously for my next morsel of thought.   This happens every time I publish a new post.   I’m pretty sure it is some kind of bot skimming my post for some purpose other than reaching human eyeballs.
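The timing itself is the giveaway: a human has to discover a post before viewing it, while a bot polling a feed can arrive within seconds. As a rough sketch (the 30-second threshold and the view timestamps are hypothetical assumptions, not a real analytics rule), a simple heuristic might look like:

```python
# Sketch: flag page views that arrive implausibly soon after publication.
# Timestamps are seconds elapsed since the post went live (hypothetical data).
def split_views(view_times, human_delay=30):
    """Views within `human_delay` seconds of publishing are likely bots:
    a human must first discover the post, which takes time."""
    likely_bots = [t for t in view_times if t < human_delay]
    likely_humans = [t for t in view_times if t >= human_delay]
    return likely_bots, likely_humans

views = [2, 3, 5, 9, 1800, 7200, 86400]  # four instant hits, three later ones
bots, humans = split_views(views)
print(len(bots), len(humans))  # 4 instant views vs 3 plausible human visits
```

Of course, as bots grow more sophisticated they can simply add a human-like delay, which is exactly the arms race this post is about.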

I have had a similar experience on Twitter, where a tweet receives instant likes or retweets that seem unexpected given my small audience.  Twitter is a large space designed for distributing content to unexpected readers.   The responses may be due to my tweet coincidentally using some word that someone happened to be searching for in the brief window after my tweet, before it would be lost far down the search list.   Tweeting as a concept is all about rapid speech for a very ephemeral audience.   It is very unlikely that a stranger would be exposed to my tweets long enough to retweet or favorite them.   Despite that, the reason I use Twitter is for at least a lottery’s chance of having that opportunity.   In some cases, though, I can tell that the interaction is with a bot.  In one case the user name honestly disclosed its botness, and its owner turned out to be a creator of such bots.

The problem with bots is that they are evolving to be harder to distinguish from real people.   People may begin to make decisions thinking that these cyber-connections are connections to real people instead of autonomous software.

A danger already lurking in social media is the concern and pain of losing followers, friends, or network connections as a result of something one does online.   When the connection counts decline after writing a comment, sending a tweet, or sharing an article, there is a pain similar to recognizing that someone may have been offended so severely as to want to break off all contact.   Increasingly, these events are due to bots leaving as part of some algorithmic strategy that found the continued contact to lower its score somehow.  With modern practices of accruing large connection counts, it is impractical to identify the entity that has left or to understand why.

On the personal level of a human interacting with social networks, there is an emotional vulnerability that can be exploited by bots.   Many bots are easy to recognize as marketing automatons; after all, their goal is to sell something.  However, it is easy to disguise an automaton as a human.  The profile can have a human photo and a credible biography.  Human-like activity can be automated, especially in the form of short comments, sharing articles of a consistent nature, or occasionally liking or forwarding a comment you may write.  This can establish a connection that carries with it a fear of the risk of breaking that connection.

Human-mimicking bots in social media can be deliberately designed to pressure people to conform to a certain point of view.  The bot’s sharing of articles and short comments can demonstrate the acceptable point of view and the unacceptability of alternatives.  The risk of losing the connection can provide a powerful incentive to restrain one’s participation to be consistent with the feeds coming from followed or befriended bots.

This is an example of the deception that can occur in big data analytics studying social media feeds.   The past success of social media analytics benefited from an unbiased population.   People used the media without awareness that they were being studied, and they were not being manipulated by deliberately deceptive automated friends or followers.   The early data was unbiased, and the early successes were due in large part to this lack of bias.

Inevitably, future data will be more biased.  There is already some behavior modification as a result of better awareness of being studied.  People know their content is being skimmed for analysis.  They are more cautious about what they present about themselves and more inclined to tighten privacy settings to restrict access.

I suspect people are also being influenced by automated friends or followers, especially those who strive to have large followings.   The larger the following, the more careful the person will be to retain it, either through personal consistency (a personal brand) or through conforming to the cues in messages coming from the followers.  Large followings give the impression of celebrity, and this in turn will cause a person to act like a celebrity, being careful not to stray too far from the established reputation.   It is a setup for a trap, but one that the celebrity-minded find comfortable.

Analytics of social media feeds for market research or social research will be subject to more biased data as bots begin to influence behaviors.   In particular, an adversarial team may deliberately introduce bots to confound those analytic projects by carefully designing bot manipulations of real-people data, causing the analytic project to fail to find positive results or to make a costly mistake about the population.

The present-day enthusiasm for big data analytics is based on the power of analytic algorithms and computing power to process large data sets.   This enthusiasm creates high confidence in the potential value of the analytics, bolstered by widely publicized success stories of early projects working with less biased data.   This confidence will continue into the future until there is a well-publicized failure.

The threat posed by deliberately deceptive bots applies to all projects employing big data analytics.  Present-day analytics has vulnerabilities very similar to the cyber vulnerabilities we saw exploited throughout the earlier decades of computers and networks.   Data analytics is like an operating system whose vulnerability is deceptive data instead of malicious software code.

I have written several posts on data issues concerning health care.  A recent post described how deception may be used to recover an identification from de-identified clinical data.   That post suggested a deception in the form of hiding identifying information within the measurements themselves, such as a recognizable sequence of physiological changes in the recorded measurements.   That deception is analogous to the code-hacking technique of steganography.

There are other risks of deception to health care data systems.   By analogy to the above scenario of social media bots masquerading as humans, there can be bots mimicking patients that overwhelm telemedicine centers or specialists working solely from diagnostic results.   Such attacks would be analogous to distributed denial-of-service (DDoS) attacks, where bots consume all available resources (at a minimum, the specialist’s time) so that real patients cannot be served.   As with historic server DDoS attacks, the timing could be planned to coincide with a period of high demand, such as a large flu outbreak.  The attack could have measurable and tragic consequences.
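The crowding-out effect is easy to model. Here is a minimal sketch, assuming a specialist who can see a fixed number of patients per day from a first-come, first-served queue (all names and numbers are hypothetical illustrations, not real telemedicine parameters):

```python
# Sketch: bot "patients" exhausting a specialist's limited daily capacity.
# First-come, first-served: every bot consultation displaces a real patient.
def patients_served(queue, capacity):
    """Return how many real patients get seen when the queue mixes
    real patients ('real') and bot-generated ones ('bot')."""
    seen = queue[:capacity]  # the specialist works through the queue in order
    return sum(1 for p in seen if p == "real")

quiet_day = ["real"] * 10
attack_day = ["bot"] * 15 + ["real"] * 10  # bots flood in first, e.g. at flu-season peak

print(patients_served(quiet_day, capacity=10))   # all 10 real patients seen
print(patients_served(attack_day, capacity=10))  # 0 real patients seen
```

The point of the sketch is that the attacker does not need to break anything: the fake demand alone, timed to a period of real high demand, is enough to deny service.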

Another form of deception could be to generate mass hysteria through social media bots pretending to be humans making first-hand reports of health conditions, whether from epidemics or from contamination or hazards of commercial products.   False or unverifiable reports of hazardous products have been around for a while, but with increasingly sophisticated bots mimicking real humans, future reports could be more effective at causing mass panic.

I think we experienced something close to mass hysteria last year during the peak of the Ebola crisis, when cases started to appear in the US.   Even with the very few cases in the country, there was great concern about degrees of contact with infected persons.   The public panic could have been far worse if there had been deliberate deception: false but believable first-hand stories from individual bots reporting Ebola symptoms and where they had been before they went to the hospital.  Multiple such reports arranged at the right time could cause major disruptions until health officials could reassure the population that there was no problem.   Such a manufactured panic would also divert resources away from more urgent needs.  For example, health officials might be diverted to initiate labor-intensive contact tracing of fictitious cases.

Although I introduced the term datum governance only recently, it is really just another name for a theme I have been discussing throughout this blog site: scrutiny of the data itself as opposed to the technology for handling the data.  Assume that all of the data infrastructure is secured and operated competently.  There remains the problem of deliberately deceptive data entering that infrastructure.   Deceptive data is not a form of dirty or corrupt data; it is clean data of actual observations.   The deception is that this data does not capture the real world.   An attacker can deliberately coordinate deceptive data to mislead the analytics to suit the attacker’s own purposes.

On this blog, I have expressed my skepticism about the ability of data to inform me about what is going on in the real world.   I feel that the bulk of the labor for the data clerk (aka data scientist) involves continuously challenging the data to verify it captures the real world.  Datum governance specifically combats deliberate deception.  Deception does not get enough attention in big data because we assume the criminal’s contribution will be insignificant.  With the reality of programmed human-mimicking bots that can influence public behavior, there is reason to worry that bots could impact big data much as earlier hackers exploited computers and networks with viruses and worms.

I think deception may already be having an impact on our current democracy through the large investment in frequent polling of public opinion on current issues.   The first problem is a subtle, accidental deception due to a change in the nature of the information from polls.   Earlier, when the polling practice was still young and results were not widely disseminated, polls captured opinions that were not influenced by prior polls.  Within their confidence intervals, the polls were measuring actual isolated opinions of individuals.   Lately, polling has become so frequent and so widely publicized that a current poll occurs within recent memory of reports of an earlier, very similar poll.

The earlier polling information influences the opinions of people in the current poll.

It is a natural tendency of people to want to hold (or at least express) opinions that conform to the majority view.   People generally prefer to keep the peace so they can concentrate on their private lives without causing controversy.   They will want to give the same answers they think everyone else is giving, especially for topics that have limited relevance to their daily lives.   A recent example is the polling on whether wedding businesses should be allowed to decline their services to same-sex weddings based on their religious concerns:

Public opinion has shifted on the issue since last fall

The opinion changed so quickly that it caught the Indiana state government by surprise when it tried to pass a religious freedom law modeled closely on laws passed by many other states.   It is highly suspicious to me that public opinion can change this fast, especially concerning something involving religion and the long-established institution of marriage.   I suspect the later polls were capturing the influence of earlier polls that informed the public of a changing majority, with many wanting to be on the side of the majority to avoid making this into a bigger problem.   The seekers of the freedom to decline services are a small minority, so the issue does not seem worth the bother of making it a major one; most people see no benefit in taking the side of such a small minority.

My point here is the remarkable speed of the change in public opinion.  I don’t think this change would have happened in an earlier era when polls were less frequent and less publicized.   I suspect the polls are influencing the public to quickly settle the issue and get it out of the news.

As a data scientist, I find this worrisome.  The point of data analysis (such as polling) is to observe the real world.   The growing sense that the debate is settled is most likely a deception by data from a population that merely wants to go along to get along.  It is more likely that the opinions have not changed that much.  When it appears safe for people to express their real opinions, the debate may be exposed as not settled after all.   When that happens, we will face the serious problem of a backlash.

A dramatic change in how the government treats a long-lasting institution like marriage should only be made with certainty that the change in public opinion is real and permanent.   Making such a change on a deception can set us up for political disaster in the future.   The current events of other cultures’ rapid changes in the treatment of gay people (or people accused of being gay) should make us very cautious about our confidence in our culture’s tolerance for marriage equality.

The above example of polling is an accidental deception.   The topic attracted a lot of media attention that in turn educated the population about the current popular opinion.

The same mechanism is available for deliberate deception.   I think this has already happened in recent government elections, where frequent polls had the effect of elevating the priority of issues by publicizing the results.   The most recent (2012) presidential election’s theme of a war on women seemed to be very effective for the election, yet it was based on an issue that would otherwise be a very low priority compared to others facing the country.   If this was a deliberately deceptive use of polling, it at least had the advantage of being controlled by a major political party.

It would be far more troubling if this deliberate use of polling could be exploited by a very small fringe party.  Imagine if the successful war-on-women issue had benefited neither of the two major parties but instead some extremist party.   The votes won on this low-priority issue could be used to bring into power a new party with an agenda that most of the population would not welcome.

The mechanism for influencing polling is the perception that the majority of the population feels a certain way.   In the past, this tool was effective only when used by a party with significant popularity.   With the recent popularity of online social networking and sophisticated human-mimicking bots, this manipulation may become available to minor fringe parties.   As described above, the bots could create fake friendship connections and advertise the preferred viewpoints required to maintain those connections.   People would subconsciously adapt their thinking to avoid losing their online friends and ultimately begin answering polls with consistent answers.   The polls would then report growing popularity for a particular issue, encouraging even more people to conform.   Eventually, the issue of the minor fringe party may come to dominate the debate and perhaps affect the outcome of an election.
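This feedback loop can be sketched as a toy simulation: each round, a publicized poll reports the current majority, and a small fraction of people drift toward whichever side the poll reports as winning. The starting share, the number of rounds, and the 10% conformity rate are all hypothetical assumptions chosen only to illustrate the dynamic:

```python
# Sketch: publicized polls amplifying a small initial edge into an apparent majority.
def run_polls(support, rounds, conformity=0.10):
    """`support` is the fraction favoring the fringe position.
    After each publicized poll, `conformity` of the losing side
    drifts toward the reported majority."""
    history = [support]
    for _ in range(rounds):
        if support > 0.5:   # poll reports the fringe view as the majority
            support += conformity * (1 - support)
        else:               # poll reports it as the minority
            support -= conformity * support
        history.append(round(support, 3))
    return history

# Bots fake just enough engagement to push apparent support past 50%,
# and the conformity feedback does the rest, ratcheting it higher each poll.
print(run_polls(0.52, rounds=10))
# A position starting just below 50% decays toward obscurity instead.
print(run_polls(0.48, rounds=10))
```

The instability around the 50% line is the key feature: a deception only needs to fake a small nudge across the threshold, and the genuine conformity of real respondents amplifies it from there.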

Data deception is a concern for automated decision making based on data analytics (such as in my hypothetical dedomenocracy).   I think it is already a concern for our current democracy.  I fear the current enthusiasm for data technologies because I do not see much appreciation of the possibility of deception.   There is huge confidence in the combined power of large amounts of data and sophisticated statistical tools (such as machine learning).   Missing from our consideration is how well the data actually captures the real world.  The data is not necessarily an honest representation of what is happening; it may very well include deliberate deception.

Deception in data is like the human-mimicking bots appearing on social networking sites.   Intrinsic to the data technology is the ability to create such bots.  The increasing sophistication of the bots makes the deception very difficult to detect.   Even if we can detect a deception, there is no easy way to remove it, because deception replaces true-to-nature observations instead of merely contaminating a recoverable truth.

I worry about deception in data because the consequences could be huge.   In a government context, they could include restricting people’s rights (as in the religious freedom debate), harassing individuals (as in the recent UVa rape scandal), or causing widespread rioting.  Despite abundant examples in recent history that can be attributed to deception in data (possibly deliberate deception as well), there is little discussion of the problems of deception in our increasingly data-driven world.

