Using analytics to trespass

In an earlier post, I wrote that big data systems with analytic algorithms are vulnerable to malicious behavior in ways similar to what we experienced in the earlier era of networking and computer operating systems. This post describes an example that is analogous to trespassing into other people's news feeds on social networking sites.

In that earlier era, we had a lax approach to allowing software to access resources in order to foster faster adoption and a quicker return of benefits. We later regretted that when people took advantage of the opportunity to manipulate the system for their unfair advantage or to impose an unpleasant penalty on others. We have since tightened security for systems and networks with a much more cautious approach to allowing software access to computer or networking resources.

Data analytics is software that runs on computers and networks.   I assume these computers and networks are secure to state-of-the-art standards.   There is a low risk of injecting malicious software into the system handling and analyzing the data.  We can control everything from the point of the data input to the presentation of the analytic results.

By necessity, the data itself originates outside of the security perimeter.  The reason we collect data is to learn about the real world.   We want to make decisions that are relevant to the current conditions and can influence the world in ways that we find favorable.   We need to see the world as it actually exists.

We design our data algorithms and collections with the assumption that the observations are unnoticed.

Many of the trial cases or early successes of analytics benefited greatly from the fact that the subjects were unaware of the analysis or its objectives. We can, for example, assume that a crowd of people entering a retail store represents potential customers. In the post linked here, I described how some retailers are using analytics to schedule their sales associates, who may need to be called in on short notice if business picks up unexpectedly. The article linked in that post suggested that retailers are enjoying the benefits of more effective data-driven scheduling, accompanied by an agreement for staff to be available on short notice. The data vulnerability is illustrated when the crowd that shows up is organized as a flash mob: they overwhelm the sales staff for a time with no intention of purchasing anything. This would be an example of data hacking.

We can’t prevent this from happening.   We want to take advantage of a real burst of sales opportunities.   But once people are aware this is happening, they could organize a way to trick the algorithm into making a costly decision.

The obvious solution is to revise the data algorithms to distinguish legitimate crowds from false crowds. This requires rejecting (or quarantining) some data because it is unexpected. We determine the reasonableness of an observation by use of a model. For example, the model may reject a crowd that arrives too rapidly. The maliciously minded crowd could choose a strategy of arriving more slowly, but this would be more costly to the ones organizing the crowd. I have been calling this rejection of observations forbidden data. Forbidding certain otherwise clean observations robs us of the opportunity to see something valuable. The most valuable discoveries will be unexpected and appear suspicious.
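To make the idea concrete, here is a minimal sketch of such a model: a sliding-window filter that quarantines arrivals when the recent rate exceeds a plausibility threshold. The class name and the `max_per_minute` threshold are hypothetical; a real deployment would fit the threshold to historical traffic for the specific store.

```python
from collections import deque

class ArrivalFilter:
    """Quarantine ("forbid") arrival observations that exceed a rate
    the model considers plausible.  Purely illustrative."""

    def __init__(self, max_per_minute=30):
        self.max_per_minute = max_per_minute
        self.arrivals = deque()  # timestamps (seconds) of recent arrivals

    def observe(self, timestamp):
        """Record one arrival; return True if accepted, False if forbidden."""
        # Drop arrivals that fell outside the 60-second window.
        while self.arrivals and timestamp - self.arrivals[0] > 60:
            self.arrivals.popleft()
        self.arrivals.append(timestamp)
        # Reject the observation when the windowed rate is implausible.
        return len(self.arrivals) <= self.max_per_minute
```

Note the trade-off discussed above: a tour bus and a flash mob look identical to this filter, so any threshold strict enough to stop the mob also forbids the genuine windfall.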

I recall one time having lunch at a fast food place that normally expected the local office business that comes in at a reasonably staggered rate. On this day, a tour bus (or two) stopped and let out its passengers to get lunch. The crowd entered in a sudden stream that looked awfully unusual. This occurred long before data analytics were used to schedule support staff, but there was a strategically planned staffing level based on historical data about a normal business day rather than this sudden peak.

In my scenario of a just-in-time workforce, the algorithm monitoring arrivals may conclude that this sudden surge of customers is unrealistic (possibly some flash crowd of non-customers) and miss the opportunity of serving real customers. We want analytics based on real observations to take advantage of unexpected opportunities, so there is a natural constraint on how restrictive our data rejection algorithms can be. This inevitable laxity in accepting data is what provides the opportunity for manipulation.

When people recognize that there is a predictable automated response to a certain observed condition, some will attempt to take advantage of that for their own gain as I described in the post linked here.

I further argued (here) that the data itself is part of the algorithm. In that post, I described the use of a dynamic data structure (sometimes called a dictionary or hash) to perform what could instead have been implemented in if-then-else type code. Such algorithms based on data structures are dynamic and change depending on the data. As a result, they are similar to the earlier hacking vulnerabilities of computers and networks. The algorithm's performance can change with a simple change to the contents of the dictionary.
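A small sketch of what I mean, with made-up categories and responses: the same decision can be hard-coded in branches or driven by a dictionary, and in the second form the behavior changes when the data changes, with no code edit at all.

```python
# Branch-coded decision: behavior is fixed in the code.
def classify_if_else(visitor_type):
    if visitor_type == "regular":
        return "standard staffing"
    elif visitor_type == "tour_bus":
        return "call in extra staff"
    else:
        return "standard staffing"

# Equivalent decision expressed as a mutable dictionary:
# the "algorithm" now lives partly in the data.
RESPONSES = {
    "regular": "standard staffing",
    "tour_bus": "call in extra staff",
}

def classify_dict(visitor_type):
    return RESPONSES.get(visitor_type, "standard staffing")

# Changing the data changes the algorithm's behavior without touching code.
RESPONSES["flash_mob"] = "ignore surge"
```

Whoever can influence the contents of `RESPONSES` has, in effect, reprogrammed the system, which is why data injection resembles the older code-injection vulnerabilities.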

It is with this background that I wish to comment on this recent article on Wired. It describes a user choosing actions with the deliberate intent of exploring the analytic algorithms he knew were in place to populate his Facebook news feed. This is an example of what happens when people become aware that algorithms are employed. They will start to experiment with them. The means at their disposal is to manipulate the data fed to the algorithm. The article describes the dramatic change in his experience of the site after he liked a large number of pages. I want to focus on his observation in the penultimate paragraph:

The next morning, my friend Helena sent me a message. “My fb feed is literally full of articles you like, it’s kind of funny,” she says. “No friend stuff, just Honan likes.” I replied with a thumbs up. This continued throughout the experiment. When I posted a status update to Facebook just saying “I like you,” I heard from numerous people that my weirdo activity had been overrunning their feeds. “My newsfeed is 70 percent things Mat has liked,” noted my pal Heather.

His actions were directly impacting the experience of others using the same service. Apparently the impact was immediate. The algorithms for deciding what appears in a news feed or in recommendations include information learned from associations with others. In this case, the algorithms assume that a person will likely be interested in something that interests another person (or site) that a friend likes. This works fine until the unexpected happens. In this case, a friend of many people changed his behavior. This immediately caused a change in the site experience of everyone who had indicated they like this person.

This type of social-inference algorithm is common to most social networking type services.   The algorithms are based on the assumption that a person would welcome the same friends, contacts, or interests that their friends have.
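The flooding effect in the quoted passage follows directly from how such a social-inference step might rank content. A simplified, hypothetical sketch (the real feed algorithms are proprietary and far more elaborate): rank pages by how many of your friends like them, and one friend's sudden burst of likes dominates the feed.

```python
from collections import Counter

def build_feed(friends_likes, top_n=5):
    """Rank pages by how many friends like them.  A stand-in for the
    social-inference step described above, not Facebook's actual algorithm.

    friends_likes: dict mapping friend name -> list of liked pages.
    """
    scores = Counter()
    for friend, likes in friends_likes.items():
        for page in likes:
            scores[page] += 1
    return [page for page, _ in scores.most_common(top_n)]

# Illustrative data: one friend ("mat") suddenly likes 50 pages.
friends_likes = {
    "helena": ["news_site"],
    "heather": ["recipe_blog"],
    "mat": ["liked_page_%d" % i for i in range(50)],
}
```

With this data, a top-10 feed is mostly `mat`'s burst of likes, crowding out everything else, which matches the "70 percent things Mat has liked" observation in the article.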

One problem with the algorithm is the justification for using it in the first place. I recall Facebook itself starting as a social networking tool for keeping in touch with classmates at the same school. While the above algorithms are newer and were probably introduced long after the service became much more widespread, the choice to implement this model seems to assume that people have the tight-knit relationships most common on college campuses, especially those with large populations of on-campus residents. In general, the friends will probably be in the same major and taking the same courses from the same teachers. They also will probably share many interests within the bounds offered on campus. In that scenario, it is not unreasonable to assume that friends of friends could be friends, or that likes of friends could be attractive. But that is not realistic for the modern population of users, who often add friends they happen to meet in one-time situations such as travel and agree to keep in touch with. While the individuals may share something in common, it is likely their interests do not overlap much beyond what they already know. I don't see any justification for assuming that a friend of a recently met stranger could possibly be someone I know. There is no basis to assume a friend who shares an interest in what I do for work will also share my interests in sports or politics.

One of the problems of analytic algorithms is that they are so appealing to the developers of the service.  The implementation makes sense in terms of how they see the world.   The implementation may even enjoy some success in terms of expanding the audience for a particular topic or a connection and thus increase revenue (somehow).

In effect, the algorithm is automating the friendly gesture of introducing someone to someone else without anyone actually granting permission to do so. The algorithm is making a social gesture happen that otherwise may not occur. Instead of leaving this option available to people, the algorithm makes it happen automatically.

I imagine a real-life scenario of walking down a street, meeting someone who looks friendly, and saying hi. We exchange names, and at that very second he invites me into his home, introduces me to everyone in his family and on the block, shows me his hobbies, and turns on his favorite music. While this is neighborly, it is an unusually fast way to develop a relationship. Building lasting relationships usually takes time and some back-and-forth negotiation: exchanging a little information at a time. The algorithm takes the fire-hose approach of effectively saying "congratulations on adding this person to your friends, and here are all his friends and interests that we recommend for you."

(As an aside, it occurred to me that this may cause cultural conflict as well. Some cultures are especially generous and eager to invite a new friend into a network of relationships and interests. Those cultures may be insulted if that invitation is declined. Since most of the popular social media applications are based on US culture but are used worldwide, I wonder if the worldwide use of social networking may be causing conflicts through misunderstanding of what is expected when someone accepts an invitation to be a friend. That's a different topic, but it is another interesting side effect of automating the introduction-to-friends process.)

The second oversight of the developers may be to assume that even if this is a short-term nuisance, it would quickly stabilize after the user declines introductions to the hundreds of friends and interests of the new friend. People tend to be stable in their networks of friends and interests. In real life, meeting new people takes a lot of time. In social networking, it can occur more quickly, but it still tends to be a slow and steady process of a few additions per week or month.

The article linked above presents an example where someone does not behave normally and suddenly introduces a lot of new recommendations to all of his friends. It is enough that they notice it. This surprised the article's author. He didn't expect that to happen.

However, another person may deliberately use this trick to manipulate other feeds. Such a person could spend months or years building a large list of friends while rarely adding new interests, and only ones that are not controversial. Then, on a particular day, this user would suddenly add a lot of interests that support a particular point of view. All of the friends would then be bombarded with a whole list of recommendations to check out these new interests of their friend.

Alternatively, he may target a group of friends and disrupt their network by suddenly adding a bunch of interests, thus flooding all of their news feeds with his interests and crowding out all of the other updates.

Recently, I noticed some marketing campaigns that insist on people liking their Facebook page. One of the pitches was to help the business reach 100,000 likes by the end of the week. It may be an agreeable business, so there is an eagerness to show support. What may not be recognized is that the social networking site is now automatically recommending this particular business to all of one's friends or contacts. While one may want to show support for a business, one may hesitate to recommend it to friends, or certainly to all of them. For example, perhaps a valued friend happens to be very much opposed to your political views, so you keep those views quiet. Liking a political page will present to that friend a recommendation to like it too. It could lose a friend. The algorithm doesn't know this. Everyone gets a recommendation for whatever their friend likes.

The marketing campaign effectively hacked the algorithm to get their product advertised to all of the friends or acquaintances of those who happened to privately approve of the company or product.    This hack was not done with some type of software code.   It was done by introducing data through a campaign that challenged its fans to increase the number of likes of their page in a certain period of time.

The predictive algorithm is vulnerable to manipulation by third parties unrelated to the service or to the intents of the users. I see an analogy with the early (and ongoing) problem of secretly installed software on computers that can be turned on to flood a network at a particular time. Only here, the vulnerability is exploited through a marketing campaign. The damage or harm is analogous to a trespass and defacement, where someone enters a property without permission and paints graffiti on the walls.

This may be a relatively harmless example with at worst a nuisance result. But in general, predictive analytics involve identifying clusters sharing something in common. The goal is to target that group for some special treatment based on something that is predicted to be relevant to the whole cluster. That predicted relevance comes from analyzing the data. If people can learn or guess the algorithm, they can get the algorithm to apply something inappropriate to a targeted cluster. Marketers found sufficient motivation to find a way to manipulate social-networking news feeds, promoting "like" campaigns so their pages are advertised to friends of fans. When stakes are higher, such as in allocating valuable resources (such as healthcare), there will be tremendous motivation for third parties to find a way to inject data that gets the algorithms to provide the allocations they prefer.

