When I started this blog, I spent many posts describing how I approached my work on a large data project. I attempted to defend the labor-intensive aspects of data science that run contrary to the modern enthusiasm for ever-faster processing and more automation. Although data science is claimed to be a hot field for employment, the opportunities are mostly in tackling ever-bigger data sets at ever-faster speeds with less need for human input.
Recently, I watched some presentations about working with new technologies. Invariably, the presenters excitedly showed how quickly they could find some interesting piece of information based on an obvious distinction in the data. The presenter then proposes an explanatory story, always on a generally familiar topic so that most people can follow along. Given this proposed story, the presenter then pulls up some data to support the thesis by affirming the consequent. Although this is no proof that the story is correct, the presentation invariably gives the impression that no further proof is needed.
With the modern speed of data retrieval, analysis, and visualization, we may be encountering a new form of the appeal-to-authority fallacy, where the authority comes from the speed at which we can present affirming data for our theses. Assuming that human behavior is a product of evolution, there has not been enough time to adapt to the new reality of nearly instant affirmation of some consequent. Historically, we recognized a pattern that affirming data can be trusted if it arrives quickly: before modern data technologies, the speed of finding affirming data indicated that such data was abundant around us, which is why it didn't take long to find. That mode of thinking is no longer valid. Instant access to a wide variety of data makes it possible to find affirming data very quickly for almost any thesis. It will take a few generations for evolution to catch up and teach us not to trust speed of affirmation as proof of a hypothesis.
In my earlier posts, I also described my observations from watching a continuous stream of data over many years, where previously accepted explanations eventually stopped being valid. Working with streaming data over the long term changed my attitude about causality and precedent. Previously proven stories will eventually be proven incorrect. Under an older notion of causality, if some old theory is proven wrong, then everything in the past based on that theory was wrong. The problem with that logic is that, in the past, the old theory was effective at making decisions at the time. The same theory is no longer workable because something in the world has changed, so the present data is not as helpful as the past data once was. This is my argument for a labor-intensive data science that scrutinizes even highly trusted data out of suspicion that the world can change at any moment.
Back to the example of the technology demonstrations: the quickly discovered story from the data may actually be true. For that time, the story is true, and its immediate affirmation was sufficient to establish it. My concern is that the same story will not be true forever. The world will change. This is especially true when the data involves human behavior. People will begin to behave differently as they become more aware of the data trails they leave behind. As more people gain access to data, population behaviors may begin to change at a faster pace, so that even common sense will become obsolete.
My fear about the current enthusiasm for data science is that soon a prediction based on common-sense story-telling with quick affirmation will turn out to be catastrophically wrong. Beyond the catastrophe itself, I fear a backlash against all things data. Data is evidence. A popular backlash that rejects evidence in arguments will result in a scary, irrational world. That is my fear of the over-confidence of trusting data without the time-tested but labor-intensive scrutiny of the data.
My blogging involves abstract discussions that, as I have already admitted, are primarily me talking to myself. I was making references to my private experiences; most readers will not even notice I'm making such a reference. I was sufficiently satisfied to be blogging. I liked writing down my thoughts and conversing with myself.
You may notice a recent gap in the blogging history. I redirected my attention to learning instead of writing. My goal was to come up with some idea that would make it worthwhile to invest in some cloud-computing time. Since I'm currently not working, I'm being very cautious about how I spend my money. Even though cloud computing is affordable, I want to make the most of it by having something immediately interesting to work on.
I found some realistic data at Capital Bikeshare, a local bike-rental system. Around here, the system is very visible, with many bike stations and many people riding the distinctively designed bikes, so I have some familiarity with what the traffic would be like. I imagined that the bikes are ideal for what I call last-mile transit from the metro system to a final destination. The distribution of the bike stations appeared to match this idea of point-to-point transport from the metro system to a place of interest. The bikes are heavy and not very fast, so I expected the trips to be short and mostly during daylight or business hours. The system supports both casual use and paid memberships (labeled "casual" or "registered" users in the data). It doesn't take many trips to justify buying a membership, so my guess was that there would be little difference in usage between the two populations. The data identifies whether the rider is registered or not; otherwise the data is anonymous.
I found the data appealing because it is similar to the data I had worked on previously. Each record describes a particular trip with specific starting and ending points in time and space. The trips occur at any time, so they do not line up in convenient slots for aggregation at intervals of less than one day (the traffic pretty much stops in the early morning hours). I wanted a way to explore the data during the busy part of the day.
Similar to my last project, I approached this data by subdividing the trips into globally consistent time intervals. I eventually measured the bikes in motion (not docked at a station) in each 10-minute interval of each hour. I wanted a finer resolution, but I found that even 10-minute quantization resulted in about 800,000 records. That is small enough to process on my laptop and my non-pro version of Excel.
My primary motivation for working with the data was to practice with technologies I had not used before. I processed the data with Python scripts when previously I had used Perl. I’m still fond of Perl, and I had stubbornly stuck with it long after others tried to convince me to move to Python. Nonetheless, I’m comfortable with Python. It reminds me of what I liked about Fortran (another language I loved longer than I should have).
I also wanted an opportunity to play with Excel's Power Query and to play more with SQL Server's Analysis Services. In contrast, my work experience was almost exclusively with custom scripts and web reports that presented my data. I had a steady stream of new tasks to tackle, and despite the increasing obsolescence of my toolbox I was able to keep up with the workload. I would not want to do any new project the same way.
So my first task was to process the data to re-aggregate the trips into 10-minute time bins. My goal was to run arbitrary summaries on this one-time global aggregation. In order to get the counts to work out right for any future summary, I interpolated the bikes over the time bins while the bike was between stations. Thus, for a bike in use from :05 to :25, I divided the bike into 3 pieces: 1/4 of a bike for :05-:10, 1/2 of a bike for :10-:20, and 1/4 for :20-:25. Any 10-minute interval will contain some fraction of bikes, but the goal was to get the right number of bikes when I added up all trips between two stations.
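That interpolation can be sketched in a few lines of Python. This is a minimal illustration I wrote for this post, not my actual processing script; the function name and the bin-flooring logic are my own:

```python
from datetime import datetime, timedelta

BIN_MINUTES = 10

def bin_fractions(start, end, bin_minutes=BIN_MINUTES):
    """Split one trip into (bin_start, fraction_of_trip) pieces,
    one piece per time bin the trip overlaps."""
    bin_len = timedelta(minutes=bin_minutes)
    # floor the start time down to its bin boundary
    floored = start.replace(minute=start.minute - start.minute % bin_minutes,
                            second=0, microsecond=0)
    total = (end - start).total_seconds()
    pieces = []
    cursor = floored
    while cursor < end:
        lo = max(cursor, start)            # clip the bin to the trip
        hi = min(cursor + bin_len, end)
        pieces.append((cursor, (hi - lo).total_seconds() / total))
        cursor += bin_len
    return pieces

# A trip from :05 to :25 splits into 1/4, 1/2, 1/4 across three bins,
# and the fractions always sum to one whole bike.
trip = bin_fractions(datetime(2015, 1, 5, 8, 5), datetime(2015, 1, 5, 8, 25))
```

Summing the fractions per bin across all trips between a station pair then yields the right bike counts for any later aggregation.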
My first discovery was that a lot of trips lasted longer than 10 minutes. My assumption had been that the bikes would primarily be in use just for station-to-station transit. Many bikes were checked out for multiple hours, and one bike was checked out for 59 days. I found that one because its duration in milliseconds overflowed a 32-bit signed integer (SQL Server's int column). Clearly these bikes were not in continuous use for that entire time, and there is no way to know what happened between the stations.
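The overflow arithmetic checks out: a signed 32-bit millisecond counter tops out at just under 25 days, well short of 59. A quick sanity check:

```python
MS_PER_DAY = 24 * 60 * 60 * 1000   # 86,400,000 ms per day
INT32_MAX = 2**31 - 1              # upper bound of SQL Server's int

# A signed 32-bit millisecond duration overflows after about 24.86 days,
# so a 59-day checkout cannot fit.
overflow_days = INT32_MAX / MS_PER_DAY
fits = 59 * MS_PER_DAY <= INT32_MAX
```

The fix is simply to store durations in a 64-bit column (SQL Server's bigint) or in seconds.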
This scenario was directly analogous to the data I used previously. In my last job I had access to far more data, so a simplistic approach of assuming constant usage over the entire long interval worked well: there would be many other overlapping trips to add for aggregated data. As in my last project, I am not interested in individual bikes but in the entire population of bikes moving between stations. However, there are far fewer data points in this data set.
To continue the project of practicing with new technologies, I made a quick decision to divide the trips into non-stop trips and trips with stops. If a trip was longer than a certain time, I assumed it involved some kind of stop away from a bike station. Again, my initial assumption was that most trips were short, with a deliberate destination of the terminal station. This would argue for defining a non-stop trip as one lasting less than 30 minutes; otherwise the trip must have had a stop. Even better would be to compute the distance between the stations and assume a stop if the average speed was less than, say, 3 mph. That is probably a realistic assumption, but the data was showing too many trips with stops.
I began to imagine stories for what is going on. This is a town with a lot of tourism and many long, relatively flat bike trails. Some people may be taking long rides on the bikes: tourists riding by the various landmarks, and locals out for a leisurely ride. Some people were obviously using the bikes for errands that would involve stopping at a store (or perhaps visiting friends or family) where the bike would be parked away from a station. The data offers no insight into what might be going on with these long rides. To get a reasonable mix between non-stop and with-stop trips, I extended the threshold to 3 hours. If the check-out time was shorter than 3 hours, I assumed a non-stop trip between the stations; my goal was to observe how many bikes were in use at any time, not to measure the speed of the bikes. If the trip was longer than 3 hours, I assumed 2 trips: the first 1.5 hours from the start and the last 1.5 hours to the end, with the bike parked at some unknown location in the middle. For any particular case this is probably not accurate, but when adding up all of the trips over a 3-month period, the numbers should work out better.
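The 3-hour rule is easy to state as code. Again a sketch for this post rather than my actual script, with the labels ("non-stop", "begin", "end") matching the categories used later:

```python
from datetime import datetime, timedelta

STOP_THRESHOLD = timedelta(hours=3)
SEGMENT = timedelta(hours=1, minutes=30)

def split_trip(start, end):
    """Apply the 3-hour rule: a checkout of 3 hours or less is one
    non-stop trip; a longer checkout becomes a 1.5-hour 'begin' leg
    and a 1.5-hour 'end' leg, with the bike assumed parked in between."""
    if end - start <= STOP_THRESHOLD:
        return [("non-stop", start, end)]
    return [("begin", start, start + SEGMENT),
            ("end", end - SEGMENT, end)]

# A 5-hour checkout becomes two 1.5-hour legs with a 2-hour parked gap.
legs = split_trip(datetime(2015, 1, 5, 9, 0), datetime(2015, 1, 5, 14, 0))
```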
The data also included bike station locations (latitude and longitude) that I used to compute the distance between stations. I could have used this information to compute a different non-stop threshold based on average bike speed, but I kept the algorithm simple. Instead, I wanted to confirm that the trips were mostly the short distances characteristic of last-mile transport to/from mass transit.
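For station-to-station distances at these scales, the standard haversine great-circle formula is more than accurate enough. A self-contained version (the source doesn't say which formula I used; this is one reasonable choice):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MI = 3958.8  # mean Earth radius in miles

def station_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two stations (haversine)."""
    p1, p2 = radians(lat1), radians(lat2)
    dlat, dlon = p2 - p1, radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(p1) * cos(p2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MI * asin(sqrt(a))

# 0.1 degree of latitude is roughly 6.9 miles, a plausible long commute.
d = station_distance(38.88, -77.10, 38.98, -77.09)
```

This is the straight-line distance, not the riding distance, so the implied speeds are lower bounds.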
Using Excel, I built the following summary charts. There are two populations of bike users: those who pay with credit cards at the station and those who have subscriptions and use a provided key. These users are labeled "casual" and "registered" in the data.
The first chart, for registered users, more or less confirmed my guess: most of the rides were within 2 miles, with a generally consistent duration for the distance involved. This chart combines all of the bike station data over the entire metropolitan area over a 3-month period, so it averages a variety of road and traffic conditions. The results are what I would expect. The series legend is distorted by some very long trips. The 4 main bands in this chart are 10, 20, 30, and 40 minutes (on average about 6 mph).
In contrast, the same chart for casual users was similar in terms of distances but much more varied in average speeds. Many rides end where they started, and even those that end in different locations sometimes take an hour or longer to go one mile. These are not point-to-point trips.
The first of two surprises was the relative size of the populations: registered users accounted for 10x the number of trips of casual users. That's impressive considering this data is from Jan-Mar 2015, winter months when bike commuting would not be as popular. The second surprise was the long tails for trip duration and distance traveled. Many people check out the bikes for long intervals and ride decent distances considering the design of the bike.
To explore this data further in Excel, I divided the distances into bigger buckets to distinguish different usages based on time of day and distance traveled. For the bike station at my local metro stop, the pattern confirmed my casual observations that the most common trips were commutes between nearby residences and the metro stop ("A" in "A->B").
In contrast, the pattern looks different in other parts of town. For example, the Bethesda metro station is comparable to Ballston, and yet the bikes are not used as much for getting to/from the metro. I'm not as familiar with that metro stop and its surroundings, but since my impression is that it is comparable to Ballston, I would have expected a similar interest in using bikes to get to/from the station. If such commuting is occurring, the trips are longer than 5 miles. Most of the trips there are 1/4-mile trips or round trips (assuming the station lat/long data are accurate). In any case, the service is more popular here in Ballston.
My goal for analysis was to get this data into SQL Server Analysis Services to explore the entire set of stations. With 350 stations, I wanted a way to aggregate the stations into geographic clusters arranged hierarchically, so I can see a summary with just a half-dozen major clusters and drill into smaller clusters until I reach individual stations. To get this, I implemented a simple cluster-finding algorithm (unsupervised machine learning) that separated the stations, and repeated it 3 times to get clusters of clusters of clusters of stations. The resulting distribution shown below clearly needs more refinement, but it suits my purposes for continuing my little exercise.
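The post doesn't name the algorithm, so as an illustration here is a minimal Lloyd's k-means over station coordinates; running it again within each resulting cluster would produce the three-level hierarchy described above:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's k-means on (lat, lon) tuples; returns one
    cluster label per point. Illustrative only: it treats lat/lon as
    planar coordinates and ignores barriers like rivers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)         # seed centers from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest center by squared distance
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: (p[0] - centers[c][0]) ** 2
                                        + (p[1] - centers[c][1]) ** 2)
        # update step: move each center to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return labels

# Two well-separated groups of stations fall into two clusters.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = kmeans(pts, 2)
```

As the text notes, a purely geometric clustering like this ignores natural barriers, which is exactly the refinement the result still needs.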
Also, in the interest of seeing both directions of traffic, I relabeled the trip end-points from "start" and "end" to "A" and "B". For any pair of stations, the "A" station's name occurs alphabetically before the "B" station's. This gives me a direction dimension of "A-B" or "B-A" and allows an analysis such as the following (using Reporting Services) to observe imbalances in the flow of bikes, such as what happens between the Arlington-Courthouse cluster and the larger cluster labeled "Chevy Chase" (in the empty space in the above diagram). There are only 530 trips between these clusters over the 3 months, but generally these seem to be long bike rides with better metro alternatives.
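The relabeling rule is simple enough to state exactly. A sketch (function name is mine) that canonicalizes a station pair and keeps the travel direction as a separate dimension:

```python
def canonical_pair(start_name, end_name):
    """Order a trip's endpoints alphabetically so the same station pair
    always gets the same (A, B) key, with direction kept separately."""
    if start_name <= end_name:
        return start_name, end_name, "A-B"
    return end_name, start_name, "B-A"

# Both directions of the same route collapse onto one (A, B) key.
a, b, direction = canonical_pair("Wilson Blvd", "Ballston Metro")
```

Grouping on the (A, B) key and pivoting on direction is what makes flow imbalances between a station pair visible in one row.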
Again, I’m just playing with the technology and studying this out of my curiosity stemming from my familiarity with my part of town. I made a separate report to explore the trips between these large clusters further.
Most of the trips for the top-level cluster pairs are between the Bethesda and Arlington second-level clusters (the O symbols in the cluster distribution chart), and the traffic falls fairly consistently in the afternoons of every day. These are summarized over about 13 weeks, so the midweek trips may be from a single commuting individual. A separate report (not shown here) seems to confirm this, as the trips mostly start from the same station. The problem with the "Arlington" second-level cluster is that it includes this station on the other side of the river in Georgetown. A better clustering algorithm should account for the river being a natural barrier.
It turns out this bike station is at the bridge, so it is not inconceivable that the commuter works in Rosslyn (a denser office-building location) and either enjoys the walk over the bridge or finds the bike stations in Rosslyn empty at that time of day. Here I'm demonstrating story telling: the data says nothing about whether this is the same person or where he started from. My experience working with similar data is that a recurring pattern for a long-distance pair is most likely a single contributor. Even from Georgetown, a bike commute to Bethesda on these bikes would be decent exercise given the hill involved.
Another analysis I explored involves the different trip distances (near round-trip, last-mile, short commute, and long commute) to see the differences again between registered (gold bars) and casual users (blue bars). The bars are 2-hour blocks and show the popularity of subscribing to the bike program relative to casual use. Tourists are more likely to be casual users, and this is consistent with the chart below showing them most comparable in numbers to registered users for long bike rides between stations separated by more than 5 miles.
Drilling into the rows further distinguishes the registered and casual users (comparable to the difference between locals and visitors). For trips between stations separated by less than .25 miles, there is a big difference in behaviors:
As I discussed above, I treated long-duration trips of over 3 hours as making an unknown stop without redocking the bike. These show up in the chart as "begin" and "end" categories. In story-telling mode again: either the casual riders are going to destinations too distant from nearby bike stations (indicating an unmet need for more stations), or the casual riders are reluctant to make multiple credit-card transactions during the same day. Again, the data is silent about the underlying reason, but there is clearly a difference in behavior between registered and casual users for rides between very close bike stations. A similar pattern exists for stations separated by 1.5-5 miles (note the distance rows are labeled with the lower number of the ranges 0-2.5 and 2.5-5). When registered users use bikes, they are more likely than casual users to go directly to another station rather than stop somewhere in between.
The last chart I want to show is the distribution of distances of bike trips within a high-level cluster. When both ends of the trip belong to the same cluster (charts along the diagonal in the following table), the distances should be shorter than when one end is in another cluster. The chart of distance distributions (in .25-mile units) shows considerable overlap in distances between trips within clusters and trips between clusters. This is a consequence of the goal of having a proper hierarchy, where lower clusters are always proper subsets of higher clusters. Two nearby stations may be assigned to separate clusters that are then assigned to even more remote higher-level clusters. Proper hierarchies are a convenience for analysts: they make computation and visualization easier, consistent with the modern obsession of data science with faster computers and more effective visualizations. The convenience has a disadvantage, however, as illustrated here: nearby stations get assigned to widely separated higher-level clusters, biasing the interpretation of the data.
For this post, I wanted to account for my recent absence from blogging by showing that I've been playing with some data. This example is also a useful analogy for the kind of data I worked on before. In future blog posts, I'll use this analysis and additional reports to illustrate some of my arguments about the risks of being misled by the abundance of data.