In my last post, I described how the term dark data had been in use long before I started using it, and how that earlier use has a very different meaning from the one I’ve been using. Although I rambled on describing my experience with both kinds of problems, I pretty much gave away my limited authority in the field. I don’t claim any authority in data science, only some extensive and intensive experience, mostly figuring things out by myself.
My intention for this blog is primarily to give myself some work to do during my period of dis-employment. I enjoy writing and I draw upon my experiences to inspire new articles. Most of my posts are free-thinking with no real attempt to be authoritative, and personally I lack the authoritative standing to be a source of reference for someone doing serious research.
I mentioned in my about page that I am doing this primarily to recreate the joy of writing that I enjoyed when I was younger, when I wrote a lot on private papers that I subsequently discarded. This is the same thing, only I’m leaving it where it is possible for others to find. One reason for doing this is to motivate me to put more effort into composing a complete thought and to do some editing. It is safe to say that these blog posts have a better quality of writing than my private-paper writings, but I’m not seeking an audience. I’m just permitting an audience. I’m conscientious enough to worry about providing a somewhat worthwhile experience to whoever stumbles upon my writing.
Recently, I noticed that some people have been finding old blog posts directly from the Internet, apparently in doing their own research. I then put myself in their place. I wonder how I would feel if I found what they found after following what appeared to be a promising lead.
I am a big fan of searching for ideas on the Internet, especially while working. Usually I am seeking detailed information from an authoritative source. In such a search, one cannot be blamed for becoming disappointed in finding what I write here. However, personally, I would appreciate this kind of personal generalized reflection even as I do my work. The abstract generalized discussion may do nothing to assist in my current challenges, but I would welcome finding another viewpoint on the issues.
One of the problems with data science is that it is generally a lonely field. A particular project has very specific and often unique data issues that need to be worked out, and the detailed work is usually limited to a very small team. It is not practical or even permissible to discuss detailed problems outside of the team, and it is hard to talk about them in general terms. Thus a lot of the hard work is done in isolation. It would have been nice to find a kindred spirit to at least say you are not alone. Even while I was employed I would have welcomed some non-authoritative ramblings from someone reminiscing about similar past experiences. Occasionally I did find such ramblings and appreciated the shared insight even if it was not helpful in the slightest. It was nice to know I wasn’t alone.
I am tying this back to my earlier post on the other dark data that I designated as unlit data. Unlit data is data that appears in data sets but has no useful purpose: it is not used operationally, and it probably is not documented or validated. It just comes along with the good data, somewhat analogous to packing material, something meant to be discarded after extracting the good stuff. Unlit data may still be useful, just as packing material can sometimes be useful.
I recall an instance of receiving merchandise from a small mail-order outfit where they used a local newspaper wadded up to provide some cushioning for the product inside a shipping box. The newspaper was packing material but it held my attention for a while as I read it to compare the location and date of the paper with the location and date of the shipment, and glanced at the articles to get a sense of what kind of newspaper it was. This is like unlit data.
Another far more distant recollection occurred when I was growing up. We had moved into an old house and at one point decided to replace the linoleum flooring. Under the flooring they used old newspapers as a backing material. The flooring kept the papers fairly well preserved and they were older than I was at the time. It turns out that most if not all of the paper was of Sunday comic sections over multiple weeks. That discovery provided considerable entertainment time to one particular youngster.
I’m very humbled by other bloggers. Many bloggers are producing high-quality material that is worthy of publication and archiving. I don’t even pretend to compete with that kind of blogger. Although they publish in blogs, the material could easily be printed in a magazine if not an academic journal or even a reference book. After reading their biographies, I find that they are blogging this material while they are preparing even more substantive work for formal publication, possibly on completely unrelated topics.
Wearing my information-seeking hat, those are the kinds of bloggers that are very helpful to find when they have something to say about my current challenge.
Blogging is a big space welcoming virtually any kind of written contribution from all perspectives. My writing just happens to appear in the same space as more authoritative blogs. We are somewhat equally likely to be found.
Or at least that is what I thought. I have been noticing a couple of people finding my blog posts from Internet searches. I had assumed I was showing up in some keyword searches but that the search results were not interesting enough to be clicked. I didn’t mind because it is not my intent to be found. But now that I’ve been blogging for several weeks with over 100 posts, I was curious to try to find some old posts by searching on some fragments from those posts.
First I tried just a few words and found multiple pages of search results, but my blog post was not among the first several pages of those results. I then copied a longer passage and placed quotes around it, and the result was a message that no documents were found matching the query.
I know a few people have been finding my blog from outside of WordPress, but how were they finding me?
I had been using Google. Over the years I had used other search engines, but once I got used to Google, it became my primary choice for searching. I hadn’t really paid much attention, but now that I think about it, it is quite different than it was when I first encountered it. In technical terms, they have continued to improve their search engine in ways that improved their business. Obviously, they have been successful. But it finally occurred to me that what may have improved their commercial success may have degraded their value for the kind of stuff I like to find.
In particular, the top pages of Google searches are almost always either very recognizable site names or obvious commercial companies that are either popular or very clever in optimizing their search results. Increasingly, I find myself having to go several pages deep before I start to find the kind of material I once enjoyed finding.
I decided to use the same search terms in Bing. It is also biased toward putting the heavy-hitters on the front page, but at least they were finding my posts in the later pages. At least I knew that WordPress blogs can be found from searches. I rarely use Bing. I assumed they were more or less the same except one perhaps being faster or having more up-to-date information. I didn’t think they would actually come up with exclusively different results.
Then I tried Yahoo! I plugged in some words for a very recent post, and my blog post showed up on the first page. I did not expect that at all. I once was a big fan of Yahoo! but switched because the work environment praised the superiority of Google (this was many years ago). Perhaps this is a reminder of what I liked about it.
As I mentioned, when I am researching a subject I am as interested in finding non-authoritative personal anecdotes as I am in finding authoritative results. An example is searching for some technical information that will allow me to quickly apply that knowledge to my problem. My goal is to find a very technically authoritative answer, ideally one that is an exact match to my problem, but usually I’m satisfied if the information is correct and discussed in enough technical detail for me to understand the concept. In other words, I’m seeking professionally prepared material.
However, in that search, it is reassuring to find others who have encountered the same problem and either didn’t find an answer or discussed their experience in more general personal terms. Naturally, such postings are not useful for solving the problem, but I am still glad to find them. As I mentioned, it can be a very lonely experience to resolve a problem. Even in dense cubicle environments, often the specific problem is one that I alone struggle to solve with no one nearby who can help. In such cases, it is a great relief to know that someone somewhere has encountered a similar experience. In the case of the unanswered postings to a question board, it is nice to know that what is being sought may not have a simple answer. In the case of the generalized thought-piece, it is reassuring to know that someone solved it even if he doesn’t tell me what his secret is.
Perhaps someone will find my postings helpful in a similar way. I’m sharing my possibly similar experiences even if I can’t offer a particular solution. But apparently, such a person would not find me using Google.
Maybe that is a blessing. Again, I am not seeking an audience and I don’t want to disappoint someone in finding that I offer no authority. Google is advanced enough to figure out I am not an authority. I am not wasting anyone’s time and that is great. I’m not doing this for page-views.
Hopefully the people who use Yahoo! are those who are more adventuresome and want to find off-beat ramblings. If this is what they are looking for, then I welcome them and hope I don’t disappoint them.