What is Dark Data?

How often have you spent far too much time digging through the documents and files on your computer, looking for the right one? In this digital world, how often do you actually delete a document from your hard drive? Most of us simply file it away “just in case” we need it later. Electronic files don’t take up much space, and if space ever becomes an issue, we can always add external drives. The real problem arises when you need to find a file. Even with the computer’s search function and its advanced options, you often still get a list of results that’s too long to be helpful.

Dark data is the information organizations collect, process, and store during regular business activities, but generally fail to use for other purposes. Think about it: in the digital era, anyone can create files and store them on their own computer. In a business organization, employees produce information suited to the varying needs and requirements of their respective business units. Dark data is the clutter that accumulates for each user, business unit, and organization. It often makes up most of an organization’s universe of information assets: archives, legacy file shares, backup tapes, and email stores that are predominantly unclassified and not visible or accessible. Data becomes dark because no one is paying attention.

In the field of research, dark data pertains to the primary output of the scientific enterprise that has never been published or otherwise made available to the rest of the scientific community. It refers to any data that is not easily found by potential users. For example, dark data is the type of data that exists only in the bottom left-hand desk drawer of scientists on some media that is quickly aging and soon will be unreadable by commonly available devices. The data remains in this dark desk drawer, inaccessible to the scientific community until the scientist retires, and eventually, this desk drawer will be emptied into a dumpster, unknown to the rest of the world.

Dark data can also refer to data that is not made available to the public. Consider movies in the pre-Internet era. We used to rent movies in VCD and DVD formats from physical storefronts and play them on our own players. The physical inventory of VCDs and DVDs was limited by the shelf space available in the store. Hence, stores would stock only titles that were rented frequently enough to justify the storage space and maximize rental revenue. Less frequently viewed movies were not easily available. Customers might not even have known about films they would have liked to see, because they never saw them in the local store. These films were essentially dark data. However, the Internet changed these economics by separating inventory from the point of sale. To the surprise of many, it turned out that there was a great deal of value in rarely viewed movies. While only a few dozen or a few hundred people may be interested in seeing a boutique title in a particular year, there are many thousands of such titles. Search tools and the Internet allowed people to find and rent boutique films, bringing them to light and onto their television screens.

Unpublished data from “failed” experiments also falls under the dark data category. In this use of the term, “failed” refers not to bad science but to the fact that only positive results tend to be published. Experiments that accurately demonstrate no effect of the treatment condition are valid findings, but they are less likely to be published. The data therefore becomes “dark data,” and later meta-analyses of the literature provide a skewed view of the actual scientific findings. While such unpublished data is indeed difficult to find (and therefore “dark”), many “positive” research findings, and the raw data that lies behind published works, also become difficult or impossible to access as time passes.
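To see how the skew arises, here is a small sketch. Every number in it is hypothetical and invented purely for illustration: ten imagined studies of a treatment with no real effect, of which only the three that happened to show a large positive effect were published. The simple unweighted average is also an assumption made for brevity; real meta-analyses usually weight studies by their precision.

```python
from statistics import mean

# Hypothetical data: (estimated_effect, was_published) for ten imagined studies
# of a treatment whose true effect is zero. Not taken from any real literature.
studies = [
    (0.42, True), (0.35, True), (-0.05, False), (0.02, False), (-0.31, False),
    (0.10, False), (-0.12, False), (0.38, True), (-0.40, False), (0.01, False),
]


def pooled_effect(rows):
    """Unweighted mean effect, a deliberately simplified stand-in for a meta-analysis."""
    return mean(effect for effect, _published in rows)


# Only the published studies are visible to a later literature review.
published = [row for row in studies if row[1]]

all_mean = pooled_effect(studies)          # what actually happened in the labs
published_mean = pooled_effect(published)  # what the literature appears to show

print(f"All studies:       {all_mean:.2f}")
print(f"Published studies: {published_mean:.2f}")
```

With these made-up numbers, the average over all ten studies is close to zero, while the average over the published three suggests a sizeable effect. The seven studies left in the drawer are the dark data.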

Simply put, dark data is data that is not classified, not analyzed, or not put to meaningful use, but that has the potential to help us understand the world around us. Dark data exists throughout the fabric of our material world, especially in our use of technology to store data. There is a wealth of data that is almost impossible to see, but possible to find. Because dark data is difficult to find, it is underutilized and routinely lost. With appropriate planning and technology, this data can be brought to light and made more useful to the community. However, this move comes at a price: the expense of storing and securing the data, together with the risk of exposing it to the public, typically exceeds its potential value. The challenge of dark data then becomes balancing its liability in storage and security against the potential gains from using the information more strategically.

While dark data can simply mean raw, unprocessed data, storing it can become a problem. In the business sphere, the job of managing massive volumes of data falls to employees and IT. The compounding amounts of data can become too big to manage, and finding information becomes increasingly complex. In fact, the cost of holding information goes well beyond the cost of storing it. Because it is unused and unmonitored, dark data also poses a security risk. How do you protect a growing volume of data when you don’t know what you’re protecting?

The challenge is there, and it is up to us to be responsible producers and collectors of data. Consolidate information by bringing data sources together for easier management and searching. It’s easier to make decisions when there is something specific and real driving the need. Bring in the right people to assist you, and document your process. There are many options on the market to help you locate, understand, and ultimately manage your dark data. Technology can assist with data and archive migration. It can also help with data disposition, so that you minimize your data footprint, reduce risk, and simplify your discovery process and obligations. After all, this is not a one-time event, but an ongoing and evolving process.
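As a starting point for locating dark data, a first inventory can be as simple as listing files that nobody has modified in a long time. The sketch below is a minimal illustration of that idea, not a substitute for a dedicated tool. The function name `find_stale_files`, the age threshold, and the choice of modification time as the signal are all assumptions made for this example.

```python
import os
import time
from pathlib import Path


def find_stale_files(root, max_age_days, now=None):
    """Return (path, size_bytes, age_days) for files not modified in max_age_days.

    Hypothetical helper for illustration; `now` can be passed in for testing.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.stat()
            except OSError:
                continue  # unreadable or vanished file; skip it
            if info.st_mtime < cutoff:
                age_days = (now - info.st_mtime) / 86400
                stale.append((str(path), info.st_size, age_days))
    # Largest files first, since they dominate the storage footprint.
    stale.sort(key=lambda row: row[1], reverse=True)
    return stale


if __name__ == "__main__":
    # Example: list files under the current directory untouched for over a year.
    for path, size, age in find_stale_files(".", max_age_days=365):
        print(f"{age:7.0f} days  {size:>12} bytes  {path}")
```

A list like this does not say what the files contain or whether they can be deleted. It only shows where attention has lapsed, which is a reasonable place to begin classifying, migrating, or disposing of data.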

Written by Infinit Datum

Infinit Datum’s content team consists of data and research industry professionals and experts who regularly contribute blog posts related to Big Data, market research, data and research outsourcing, and more.
