Dark Data is on the Rise

May 11, 2023

‘Dark data’ isn't a term that many organisations are familiar with. This is despite the fact that most will have their fair share of dark data lurking in the recesses of their applications and document storage, and the amount of such data is, for most organisations, increasing every year.

So what is it? Dark data is the term given to data and documents that are relatively undiscoverable: you will likely be able to find them if you know where to look, but they will not be found easily.

Dark data can be created in any number of ways, but here are some of the most common:

Siloed systems

Siloed systems can lead to the data equivalent of a desert island. Data is placed in a system that over time falls out of regular use, or whose search capabilities are too limited for the data to be found. Either situation can, and does, lead to dark data.

Complex search syntax

It’s hard to believe, but there are still systems that require the user to learn complex search syntax just to perform a ‘simple’ search. This creates barriers that make the system not just unfriendly to users but, in extreme cases, user-hostile. Users stop using such systems, and the data stored within them quickly becomes dark.

Misspellings

Systems that do not support semantic search rely on ‘classic’ or keyword-based search. This means the query must use exactly the same wording as appears in the document (synonyms are not recognised), and the spelling of what is being sought can be of vital importance. Documents that contain misspellings will likely not be found by a keyword search, and so become dark data.
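To make the misspelling problem concrete, here is a minimal sketch in Python. The two documents and the function names are invented for illustration, and the second search uses simple fuzzy string matching from the standard library's difflib module as a stand-in; it is not semantic search, just a tolerant comparison that shows what exact keyword matching misses.

```python
import difflib

# Hypothetical document store; the ids and text are made up for illustration.
documents = {
    "doc1": "Quarterly revenue forecast for northern region",
    "doc2": "Quartely revenu forcast for southern region",  # contains misspellings
}


def keyword_search(query, docs):
    """Return ids of documents that contain every query word exactly."""
    terms = query.lower().split()
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if all(term in text.lower().split() for term in terms)
    )


def fuzzy_search(query, docs, cutoff=0.8):
    """Return ids of documents where every query word has a close match.

    The cutoff of 0.8 is an arbitrary assumption, not a recommended value.
    """
    terms = query.lower().split()
    hits = []
    for doc_id, text in docs.items():
        words = text.lower().split()
        if all(
            difflib.get_close_matches(term, words, n=1, cutoff=cutoff)
            for term in terms
        ):
            hits.append(doc_id)
    return sorted(hits)


print(keyword_search("revenue forecast", documents))  # ['doc1']
print(fuzzy_search("revenue forecast", documents))    # ['doc1', 'doc2']
```

With exact matching, the misspelled document never appears in the results, which is how it quietly becomes dark. The tolerant search surfaces both.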

Misfiling and mis-tagging

For those of you who didn't grow up in a world of physical filing cabinets full of paper documents, some of the concepts of computer-based document organisation can appear archaic, and they have been shown not to scale well to the needs of modern businesses.

For us, taxonomies belong on this list. A taxonomy is the organisation of items according to a pre-determined paradigm. For example, a company might have a customer folder, beneath which there are countries and regions based on where the customer's main office is located. This is a simple taxonomy and one that exists in millions of businesses around the world.

Some of you will spot the flaws in such a taxonomy. For example, it has a fixed view of the world, one based on countries and regions. If some parts of the company need to look at the world differently, perhaps viewing customers by size rather than geography, then this structure is of no use to them, leading different teams to set up their own ‘renegade’ taxonomies.

One of the biggest challenges with taxonomies is that their complexity grows over time and often subfolders are created, which can cause misfiling.

Thinking about our example again, an opportunity to misfile appears under the USA folder, where there is a city folder for New York City, but clients in this city could be filed in the NY state folder instead.

Some companies will employ ‘librarians’ whose primary role is to keep the taxonomy in good working order and to avoid misfiling.

Tagging is a similar approach, where files or folders can have additional metadata added. This is a great approach for smaller companies and smaller amounts of data; however, it can quickly fall into disrepair with larger teams. The result can be people assuming that they are getting the full picture when they're not.

In the early days of the Internet, before the dominance of Google, there were (and still are) many alternative search engines. Generally these required site owners to add their site themselves, and to do so they often had to identify where in the search engine’s taxonomy it should be found (often a maximum of three places).

This led to confusion, with sites being misclassified and the search engine results becoming inaccurate or manipulated by websites looking for more visitors.

It quickly became apparent to these search engines and their users that taxonomies were not the answer and thankfully, the founders of Google invented a more appropriate model.

Tagging and taxonomies can work very well in smaller, offline or highly curated environments, but they tend to scale poorly and result in a lot of data becoming undiscoverable.

Different words for the same things

This issue bears similarities to misfiling, but it occurs when different departments or people use different names for the same thing.

For example, we see customers being referred to by different names, and by slight variations on those names. It is easy to see how a taxonomy might end up containing areas for Customers, Clients, Active Clients and Prospects. And this is before we introduce the additional complexity of multiple languages within a taxonomy…
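As a rough illustration, the sketch below models those areas as folders and shows how a lookup under one name misses files stored under another. The file names, the SYNONYMS mapping and the decision to treat Customers, Clients and Active Clients as the same thing are all assumptions made for the example; a real organisation would need to agree its own vocabulary.

```python
# Hypothetical mapping from the labels different teams use to one agreed term.
SYNONYMS = {
    "customers": "customer",
    "clients": "customer",
    "active clients": "customer",
}

# Hypothetical taxonomy areas and the files stored in each.
folders = {
    "Customers": ["acme_contract.pdf"],
    "Clients": ["globex_contract.pdf"],
    "Active Clients": ["initech_renewal.pdf"],
    "Prospects": ["umbrella_pitch.pdf"],
}


def canonical(label):
    """Map a folder label to its agreed term, or to itself if none is known."""
    key = label.strip().lower()
    return SYNONYMS.get(key, key)


def files_for(label, all_folders):
    """Collect files from every folder whose label means the same thing."""
    target = canonical(label)
    found = []
    for name, files in all_folders.items():
        if canonical(name) == target:
            found.extend(files)
    return sorted(found)


# Looking in a single folder gives only part of the picture.
print(folders["Clients"])
# Resolving the label first gathers files from all three equivalent folders.
print(files_for("Clients", folders))
```

Someone who only ever opens the Clients folder sees one file out of three, and nothing in the folder tells them that the other two exist.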

When you consider all of these scenarios, it’s easy to see how data becomes dark. But does it really matter? Yes. And in our next article we’ll give you plenty of reasons to sort out your search and banish dark data.