Big Data Cleanliness: What Are the Options?

Big data is a term being thrown around in most businesses these days. With the massive amount of information available through digital records, data has become a gold mine for making business decisions.

The trend does not look likely to go away as systems become more capable and complex. An entire field focused on data science and analysis has emerged from IT.

Although the data is available and prevalent, getting it into usable form can be costly and difficult. Plus, bad data has been shown to point businesses in the wrong direction, leading to costly errors and choices. IBM estimates bad data is costing businesses upwards of $3.1 billion.

However, there may be developments in technology that can make that difficulty a thing of the past.

What is big data?

Originally, big data referred to data sets that were so complex that traditional processing software was unable to handle the information. However, the definition has shifted as IoT devices have flooded the system with data.

Lately, the term big data refers to the use of analytics extracted from the data rather than the size of the data.

Analysis of data sets can help spot business trends, focus on shopper behavior, prevent diseases, and spot crime patterns. This means almost every industry has a purpose for analyzing data, and there is plenty to analyze.

Data sets grow exponentially because of the number of IoT devices regularly feeding into the system.

Basic terminology

People use a lot of terms when discussing big data. While some of the terms seem interchangeable, each represents something different in the field.

Data Mining: the process by which a computer discovers patterns in large data sets. The purpose is to extract information and put it into a structure that is usable for further analysis. This is a portion of knowledge discovery.

Data Analytics: the automation of insights gained from data mining. It assumes the use of queries and aggregation procedures.

Data Analysis: the human portion of gaining insight into a data set. A person is responsible for analysis, rather than an automated system.

Data Warehouse: a method for storing and reporting generalized data, in which data must be in a specific form and structured.

Data Lake: a method for storing data in which data is kept in its natural state and which can support all data sources.
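
To make the "queries and aggregation procedures" in the Data Analytics definition concrete, here is a minimal sketch using Python's built-in sqlite3 module. The sales table, its columns, and its values are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory database so the example leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table of sales records (illustrative names and values).
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 50.0), ("east", 25.0)],
)

# An aggregation query: total sales per region.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
conn.close()

print(totals)
```

The query groups the rows by region and sums each group, which is the kind of automated summary that analytics tools run over much larger data sets.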

Struggles with big data

As mentioned above, the amount of data collected every hour and minute is astounding. The storage method an enterprise chooses determines how much work needs to be put into the data and whether the data is cleaned up before storage or during mining.

However, with both options, human error and bad data are difficulties to overcome.

Data must be clean and in a single format before storage in a data warehouse. The system will reject or delete any data not in the specified format. Storing data in a warehouse can be time-consuming and tedious.
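
The idea of a warehouse rejecting data that is not in the specified format can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the SCHEMA, the conforms and load_into_warehouse helpers, and the sample records are all hypothetical, and a real warehouse enforces its format in its own way.

```python
from datetime import date

# Hypothetical warehouse schema: column name -> required Python type.
SCHEMA = {"order_id": int, "amount": float, "order_date": date}


def conforms(record: dict) -> bool:
    """Return True only if the record has exactly the schema's columns and types."""
    if record.keys() != SCHEMA.keys():
        return False
    return all(isinstance(record[col], typ) for col, typ in SCHEMA.items())


def load_into_warehouse(records):
    """Split incoming records into accepted rows and rejected rows."""
    accepted, rejected = [], []
    for record in records:
        (accepted if conforms(record) else rejected).append(record)
    return accepted, rejected


incoming = [
    {"order_id": 1, "amount": 19.99, "order_date": date(2018, 3, 1)},
    {"order_id": "2", "amount": 5.0, "order_date": date(2018, 3, 2)},  # id is a string
    {"order_id": 3, "amount": 7.5},  # missing order_date
]

accepted, rejected = load_into_warehouse(incoming)
print(len(accepted), "accepted;", len(rejected), "rejected")
```

Two of the three sample records fail the check, which hints at why getting data into a warehouse takes so much upfront effort.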

Similarly, data lakes are not without their issues. Because nothing is formatted upfront, pulling data and putting it into a format that is ready for analysis can be time-consuming. A data scientist can spend more time cleaning data than analyzing data for information.
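
As a rough picture of that cleanup work, the sketch below normalizes a few raw records at read time. Everything in it is an assumption made for illustration: the field names, the sample values, the two date formats, and the rule that a record missing any field is dropped.

```python
from datetime import datetime

# Hypothetical raw records as they might sit in a data lake: no enforced format.
raw_records = [
    {"customer": " Alice ", "amount": "19.99", "date": "2018-03-01"},
    {"customer": "BOB", "amount": "$5", "date": "03/02/2018"},
    {"customer": "", "amount": "n/a", "date": "2018-03-03"},
]

# Assumed date formats present in the source data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Try each known format; return a date, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def clean(record):
    """Return a normalized record, or None if it cannot be repaired."""
    name = record.get("customer", "").strip().title()
    try:
        amount = float(record.get("amount", "").replace("$", "").replace(",", ""))
    except ValueError:
        amount = None
    day = parse_date(record.get("date", ""))
    if not name or amount is None or day is None:
        return None
    return {"customer": name, "amount": amount, "date": day}


cleaned = [row for row in map(clean, raw_records) if row is not None]
print(len(cleaned), "of", len(raw_records), "records usable")
```

Even this toy version needs a separate rule for every kind of inconsistency, which is why cleaning can take longer than the analysis itself.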

Benefit of machine learning

Although data analytics is an automated system, algorithms will not work with unclean data. Sometimes, weeks into processing, the system will break and the results will be worthless.

Unfortunately, clean data is unlikely, because data has accumulated for years with little hygiene. The advent of the data lake promised storage where hygiene wasn’t an issue. However, companies are quickly learning that garbage in yields very little help in the way of analysis.

Acknowledgment of this weakness has caused companies such as IBM to focus on using AI. The new technology, termed a data abstraction layer, is designed to use machine learning to make data more usable. Although it cannot completely automate data cleanup, it promises tools that can clean, append, and make connections where there were none previously.

A step further in machine learning is deep learning, which is based on learning data representations rather than task-specific algorithms. This form of machine learning takes a brain-inspired approach, focusing more on making connections.

However, there is no perfect answer. Most of these companies have partnerships for data warehouses or data lakes, or they offer a cloud-based service for storage. They offer an easy way to clean up bad data, but with the expectation that you will keep your data with them.

Big data is going to get larger as more information is fed into the system regularly. While businesses can glean amazing information regarding behaviors and trends, they will spend a lot of money on getting data into a usable form unless the data is clean.

Or, in worst-case scenarios, money will be wasted making bad decisions.

