What is bad data? Find out in this article
Bad data is a critical yet recurrent issue that affects the efficiency, decision-making, and productivity of any business.
Simply put, bad data refers to incomplete, inaccurate, inconsistent, irrelevant, or duplicate data that sneaks into your data infrastructure for a variety of reasons.

By the end of this article, you will understand:

- What bad data is
- The various types of bad data
- What causes bad data
- Its consequences and preventive measures

So, let's take a closer look.
Data quality and reliability are essential in almost every domain, from business analysis to AI model training. Poor quality data manifests in several different forms, each posing unique challenges to data usability and integrity.
Incomplete data is a dataset that lacks one or more of the attributes, fields, or entries necessary for accurate analysis. The missing information can render the entire dataset unreliable, and sometimes even unusable.
Common causes of incomplete data include intentional omission of specific data, unrecorded transactions, partial data collection, mistakes during data entry, and unseen technical issues during data transfer.

For example, consider a customer survey that is missing records of the respondents' contact details, making it impossible to follow up with them later on.
Another example is a hospital database in which patients' medical records lack crucial information such as allergies and previous medical history; this can even lead to life-threatening situations.
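A simple way to catch incomplete records is to check each entry against the fields your analysis requires. The sketch below illustrates this for a survey scenario; the field names ("name", "email") are illustrative, not from the article.

```python
# Flag records that are missing required fields or have empty values.
REQUIRED_FIELDS = ("name", "email")  # illustrative field names

def find_incomplete(records):
    """Return indices of records missing any required field (or with empty values)."""
    bad = []
    for i, rec in enumerate(records):
        if any(not rec.get(field) for field in REQUIRED_FIELDS):
            bad.append(i)
    return bad

responses = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Bob", "email": ""},   # contact detail left blank
    {"name": "Cleo"},               # email field absent entirely
]
print(find_incomplete(responses))  # → [1, 2]
```

Running such a check before analysis lets you decide early whether to repair, re-collect, or exclude the affected rows.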
Duplicate data occurs when the same data entry, or nearly identical entries, are recorded multiple times within the database. This redundancy leads to misleading analytics and incorrect conclusions, and it can complicate merge operations and cause system glitches. Statistics derived from a dataset containing duplicates are unreliable for decision-making.
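Near-identical duplicates often differ only in letter case or stray whitespace, so a common approach is to normalize each record to a key and keep the first occurrence. A minimal sketch, with illustrative field names:

```python
# Deduplicate customer records after normalizing case and whitespace.
def dedupe(records):
    """Keep the first occurrence of each normalized (name, email) pair."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["name"].strip().lower(), rec["email"].strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

customers = [
    {"name": "Jane Doe", "email": "jane@example.com"},
    {"name": "jane doe ", "email": "JANE@example.com"},  # near-identical duplicate
    {"name": "John Roe", "email": "john@example.com"},
]
print(len(dedupe(customers)))  # → 2
```

Real-world matching may also need fuzzy comparison (e.g. edit distance) for typos, but exact matching on normalized keys catches the bulk of cases.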
Inaccurate data is data that contains incorrect or erroneous information within one or more dataset entries.
A simple mistake in a code or a number, caused by a typographical error or unintentional oversight, can cause severe complications and losses, especially when the data is used for decision-making in a high-stakes domain. The mere presence of inaccurate data diminishes the trustworthiness and reliability of the whole dataset.
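Typographical errors in numeric fields often show up as values far from the rest of the column, so a robust outlier check can surface them for review. The sketch below uses the common modified z-score rule based on the median absolute deviation; the price values and the 3.5 cutoff are illustrative.

```python
# Flag numeric entries that deviate sharply from the median, a common
# symptom of typos such as a misplaced decimal point.
from statistics import median

def suspect_values(values, cutoff=3.5):
    """Return values whose modified z-score exceeds the cutoff."""
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    return [v for v in values if mad and 0.6745 * abs(v - med) / mad > cutoff]

prices = [19.99, 21.50, 20.25, 1999.00, 20.75]  # 1999.00 looks like a typo
print(suspect_values(prices))  # → [1999.0]
```

Flagged values still need human judgment; a large value may be legitimate, so this is a screening step rather than an automatic correction.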
Inconsistent data occurs when different people or teams use varying units or formats for the same type of data within an organization. It is a common cause of confusion and inefficiency, breaking the uniformity of the data and resulting in faulty data processing.
Simply put, outdated data consists of records that are no longer current, relevant, or applicable. It is especially common in fast-moving domains, where rapid changes occur continuously. Depending on the context, data from a decade, a year, or even a month ago may no longer be useful, and may even be misleading.
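If records carry a last-updated timestamp, stale entries can be flagged against a freshness window appropriate to the domain. A minimal sketch, where the 90-day threshold and field names are arbitrary illustrations:

```python
# Flag records older than a chosen freshness window.
from datetime import datetime, timedelta

def stale(records, now, max_age_days=90):
    """Return records whose 'updated' timestamp falls before the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    return [r for r in records if r["updated"] < cutoff]

now = datetime(2024, 6, 1)
records = [
    {"id": 1, "updated": datetime(2024, 5, 20)},
    {"id": 2, "updated": datetime(2023, 1, 5)},  # long out of date
]
print([r["id"] for r in stale(records, now)])  # → [2]
```

The right window varies by domain: price data may go stale in hours, while address data might be trusted for a year.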
Furthermore, non-compliant, irrelevant, unstructured, and biased data are also types of bad data that can compromise the data quality in your data ecosystem. Understanding each of these various bad data types is essential for realizing their root causes and the threats they pose to your business and for devising strategies to mitigate the impact.
Now that you have a clear understanding of the types of bad data, it's important to understand what causes it, so that you can take proactive measures to prevent such occurrences in your datasets.
Common causes of bad data include:
If you process datasets that contain bad data, you put your end analysis at risk. In fact, bad data can have long-lasting and devastating impacts, especially on data-driven businesses and domains, such as:
In addition, bad data can lead to critical errors that accelerate into legal or life-threatening complications, especially in the financial and healthcare domains.
For instance, in 2020, during the COVID-19 pandemic, Public Health England (PHE) experienced a significant data management error that left 15,841 COVID-19 cases unreported. The issue was traced back to the outdated XLS spreadsheet format PHE was using, which can hold only about 65,000 rows, rather than the million-plus rows the modern XLSX format supports. Some of the records provided by the third-party firms analyzing swab tests were lost, resulting in incomplete data. Roughly 50,000 close contacts at risk of infection were missed because of this technical error.
Similarly, Samsung's "fat-finger" error in 2018 dropped the company's stock price by around 11% within a single day, wiping out nearly $300 million of market value. A Samsung Securities employee made a data entry mistake, entering 2.8 billion "shares" (worth $105 billion) instead of 2.8 billion "South Korean won" to be distributed among employees who took part in the company stock ownership plan.

Therefore, the consequences of bad data should not be taken lightly, and proper preventive measures must be taken to mitigate the risk.
No dataset is perfect. Your data is bound to have errors. The first step to preventing bad data is acknowledging this reality so that you can implement necessary preventive strategies to ensure data quality.
Some steps to prevent bad data include:
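One widely used preventive measure is validating records at the point of entry, so malformed data never reaches the dataset. A minimal sketch, where the rules and field names are illustrative assumptions:

```python
# Validate a record at the point of entry; reject it if any rule fails.
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # simple illustrative pattern

def validate(record):
    """Return a list of validation errors; an empty list means the record is clean."""
    errors = []
    if not record.get("name"):
        errors.append("name is required")
    if not EMAIL_RE.match(record.get("email", "")):
        errors.append("email is malformed")
    if not (0 <= record.get("age", -1) <= 130):
        errors.append("age out of range")
    return errors

print(validate({"name": "Ada", "email": "ada@example.com", "age": 36}))  # → []
print(validate({"name": "", "email": "not-an-email", "age": 200}))
```

Collecting all errors at once, rather than stopping at the first, gives the data producer a complete picture of what to fix.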
This article explored what bad data is, the different types of bad data you may encounter, and their causes. In addition, it highlighted the significant negative impact of bad data on a data-driven organization, from financial losses to business failures. Understanding these factors is the first step in preventing bad data.
Even though there are multiple preventive strategies to ensure data quality, employing a reliable tool specifically designed for the cause is bound to take the load off your shoulders.
Consider using data scraping tools that automatically build reliable, clean datasets. Such tools reduce the effort needed on your end and leave you with clean, directly usable data, increasing your productivity.
Thank you for reading.