Understanding Data Quality in Practice
Data quality refers to the extent to which data supports its intended use. Research consistently shows that data quality is multi-dimensional, encompassing factors such as completeness, consistency, relevance, and timeliness, rather than accuracy alone (Wang and Strong, 1996; Pipino, Lee and Wang, 2002). Data may appear correct at a record level but still be unsuitable for analysis if key values are missing, categories are unclear, or records are duplicated. These dimensions provide a practical framework for recognising when data may limit confidence in analytical outputs.
Common Data Quality Issues
Several data quality issues recur across organisations and sectors. Missing or incomplete data is one of the most frequently observed problems, often caused by optional data entry, system changes, or inconsistent collection practices (Batini and Scannapieco, 2016). Missing values can reduce the reliability of summaries and limit the conclusions that can be drawn from analysis.
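A simple completeness check can make missing values visible before analysis begins. The sketch below counts empty or null values per field across a small set of illustrative records; the field names ("customer_id", "region", "spend") are assumptions for the example, not part of any particular system.

```python
from collections import Counter

# Hypothetical records; field names are illustrative only.
records = [
    {"customer_id": "C001", "region": "North", "spend": "120.50"},
    {"customer_id": "C002", "region": "", "spend": "85.00"},
    {"customer_id": "C003", "region": "South", "spend": None},
]

def count_missing(rows):
    """Count empty or None values per field across all rows."""
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None or str(value).strip() == "":
                missing[field] += 1
    return dict(missing)

print(count_missing(records))  # {'region': 1, 'spend': 1}
```

A per-field count like this shows at a glance which fields are reliable enough to summarise and which may need collection practices reviewed.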
Duplicate records commonly occur when data is captured from multiple systems or entered manually. If not identified early, duplicates can inflate counts, distort trends, and lead to incorrect interpretations (Rahm and Do, 2000).
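Duplicates of this kind can often be surfaced by counting occurrences of a key field. This is a minimal sketch, assuming records carry a "customer_id" key; real matching across systems usually also needs fuzzy comparison of names and addresses.

```python
from collections import Counter

# Illustrative records, as if captured from two hypothetical source systems.
records = [
    {"customer_id": "C001", "name": "Acme Ltd"},
    {"customer_id": "C002", "name": "Beta plc"},
    {"customer_id": "C001", "name": "ACME LIMITED"},  # same entity, different spelling
]

def find_duplicate_keys(rows, key):
    """Return the key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return [k for k, n in counts.items() if n > 1]

print(find_duplicate_keys(records, "customer_id"))  # ['C001']
```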
Inconsistent formats and classifications also undermine data quality. Variations in date formats, category labels, or coding standards make comparison and aggregation difficult and increase the risk of analytical error (Kim et al., 2003).
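Date formats are a common case of this problem, and standardising them is usually mechanical once the formats in use are known. The sketch below tries a small set of assumed input formats and converts each date to ISO 8601; values that match no known format are flagged rather than guessed.

```python
from datetime import datetime

# Dates entered in mixed formats; the accepted formats here are assumptions.
raw_dates = ["2024-03-01", "01/03/2024", "1 Mar 2024"]

def normalise_date(text, formats=("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y")):
    """Try each known format and return an ISO 8601 date string."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # unrecognised format: flag for manual review rather than guessing

print([normalise_date(d) for d in raw_dates])  # ['2024-03-01', '2024-03-01', '2024-03-01']
```

Returning None for unrecognised values keeps the decision about ambiguous dates with a human reviewer, which matters when day/month order cannot be inferred from the value alone.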
Another frequent issue is the presence of outdated or obsolete data. Data that was accurate at the point of collection may no longer reflect current conditions, particularly where records are not regularly reviewed or maintained (Batini and Scannapieco, 2016).
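Where records carry a last-review date, staleness can be checked directly. This is a sketch under assumptions: the "last_reviewed" field and the one-year threshold are illustrative, and an appropriate threshold depends on how quickly the underlying conditions change.

```python
from datetime import date

# Illustrative records; the review-age threshold is an assumption.
records = [
    {"id": "A", "last_reviewed": date(2024, 6, 1)},
    {"id": "B", "last_reviewed": date(2021, 1, 15)},
]

def stale_records(rows, as_of, max_age_days=365):
    """Return records whose last review is older than the threshold."""
    return [r for r in rows if (as_of - r["last_reviewed"]).days > max_age_days]

print([r["id"] for r in stale_records(records, as_of=date(2025, 1, 1))])  # ['B']
```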
Finally, incorrect or misclassified values can arise through human error, system defaults, or unclear definitions. These issues are often subtle and may only become apparent once data is explored or combined with other datasets (Rahm and Do, 2000).
Identifying Data Quality Issues
Identifying data quality issues requires deliberate inspection before analysis begins. Techniques such as sorting, filtering, and reviewing frequency counts can reveal missing values, duplicates, and inconsistent categories. Checking value ranges and distributions can help highlight errors that fall outside expected limits (Kim et al., 2003).
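Two of these inspection techniques, frequency counts and range checks, can be sketched in a few lines. The rows, field names, and the 0–120 expected range below are assumptions for illustration.

```python
from collections import Counter

# Illustrative rows; field names and the expected age range are assumptions.
rows = [
    {"status": "Open", "age": 34},
    {"status": "open", "age": 29},
    {"status": "Closed", "age": 240},  # value outside expected limits
]

# Frequency counts reveal inconsistent category labels ('Open' vs 'open').
print(Counter(r["status"] for r in rows))

# Range checks flag values that fall outside expected limits.
out_of_range = [r for r in rows if not (0 <= r["age"] <= 120)]
print(out_of_range)  # [{'status': 'Closed', 'age': 240}]
```

Neither check proves the data is wrong; each simply points the reviewer at values that merit a closer look before analysis.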
This activity focuses on recognising faults rather than conducting formal audits. The aim is to understand where data may be unreliable and why, enabling informed decisions about whether and how the data should be improved prior to analysis.
Improving Data Quality Through Early Corrective Actions
Once issues have been identified, early corrective actions can improve data quality and usability. Common actions include removing duplicate records, correcting obvious errors, standardising formats, and documenting known limitations where issues cannot be fully resolved (Rahm and Do, 2000).
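These corrective actions can be combined into a single pass over the data. The sketch below, using hypothetical records and field names, deduplicates on a key, standardises a label format, and records limitations it cannot resolve rather than guessing at missing values.

```python
# A minimal cleaning sketch: deduplicate on a key, standardise labels,
# and document limitations that cannot be resolved automatically.
records = [
    {"id": "1", "region": "north"},
    {"id": "1", "region": "North"},   # duplicate id
    {"id": "2", "region": ""},        # unresolvable gap
]

def clean(rows, key="id"):
    seen, cleaned, limitations = set(), [], []
    for row in rows:
        if row[key] in seen:
            continue  # drop duplicate records, keeping the first occurrence
        seen.add(row[key])
        row = dict(row, region=row["region"].strip().title())  # standardise format
        if not row["region"]:
            limitations.append(f"record {row[key]}: region missing")  # document, don't guess
        cleaned.append(row)
    return cleaned, limitations

cleaned, notes = clean(records)
print(cleaned)  # [{'id': '1', 'region': 'North'}, {'id': '2', 'region': ''}]
print(notes)    # ['record 2: region missing']
```

Keeping a log of unresolved limitations alongside the cleaned data means later analysis can state its caveats explicitly instead of silently inheriting them.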
Research shows that addressing data quality issues early reduces the risk of compounding errors and supports more reliable analytical outcomes, particularly in environments where data is reused for multiple purposes (Abedjan et al., 2016). These early improvements prepare datasets for subsequent analysis and any later validation or auditing activities.
Action Point
Select a dataset used within your area of responsibility, such as performance, operational, or customer data. Review the data to identify at least two data quality issues, for example, missing values, duplicates, or inconsistent categories. Explain how these issues could affect decisions or reporting, and describe the steps you would take to improve the data before it is used for insight.