By: Alex Berson and Larry Dubov
Service provider takeaway:As any data management consultant knows, one of the key components of any successful data strategy is data quality. This section of the chapter excerpt from the book Mastering Data Management and Customer Integration for a Global Enterprise explores the concepts behind data quality.
Download the .pdf of the chapter here.
Data quality is one of the key components of any successful data strategy and data governance initiative, and is one of the core enabling requirements for Master Data Management and Customer Data Integration.
Indeed, creating a new system of record from information of low quality is almost an impossible task. Similarly, when data quality is poor, matching and linking records for potential aggregation will most likely result in low match accuracy and produce an unacceptable number of false negative and false positive outcomes.
Valuable lessons about the importance of data quality are abundant, and data quality concerns confronted data architects, application designers, and business users even before the problem started to manifest itself in the early data integration programs such as Customer Information Files (CIF), early implementations of data warehouses (DW), Customer Relationship Management (CRM), and Business Intelligence (BI) solutions. Indeed, if you look at a data integration solution such as a data warehouse, published statistics show that as high as 75 percent of the data warehouse development effort is allocated to data preparation, validation, and extraction, transformation, and loading (ETL). Over 50 percent of these activities are spent on cleansing and standardizing the data.
Although there is a wide variety of ETL and data cleansing tools that address some of the data quality problem, data quality continues to be a complex, enterprise-class challenge. Part of the complexity that needs to be addressed is driven by the ever-increasing performance requirements. A data cleansing tool that would take more than 24 hours to cleanse a customer file is a poor choice for a real-time or a web-based customer service application. As the performance and throughput requirements continue to increase, the functional and technical capabilities of the data quality tools are sometimes struggling to keep up with the demand.
But performance is not the primary issue. A key challenge of data quality is an incomplete or unclear set of semantic definitions of what the data is supposed to represent, in what form, with what kind of timeliness requirements, etc. These definitions are ideally stored in a metadata repository. Our experience shows that even when an enterprise adapts a metadata strategy and implements a metadata repository, its content often contains incomplete or erroneous (poor quality) definitions. We'll discuss metadata issues in more details later in this chapter.
The quality of metadata may be low not because organizations or data stewards do not work hard on defining it, but primarily because there are many data quality dimensions and contexts, each of which may require a different approach to the measurement and improvement of the data quality. For example, if we want to measure and improve address information about the customers, there are numerous techniques and reference data sources that can provide an accurate view of a potentially misspelled or incomplete address. Similarly, if we need to validate a social security number or a driver license number, we can use a variety of authoritative sources of this information to validate and correct the data. The problem becomes much harder when you deal with names or similar attributes for which there is no predefined domain or a business rule. For example, "Alec" may be a valid name or a misspelled "Alex." If evaluated independently, and not in the context of, say, postal information about a name and the address, this problem often requires human intervention to resolve the uncertainty.
Finally, as the sophistication of the data quality improvement process grows, so do its cost and processing requirements. It is not unusual to hear that an organization would be reluctant to implement an expensive data quality improvement system because, according to them, "…so far the business and our customers do not complain, thus the data quality issue must not be as bad as you describe." This is not an invalid argument, although it may be somewhat shortsighted from the strategic point of view, especially since many aspects of data quality fall under government- and industry-regulated requirements.
Data Quality Tools and Techniques
There are many tools that automate portions of the tasks associated with cleansing, extracting, loading, and auditing data from existing data stores into a new target environment, be it a data warehouse or a CDI Data Hub. Most of these tools fall into several major categories:
- Auditing tools These tools enhance the accuracy and correctness of the data at the source. These tools generally compare the data in the source database to a set of business rules that are either explicitly defined or automatically inferred from a scan operation of the data file or a database catalog. Auditing tools can determine the cardinality of certain data attributes, value ranges of the attributes in the data set, and the missing and incomplete data values, among other things. These tools would produce various data quality reports and can use their output to automate certain data cleansing and data correction operations.
- Data cleansing tools These tools would employ various deterministic, probabilistic or machine learning techniques to correct the data problems discovered by the auditing tools. These tools generally compare the data in the data source to a set of business rules and domain constraints stored in the metadata repository or in an external rules repository. Traditionally, these tools were designed to access external, reference data such as a valid name and address file from an external "trusted" data provider (e.g., Acxiom or Dun & Bradstreet), or an authoritative postal information file (e.g., National Change of Address [NCOA] file), or to use a service that validates social security numbers. The data cleansing process improves the quality of the data and potentially adds new, accurate content. Therefore, this process is sometimes referred to as data enrichment.
- Data parsing and standardization tools The parsers would break a record into atomic units that can be used in subsequent steps. For example, such a tool would parse one contiguous address record into separate street, city, state, and zip code fields. Data standardization tools convert the data attributes to what is often called a canonical format or canonical data model -- a standard format used by all components of the data acquisition process and the target Data Hub.
- Data extraction, transformation, and loading (ETL) tools are not data quality tools in the pure sense of the term. ETL tools are primarily designed to extract data from known structures of the source systems based on prepared and validated source data mapping, transforming input formats of the extracted fi les into a predefined target data store format (e.g., a Data Hub), and loading the transformed data into a target data environment, e.g., the Data Hub. Since ETL tools are aware of the target schema, they can prepare and load the data to preserve various integrity constraints including referential integrity and the domain integrity constraints. They can filter out records that fail a data validity check, and usually produce exception reports used by data stewards to address data quality issues discovered at the load stage. This functionality helps ensure data quality and integrity of the target data store, which is the reason we mentioned ETL tools in this section.
- Hybrid packages These packages may contain a complete set of ETL components enriched by a data parser and a standardization engine, the data audit components, and the data cleansing components. These extract, parse, standardize, cleans, transform, and load processes are executed by a hybrid package software in sequence and load consistently formatted and cleansed data into the Data Hub.
About the book
Master Data Management and Customer Data Integration for a Global Enterprise explains how to grow revenue, reduce administrative costs, and improve client retention by adopting a customer-focused business framework. Learn to build and use customer hubs and associated technologies, secure and protect confidential corporate and customer information, provide personalized services, and set up an effective data governance team. Purchase the book from McGraw-Hill Osborne Media.
Reprinted with permission from McGraw-Hill from Master Data Management and Customer Data Integration for the Global Enterprise by Alex Berson and Larry Dubov (McGraw-Hill, 2007)