A retrospective look at the wave of consolidations among vendors in the data quality and data management industry shows one specific similarity across the board – the transition to building an end-to-end data management suite is incomplete without the incorporation of a data profiling product. The reason is that many good data management practices depend on a clear understanding of “content,” ranging from specific data values to the characteristics of the data elements holding those values, relationships between data elements across records in one table, and associations across multiple tables. It is worth reviewing some basic application contexts in which profiling plays a part; doing so demonstrates how a collection of relatively straightforward analytic techniques can be combined to shed light on the utility of information for multiple purposes.
One might presume that when operating in a well-controlled data management framework, data analysts will have some understanding of what types of issues and errors exist within various data sets. But even in these environments there is often little visibility into data peculiarities in relation to existing data dependencies, let alone in situations where data sets are reused for new or alternate purposes.
So to get a handle on data set usability, there must be a process to establish a baseline measure of the quality of the data set, distinct from any specific downstream application use. Anomaly analysis provides that initial baseline review by empirically analyzing the values in a data set to look for unexpected behavior. Essentially, anomaly analysis:
- Executes a statistical review of the data values stored in all the data elements in the data set,
- Examines value frequency distributions,
- Examines the variance of values,
- Logs the percentage of data attributes populated,
- Explores relationships between columns, and
- Explores relationships across data sets
to reveal potentially flawed data values, data elements, or records. Discovered flaws are typically documented and can be brought to the attention of the business clients to determine whether each flaw has any critical business impact.
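The steps above can be sketched in a few lines of Python; the sample records, column names, and thresholds below are illustrative assumptions rather than anything prescribed by a particular profiling tool:

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical sample records; values are contrived to exhibit anomalies.
records = [
    {"id": 1, "state": "NY", "age": 34},
    {"id": 2, "state": "NY", "age": 36},
    {"id": 3, "state": "ny", "age": 35},   # lowercase variant: a value anomaly
    {"id": 4, "state": "NY", "age": None}, # missing value
    {"id": 5, "state": "NY", "age": 350},  # numeric outlier, likely an entry error
]

def profile_column(records, column):
    """Baseline statistics for one data element: fill rate,
    value frequency distribution, and (for numbers) mean/std dev."""
    values = [r.get(column) for r in records]
    present = [v for v in values if v is not None]
    stats = {
        "fill_rate": len(present) / len(values),
        "frequencies": Counter(present),
    }
    numeric = [v for v in present if isinstance(v, (int, float))]
    if numeric:
        stats["mean"] = mean(numeric)
        stats["stdev"] = pstdev(numeric)
    return stats

def flag_anomalies(records, column, rare_pct=0.25, z=1.5):
    """Flag sparsely populated columns, rare values, and numeric outliers."""
    stats = profile_column(records, column)
    flags = []
    if stats["fill_rate"] < 1.0:
        flags.append(f"{column}: only {stats['fill_rate']:.0%} populated")
    total = sum(stats["frequencies"].values())
    for value, count in stats["frequencies"].items():
        if count / total < rare_pct:
            flags.append(f"{column}: rare value {value!r} ({count}/{total})")
        if "stdev" in stats and stats["stdev"] > 0 and isinstance(value, (int, float)):
            if abs(value - stats["mean"]) > z * stats["stdev"]:
                flags.append(f"{column}: outlier {value!r}")
    return flags
```

In practice a profiling tool computes these statistics across every data element in the data set and surfaces the flagged items for business review.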
Data Reverse Engineering
The absence of documented knowledge about a data set (which drives the need for anomaly analysis) also creates the need for a higher-level understanding of the data set's definitions, reference data, and structure – its metadata. Data reverse engineering reviews the structure of a data set for which there is little or no existing metadata, or for which the existing metadata is suspect, in order to discover and document the actual current state of its metadata.
In this situation, data profiling is employed to incrementally build up a knowledge base of data element structure and use. Column values are analyzed to determine whether there are commonly used value domains and whether those domains map to known conceptual value domains, to review the size and type of each data element, to identify any embedded pattern structures associated with any data element, and to identify keys and how those keys are used to refer to other data entities.
The results of this reverse engineering process can be used to populate a metadata repository. The discovered metadata can be used to facilitate dependent development activities such as business process renovation, enterprise data architecture, or data migrations.
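As a rough sketch of this kind of inference (the table, column names, and pattern notation are hypothetical), one can generalize each value into a symbolic pattern and test each column for key candidacy:

```python
import re

# Illustrative records for which no metadata exists; names are invented.
rows = [
    {"cust_id": "C-1001", "zip": "10001", "phone": "212-555-0101"},
    {"cust_id": "C-1002", "zip": "10002", "phone": "212-555-0102"},
    {"cust_id": "C-1003", "zip": "10001", "phone": "646-555-0103"},
]

def value_pattern(value):
    """Generalize a value into a symbolic pattern: digits become '9',
    letters become 'A', and everything else is kept literally."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

def reverse_engineer(rows):
    """Infer per-column patterns, lengths, and candidate keys from the data."""
    metadata = {}
    for column in rows[0]:
        values = [r[column] for r in rows]
        metadata[column] = {
            "patterns": {value_pattern(v) for v in values},
            "max_length": max(len(v) for v in values),
            # A column whose values are all distinct is a candidate key.
            "candidate_key": len(set(values)) == len(values),
        }
    return metadata
```

Here `cust_id` collapses to the single pattern `A-9999` and proves unique, suggesting both an embedded structure and a key, while the repeated `zip` values rule that column out as a key.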
Data Quality Rule Discovery
The need to observe dependencies within a data set manifests itself through the emergence (either by design or organically through use) of data quality rules. In many situations, though, there is no documentation of the rules for a number of reasons.
As one example, the rules are deeply embedded in application code and have never been explicitly associated with the data. As another, the system may have inadvertently constrained users from completing a task, and user behavior has evolved to observe unwritten rules that enable the task to be performed anyway.
Data profiling can be used to examine a data set to identify and extract embedded business rules, whether they are intentional but undocumented, or purely unintentional. These rules can be combined with predefined data quality expectations and used as the targets for data quality auditing and monitoring.
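A minimal illustration of rule discovery, assuming a hypothetical orders table: the sketch proposes null/not-null rules conditioned on another column's value, using a confidence threshold so that a few violating records do not hide an otherwise strong rule:

```python
from collections import defaultdict

# Hypothetical order records; the rule discovered below is documented nowhere.
orders = [
    {"status": "SHIPPED", "ship_date": "2024-01-05"},
    {"status": "SHIPPED", "ship_date": "2024-01-06"},
    {"status": "SHIPPED", "ship_date": "2024-01-07"},
    {"status": "PENDING", "ship_date": None},
    {"status": "PENDING", "ship_date": None},
    {"status": "PENDING", "ship_date": "2024-01-02"},  # violates the emerging rule
]

def discover_null_rules(records, cond_col, target_col, confidence=0.9):
    """Propose rules of the form 'when cond_col = X, target_col is (not) null'
    that hold with at least the given confidence."""
    by_value = defaultdict(list)
    for r in records:
        by_value[r[cond_col]].append(r[target_col] is not None)
    rules = []
    for value, populated in by_value.items():
        ratio = sum(populated) / len(populated)
        if ratio >= confidence:
            rules.append(f"when {cond_col}={value!r}, {target_col} is populated")
        elif 1 - ratio >= confidence:
            rules.append(f"when {cond_col}={value!r}, {target_col} is null")
    return rules
```

Lowering the confidence threshold (say, to 0.6) surfaces the approximate rule for pending orders despite the one violating record – exactly the kind of candidate rule a business client would then confirm or reject.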
Auditing and Monitoring
Data profiling provides the ability to measure compliance with defined data rules. Data quality service level agreements and the data quality scorecard can be monitored by using a data profiling tool to periodically review data sets against defined rules, producing metrics that demonstrate whether defined expectations are being met. Conversely, non-observance of defined rules can point out potential process failures and help in root cause analysis.
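One way to sketch such an audit (the rule names, thresholds, and records below are invented for illustration) is to score each defined rule against the data set and compare the score with its SLA threshold:

```python
# Illustrative data quality rules and SLA thresholds; all names are assumptions.
rules = {
    "age_in_range": lambda r: r["age"] is None or 0 <= r["age"] <= 120,
    "state_present": lambda r: bool(r.get("state")),
}
sla_thresholds = {"age_in_range": 0.95, "state_present": 0.99}

customers = [
    {"state": "NY", "age": 34},
    {"state": "CA", "age": 41},
    {"state": "FL", "age": 29},
    {"state": "TX", "age": 200},  # out of range: a rule violation
]

def audit(records, rules, thresholds):
    """Score each rule's compliance and compare against its SLA threshold."""
    report = {}
    for name, rule in rules.items():
        passed = sum(1 for r in records if rule(r))
        score = passed / len(records)
        report[name] = {"score": score, "meets_sla": score >= thresholds[name]}
    return report
```

Run periodically, these scores feed the scorecard; a rule that drops below its threshold becomes the starting point for root cause analysis.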
Metadata Compliance and Data Model Integrity
The results of profiling can also be used to determine the degree to which the data actually observes any already existent metadata, ranging from data element specifications to validity rules associated with table consistency (such as uniqueness of a primary key) and referential integrity constraints across tables.
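Checks like these are simple to express; the following sketch, with hypothetical parent and child tables, tests primary key uniqueness and referential integrity:

```python
# Hypothetical parent and child tables; column names are illustrative.
customer_rows = [{"cust_id": 1}, {"cust_id": 2}, {"cust_id": 2}]  # duplicate key
order_rows = [
    {"order_id": 10, "cust_id": 1},
    {"order_id": 11, "cust_id": 5},  # references a customer that does not exist
]

def check_primary_key(rows, key):
    """A primary key must be unique across all rows of the table."""
    values = [r[key] for r in rows]
    return len(set(values)) == len(values)

def check_referential_integrity(child_rows, fk, parent_rows, pk):
    """Every foreign key value must appear among the parent's key values;
    returns the orphaned child rows, if any."""
    parent_keys = {r[pk] for r in parent_rows}
    return [r for r in child_rows if r[fk] not in parent_keys]

# check_primary_key(customer_rows, "cust_id") -> False (cust_id 2 repeats)
# check_referential_integrity(order_rows, "cust_id", customer_rows, "cust_id")
#   -> [{"order_id": 11, "cust_id": 5}], the orphaned order
```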
Data quality application developers can build upon these analysis paradigms to implement a number of different data quality management services. These types of services, all built on top of a data profiling platform, address multiple stages of the data quality “virtuous cycle,” enabling assessment, definition of target objectives and threshold scores for acceptability, issue identification and logging, and root cause analysis.
SOURCE: Application Contexts for Data Profiling, by David Loshin