Data Quality in SQL Server 2014 for dummies (Part 2)

 Jun 26, 2014

In my last blog post, we had a mini crash course on data quality and the three elements essential to data quality – knowledge base, domain, and Reference Data Service (RFS). If you didn’t get a chance to read it, you can access it here. It’ll definitely help you better understand today’s blog post, where I’ll go through another crash course focusing on creating knowledge bases and data cleansing. Creating a knowledge base Building a Data Quality Services (DQS) knowledge base is an iterative process that involves the following:
  1. Knowledge discovery – using existing data to identify domain values, and
  2. Domain management – categorising discovered values as valid, invalid, or errors; specifying synonyms and leading values; defining correction rules, and other domain configuration tasks.
The data steward can create the initial knowledge base from scratch, base it on an existing knowledge base, or import a knowledge base from a data file. Then the knowledge discovery process is used to identify data fields that need to be managed, map these fields to domains in the knowledge base (which can be created during knowledge discovery if required), and identify values for these fields. After the knowledge discovery process has populated the knowledge base, the data steward manages the domains to control how DQS validates and corrects data values. Additionally, domain management may include configuring reference data services, or setting up term based or cross-field relationships. Creating a knowledge base is not a one-time activity. A data steward will continually use the knowledge discovery and domain management processes to enhance the knowledge base and manage the quality of new data values and domains. Using DQS to cleanse data One of the major tasks that a data quality management solution must do is to cleanse data by validating and correcting domain values. In this section, we’ll take a look at creating a data cleansing project, viewing cleansed data and using the data cleansing flow transformation. Creating a data cleansing project Data stewards can use the Data Quality Client application to create a data cleansing project that applies the knowledge in a knowledge base to data in a SQL Server database or an Excel workbook. When creating a data cleansing project, the data steward must:
  1. Select the knowledge base to use for the project and specify that the action to be performed is cleansing.
  2. Select the data source containing the data to be cleansed and map the columns in it to the domains in the knowledge base.
  3. Run the data cleansing process and then review the suggestions and corrections generated by DQS. The data steward can then approve or reject the suggestions and corrections.
  4. Export the cleansed data to a database table, comma-delimited file, or Excel workbook.
Viewing cleansed data The output from a data cleansing project includes the cleansed data as well as additional information about the corrections made by DQS. The columns in the output are named by combining the name of the domain and the type of data in the column. For example, the cleansed output for a domain named ‘State’ is stored in a column named ‘State_Output.’ Cleansed data output includes the following types of column:
  • Output – the values for all fields after data cleansing. All fields in the original data source generate output columns, even those not mapped to domains in the knowledge base (in which case they contain the original data values).
  • Source – the original value for fields that were mapped to domains and cleansed.
  • Reason – the reason the output value was selected by the cleansing operation. For example, a valid value might be corrected to a leading value defined for the domain, or DQS might have applied a cleansing algorithm and suggested a corrected value.
  • Confidence – an indication of the confidence DQS estimates for corrected values. For values corrected to leading values defined in the knowledge base, this is usually 1 (or 100%). When Data Quality Services uses a cleansing algorithm to suggest a correction, the confidence is a value between 0 and 1.
  • Status – the status of the output column. A value of correct indicates that the original value was already correct, and a value of corrected indicates that Data Quality Services changed the value.
Using the data cleansing data flow transformation In addition to creating data cleansing projects to cleanse data interactively, you can use data cleansing transformation to perform data cleansing in an SSIS data flow. Using data cleansing transformation enables you to automate data cleansing as a part of the extract, transform, and load (ETL) processes used to populate your data warehouse. To add the data cleansing transformation to a data flow in an SSIS package, you will need to perform the following steps:
  1. Add the data cleansing transformation to the data flow and drag a data flow connection from a source or transformation that contains the data you want to cleanse to the input of the data cleansing transformation.
  2. Edit the settings of the data cleansing transformation to connect to the DQS, specify the knowledge base you want to use, and map the input columns in the data flow to domains in the knowledge base.
  3. Drag the output from the data cleansing transformation to the next transformation or destination in the data flow, and map the output columns from the data cleansing transformation to the appropriate input columns in the transformation or destination. The output columns from the data cleansing transformation are the same as those generated by an interactive data cleansing project.

How do your Excel skills stack up?   

Test Now  

About the Author:

Adam Keats  

With over 25 years of real-world IT experience Adam is one of New Horizons’ most senior Database and Software Development trainers. Starting out as a physics teacher Adam displayed exceptional mathematical and technical capabilities early on in his career. He went on to work in a variety of programming and technical management roles within several government agencies including the Department of Defence and Maritime Patrol. In 1998 Adam found his true calling, gaining MCT status and thus beginning his technical training career specialising in SQL Server administration, development, BI, and .NET development. Since then he has worked for several training organisations and found a home at New Horizons where he is now our resident Database and Development specialist. Throughout his tenure at New Horizons, Adam has assisted over 500 students in their endeavours to improve their skills, knowledge, and to achieve industry certifications.