Data Quality in SQL Server 2014 for dummies (Part 1)

 Jun 20, 2014

Data quality is a major concern for anyone building a data warehousing solution. As organisations consume more data from more data sources, the need for data quality management has become increasingly common. Data quality is especially important in a Business Intelligence solution, because the reports and analysis generated from data in the data warehouse can form the basis of important business decisions. Business users must be able to trust the data they use to make these decisions. In today’s post, I’ll highlight some common data quality issues, introduce Data Quality Services in SQL Server 2014, and finish with a crash course on three elements that are key to data quality.

Common Data Quality Issues

Below are three data quality issues that frequently occur:
  • Invalid data values – for example, an organisation might categorise its stores as “wholesale” or “retail.” However, a user might use an application that allows free-form data entry to create a store with a category of “reseller” instead of “retail,” or they might accidentally type “whalesale” instead of “wholesale.” Any analysis or reporting that aggregates data by store type will then produce inaccurate results because of the additional, invalid categories.
  • Inconsistencies – for example, an organisation might have an application for managing customer accounts in which US states are stored using two-letter state codes (such as “WA” for Washington), and a second application that stores supplier addresses with a full state name (such as “California”). When data from both of these systems is loaded into the data warehouse, it will contain inconsistent values for states.
  • Duplicate business entities – for example, a customer relationship management system might contain customer records for Jim Corbin, Jimmy Corbin, James Corbin, and J Corbin. If the address and telephone number for these customers are all the same, then it might be reasonable to assume that all of these records relate to the same customer. Of course, it’s also possible that Jim Corbin has a wife named Jennifer, and a son named James, so you must be confident that you have matched the records appropriately before deduplicating the data.
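Before deduplicating, you need a way to score how alike two records are. The sketch below is purely illustrative — DQS uses its own matching policies, not this code — but it shows the idea of flagging likely duplicates with a string-similarity score, here using Python’s standard-library `difflib`:

```python
from difflib import SequenceMatcher

# Candidate customer records that may refer to the same person.
records = ["Jim Corbin", "Jimmy Corbin", "James Corbin", "J Corbin",
           "Jennifer Corbin"]

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def likely_duplicates(names, threshold=0.8):
    """Return pairs of names whose similarity meets the threshold."""
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity(a, b) >= threshold:
                pairs.append((a, b))
    return pairs

for a, b in likely_duplicates(records):
    print(f"{a!r} ~ {b!r} (score {similarity(a, b):.2f})")
```

With a 0.8 threshold, the Jim/Jimmy/J Corbin variants are flagged as probable duplicates while “Jennifer Corbin” is not — which mirrors the caution above: a similarity score alone cannot tell Jim from his wife, so you still need address or phone evidence before merging records.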
Data Quality Services Overview

Data Quality Services is a knowledge-based solution for managing data quality. With Data Quality Services, you can perform the following kinds of data quality management:
  • Data Cleansing – identifying invalid or inconsistent data values and correcting them.
  • Data Matching – finding duplicate data entities.
Data Quality Services is installed from the SQL Server 2014 installation media and consists of the following components:
    • Data Quality Services Server – a service that uses a knowledge base to apply data quality rules to data. The server must be installed on the same instance as the data that you wish to analyse. Two SQL Server catalogues are installed, which you can monitor, maintain, back up, and administer from within SQL Server Management Studio: DQS_MAIN contains the stored procedures, the DQS engine, and published knowledge bases, while DQS_PROJECTS contains data required for knowledge base management and data quality project activities.
    • Data Quality Client – a wizard-based application that data stewards (typically business users) can use to create and manage Data Quality Services knowledge bases and perform data quality tasks. The client can either be installed on the same computer as the DQS server or used remotely.
    • Data Cleansing SSIS Transformation – a data flow transformation for SQL Server Integration Services that you can use to cleanse data as it flows through a data flow pipeline.
What is a Knowledge Base?

Data Quality Services enables you to improve data quality by creating a knowledge base about the data, and then applying the rules in the knowledge base to perform data cleansing and matching. A knowledge base stores all the knowledge related to a specific aspect of the business. For example, you could maintain one knowledge base for a customer database and another knowledge base for a product database. Each knowledge base contains:
  • Domains that define valid values and correction rules for data fields, and
  • Matching policies that define rules for identifying duplicate data entities.
Knowledge bases are usually created and maintained by data stewards, who are often business users with particular expertise in a specific area of the business. Data Quality Services provides a basic knowledge base that includes domains for US address data (such as states and cities), which you can use to learn about Data Quality Services and as a starting point for your own knowledge bases.

What is a Domain?

Domains are central to a Data Quality Services knowledge base. Each domain identifies the possible values and rules for a data field (that is, a column in a dataset). The values for each domain are categorised as:
  • Valid – for example, valid values for a US State domain might include “California” or “CA”
  • Invalid – for example, invalid values for a US State domain might include “8”
  • Error – for example, a common error for a US State domain might be “Calfornia” (with a missing “i”)
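The valid/invalid/error categorisation can be sketched as follows. This is only an illustration, not how DQS is implemented — a plain Python dictionary stands in for the knowledge-base domain, and the mapping of the known misspelling to its correction anticipates the synonym behaviour described next:

```python
# Illustrative stand-in for a DQS domain: known good values, known bad
# values, and known errors mapped to their corrections.
us_state_domain = {
    "valid":   {"California", "CA", "Washington", "WA"},
    "invalid": {"8"},
    "error":   {"Calfornia": "California"},  # misspelling -> correction
}

def classify(value: str) -> str:
    """Return the DQS-style category for a candidate domain value."""
    if value in us_state_domain["valid"]:
        return "valid"
    if value in us_state_domain["error"]:
        return f"error (corrected to {us_state_domain['error'][value]})"
    return "invalid"

print(classify("CA"))         # valid
print(classify("Calfornia"))  # error (corrected to California)
print(classify("8"))          # invalid
```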
Values can be grouped as synonyms. For example, you might group “California,” “CA,” and “Calfornia” as synonyms for California, and you can specify a leading value to which all synonyms should be corrected. So, you could configure the domain to specify that instances of “CA” and “Calfornia” are automatically corrected to “California.” In addition to defining the values for a domain, you can create domain rules that validate new data values. For example, you could create a rule to ensure that all values in an Age domain are numbers, or that all values in an Email Address domain include an “@” character. You can also specify standardisation settings for a text-based domain to enforce correct capitalisation. This enables you to ensure that cleansed text values have consistent formats.

Often, you will create domains to represent the most granular level of your data (for example, First Name), even though the actual unit of storage comprises multiple domains (for instance, Full Name). In this example, you can combine the First Name and Last Name domains to form a Full Name composite domain. Composite domains are also used for address fields, which combine address, city, state, postal code, and country data. Another use of composite domains is a rule that combines data from multiple domains. For example, you can verify that the string “98007” in a Postal Code domain corresponds to the string “Bellevue” in a City domain. Matching can be performed on the individual domains that make up the composite domain, but not on the composite domain itself.

What is a Reference Data Service?

Many data quality problems fall outside the core specialisation of the organisation in which they occur. For example, your organisation might be an Internet retailer whose biggest data problem, and highest unnecessary cost, is shipping goods based on incorrect address data.
You may have made your website as user-friendly as possible, but there might still be an unacceptably high number of incorrectly addressed orders. To cleanse data that is outside the knowledge of your organisation, you can subscribe to third party Reference Data Service (RDS) providers. Using the Windows Azure Data Market, it is straightforward to subscribe to an RDS service and then use this service to validate and cleanse your data. To continue with the example, using the Windows Azure Data Market, you could purchase a subscription to an address verification service. You can then send data to the address verification service so that it verifies and cleanses the data, reducing incorrect address information and, therefore, reducing your postage costs. To use RDS to cleanse your data, you must follow these steps:
  1. Create a free DataMarket account key at the Windows Azure Marketplace
  2. Subscribe to a free or paid-for RDS provider’s service at the Marketplace
  3. Configure the reference data service details in DQS
  4. Map your domain to the RDS service
  5. Use the knowledge base that contains the domain that maps to the RDS service to cleanse the data
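The final cleansing step amounts to sending each suspect value to the provider and applying whatever correction comes back. The `AddressVerifier` class below is entirely hypothetical — a real RDS provider exposes its own schema over HTTPS, authenticated with your DataMarket account key — but it sketches that round trip, reusing the 98007/Bellevue pairing from the composite-domain example above:

```python
class AddressVerifier:
    """Hypothetical stand-in for a subscribed RDS address-verification
    service. A real provider would be called over HTTPS; here a small
    lookup table plays the part of its reference data."""

    KNOWN = {"98007": "Bellevue", "98004": "Bellevue", "98101": "Seattle"}

    def verify(self, postal_code: str, city: str) -> dict:
        expected = self.KNOWN.get(postal_code)
        if expected is None:
            return {"status": "unknown", "city": city}
        if expected == city:
            return {"status": "verified", "city": city}
        return {"status": "corrected", "city": expected}

service = AddressVerifier()
print(service.verify("98007", "Bellevue"))  # status: verified
print(service.verify("98007", "Belevue"))   # status: corrected, city: Bellevue
```

Because each `verify` call maps to a billable request in the pay-per-use model described below, you would typically send only the records your own knowledge base could not cleanse.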
One of the key advantages of using the Azure Data Market to provide DQS services is that the cost of the data service is typically based on the number of times you use the service per month. This allows you to scale up at busy times and reduce costs at quiet times. That’s a lot to take in, isn’t it? I’ll give you some time to digest all of the above, and in my next post, I’ll elaborate on how to create a knowledge base and how to clean data using DQS. Stay tuned!


About the Author:

Adam Keats  

With over 25 years of real-world IT experience, Adam is one of New Horizons’ most senior Database and Software Development trainers. Starting out as a physics teacher, Adam displayed exceptional mathematical and technical capabilities early in his career. He went on to work in a variety of programming and technical management roles within several government agencies, including the Department of Defence and Maritime Patrol. In 1998 Adam found his true calling, gaining MCT status and beginning his technical training career specialising in SQL Server administration, development, BI, and .NET development. Since then he has worked for several training organisations and found a home at New Horizons, where he is now our resident Database and Development specialist. Throughout his tenure at New Horizons, Adam has assisted over 500 students in their endeavours to improve their skills and knowledge and to achieve industry certifications.