What? Cosmic rays?

 Jun 16, 2015

Okay, so the title. What's that about? I mean, good gravy, how are cosmic rays in anyway relevant to a blog about SQL Server? Srsly?

Have you ever heard of DBCC? No? Okay, it's a Transact-SQL statement. It stands for Database Console Command. Hands up all those who thought it stood for “Database Consistency Checker”. Okay, just a couple of you, pretty good. So, DBCC statements can be grouped into four categories: Maintenance, Miscellaneous, Informational and Validation.

Today, I'll be looking at a maintenance category statement, DBCC CHECKDB. Now, DBCC CHECKDB is a statement that will run a check on the logical and physical integrity of all the objects in the specified database by performing the following operations:

  • DBCC CHECKALLOC on the database – checks the consistency of disk space allocation structures for a specified database.
  • DBCC CHECKTABLE on every table and view in the database – checks the integrity of all the pages and structures that make up the table or indexed view.
  • DBCC CHECKCATALOG on the database – checks for catalog consistency within the specified database.
  • Validate the contents of every indexed view in the database.
  • Validate link-level consistency between table metadata and file system directories and files when storing varbinary(max) data the file system using FILESTREAM.
  • Validate the Service Broker data in the database.

Phew! That's quite a bit for only 12 keystrokes. As you can probably imagine, that's a particulary resource intensive process. Needless to say it can cause performance problems when it's run concurrently with a workload. Some people say, and I know I'm using a “Strawman argument,” that you should rely on the WITH CHECKSUM option when performing a full database backup instead. I beg to differ.

First, some background. SQL Server has some page protection options. I'd prefer to call them page corruption detection options. Mainly because they don't actually protect anything from being corrupted, they'll just let you know when they have been. A subtle difference, but significant none the less.

The default page protection option is the CHECKSUM, which enables page checksums, and is enabled on all new databases since SQL Server 2005 (unless someone turned it off in the model database, that is). This means that when a data file page is written to disk, usually by a CHECKPOINT operation, a simple checksum is calculated for the 8KB page contents and written into the page's header.

Alright, so when a data file page is read from disk into memory, the checksum for the page is recalculated and compared to the value in the page's header. If they don't match, SQL Server knows that something has gone wrong with the I/O subsystem and raises an 824 error.

Additionally, SQL Server will check the page's checksums during a backup if you've specified the WITH CHECKSUM option. If there's a problem here, the backup will fail and the corrupted page will be reported. So far, so good. But wait, there's more.

Hopefully, you can see the problem here. Page checksums detect page corruption caused by the I/O system. But what if the corruption hasn't been caused by the I/O subsystem?

Consider this situation. SQL Server reads a data file into memory from disk. There is nothing wrong with the page at this stage and the checksums verify that. The page is modified via an update statement. Now, before the next checkpoint operation, a cosmic ray strikes a section of a RAM chip. It corrupts the 8KB block holding the modification. The checkpoint occurs, the checksum over the page is calculated and written to the header of the page when it is persisted to the disk.

The page is later read from disk to memory and the checksum is validated, no problems are detected. Only we know there is a problem. The page checksum algorithm is incapable of detecting this kind of error since it didn't occur as a result of a faulty write in the I/O subsystem. Similarly, a backup operation won't detect any problems either for precisely the reasons already stated. By design, page checksums can only detect corruption that occured in the I/O subsystem.

This is where our good friend DBCC CHECKDB comes in. It interprets the page contents by validating row structures, validating column values and checking linkages in indexes. It should always be able to detect a corrupt page, even if the corruption occured in memory.

My advice is this: Use the WITH CHECKSUM option when you're doing your full database backups by all means, but please include some provision for running a regular DBCC CHECKDB. That check doesn't need to be run on a production system either, you could always restore the backup on another server and run consistency checks on that server. Remember, page checksums are fallible.

See, cosmic rays are something we should all be concerned about.
“See” you next time. Cheers.

How do your Excel skills stack up?   

Test Now  

About the Author:

Adam Keats  

With over 25 years of real-world IT experience Adam is one of New Horizons’ most senior Database and Software Development trainers. Starting out as a physics teacher Adam displayed exceptional mathematical and technical capabilities early on in his career. He went on to work in a variety of programming and technical management roles within several government agencies including the Department of Defence and Maritime Patrol. In 1998 Adam found his true calling, gaining MCT status and thus beginning his technical training career specialising in SQL Server administration, development, BI, and .NET development. Since then he has worked for several training organisations and found a home at New Horizons where he is now our resident Database and Development specialist. Throughout his tenure at New Horizons, Adam has assisted over 500 students in their endeavours to improve their skills, knowledge, and to achieve industry certifications.