Implementing Big Data Solutions in SQL Server 2014

May 16, 2014

Hadoop. Hadoop. Hadoop. Big Data. Big Data. Big Data. How many times have you heard those terms in the last six months? How many times do you think one or the other, or even both, have been used by someone who has no clue what they mean? I reckon you'll agree that those numbers coincide pretty well. Unfortunately, terms like these are used quite often without the person using them knowing what they're about; "Cloud" was a term used in this manner only a short time ago. So we've got a new term, or terms, to bandy about. Great! Have you ever heard of "Buzzword Bingo"? I've played it from time to time; it's fun. I'm sure you've experienced something similar.


In the meantime though, let's see if we can unravel some of this stuff. You know the reason I want to do that is because Microsoft has something for us… am I that transparent? Well, yes, yes I am. Microsoft has some pretty cool "Big Data" tools for us to use. Yes, I know this is scary, but stay with me. Hopefully I can explain at least some of it so it's not so daunting, because believe me, some of this stuff is quite complex.

Firstly, we need to establish just what "Big Data" is. Companies, individuals, services and even devices generate huge amounts of data every day, and that trend is increasing sharply. Social media, photography and video capture, profile data used to personalise your online experience, and many other sources have led to a massive increase in the amount of data available for processing. Organisations are quickly realising the value of using this "Big Data" in much the same way as they have traditionally used their own internal data for Business Intelligence.

Generally speaking, "Big Data" is data that is too large or too complex to manage and process in a traditional relational database, or even a data warehouse. Systems like SQL Server 2014 are more than capable of handling many terabytes of data. Some organisations, however, must deal with multiple petabytes of data in multiple, non-uniform, non-relational formats. Let's just consider that for a moment. A petabyte is 1,024 terabytes. If you counted all the bits in one petabyte at one bit per second, it'd take you about 285 million years (one petabyte is 8 × 2^50 ≈ 9 × 10^15 bits, and at roughly 3.15 × 10^7 seconds per year that works out to about 2.85 × 10^8 years). That's a LOT OF DATA, and it's only going to get bigger!

"Big Data" is typified by the so-called "Three V's." That is, a data processing problem can be defined as "Big Data" if the data meets one or more of the following classifications:
  1. Volume – A huge volume of data must be processed, typically hundreds of terabytes or more.
  2. Variety – The data is unstructured, or consists of a mix of structured and unstructured data in many formats.
  3. Velocity – New data is generated at frequent intervals, often as a constant stream of data values.
Now, some examples of typical “Big Data” problems might include:
  • Analysing web server logs for high-traffic websites
  • Extracting data from social media streams to enable “sentiment analysis”
  • Processing high volumes of data generated by sensors or devices to detect anomalies
So, where does Hadoop fit in? Hadoop is an open source solution for processing "Big Data" in a distributed cluster of servers. The project itself is managed by Apache, and there are numerous distributions of Hadoop available from just as many vendors. At its core, Hadoop is a cluster of servers running a distributed file system called HDFS (the Hadoop Distributed File System). Each cluster has a Name Node, which receives incoming requests and coordinates data processing, and one or more Data Nodes, which process the data files stored in HDFS. Data is processed using a technique called "Map/Reduce": each Data Node first processes a subset of the data to summarise it (the "Map" phase), then the output from each node is collated and aggregated into a single result (the "Reduce" phase). There are a number of projects related to Hadoop; you may come across one or more of the following:
  • Hive – creates tabular abstractions over data in HDFS that you can query with a SQL-like language, HiveQL (there's a short sketch after this list)
  • HCatalog – a table and storage management layer that exposes Hive's tabular metadata to other Hadoop tools
  • Pig – a processing engine that expresses Map/Reduce as a sequence of steps written in a procedural language named Pig Latin
  • Oozie – a workflow framework for creating automated jobs that coordinate Map/Reduce tasks
  • Mahout – a machine learning library for data mining over data in HDFS
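To make Hive and Map/Reduce a little more concrete, here's a minimal HiveQL sketch along the lines of the web server log example earlier. Everything in it is hypothetical (the HDFS path, the table name and the column layout); the point is that Hive lets you lay a table over plain files in HDFS and query them, compiling the query into Map/Reduce jobs for you.

  -- A minimal HiveQL sketch: expose raw, tab-delimited web server logs
  -- already sitting in HDFS as a queryable table. The path and column
  -- layout are hypothetical.
  CREATE EXTERNAL TABLE weblogs (
      log_date    STRING,
      ip_address  STRING,
      url         STRING,
      status      INT
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
  LOCATION '/data/weblogs';

  -- Hive compiles this into Map/Reduce for you: scanning and summarising
  -- on each Data Node is the "Map" phase; collating the per-node counts
  -- into a single result is the "Reduce" phase.
  SELECT url, COUNT(*) AS hits
  FROM weblogs
  GROUP BY url
  ORDER BY hits DESC;

No Java, no hand-written Map/Reduce code; if you can write a GROUP BY, you can put a Hadoop cluster to work.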
Alrighty, enough yabbering. I promised you that Microsoft has something for us, and indeed it does; in fact, three things:
  1. SQL Server Parallel Data Warehouse – This is the big end of town. It comes only as a preinstalled, preconfigured data warehouse appliance that you'd buy from one of the major hardware vendors. These things are massive: basically racks of servers, with a control node and multiple compute/storage nodes (sound familiar?). They can be scaled to meet your particular requirements, which makes them great for on-premises data warehousing, analysis and reporting. Fantastic!
  2. HDInsight – What is it? It's a cloud-based distribution of Hadoop, available as a service in Windows Azure. Coolio! This means you can bring HUGE amounts of processing power to bear on your BIG DATA! The HDFS layer is actually backed by Windows Azure Blob storage, and you can provision and decommission as many multi-node HDInsight clusters as you need, paying only for the processing you've actually used. Awesome!
  3. PolyBase – If you need to use both your SQL Server Parallel Data Warehouse and Windows Azure HDInsight (who ARE you? NSA or something?), you can use PolyBase. PolyBase allows you to define tables in SQL Server Parallel Data Warehouse that reference data in HDInsight, and to query both using Transact-SQL syntax. Client applications need not be aware of where the data is actually stored (there's a rough sketch of the pattern at the end of this post).
That's quite a bit to take in. Hopefully you didn't TL;DR (too long; didn't read) me at the beginning and you got this far. If you're interested in finding out more, there's enough to get you started in the new SQL Server 2014 update course, 10977: Updating Your SQL Server Skills to Microsoft SQL Server 2014.
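As promised, here's a rough Transact-SQL sketch of that PolyBase pattern from point 3. A word of caution: this is illustrative only. The exact syntax varies between Parallel Data Warehouse releases and later SQL Server versions, and the data source, storage account, paths, column layout and table names below are all made up for the example.

  -- Tell the warehouse where the Hadoop/HDInsight data lives.
  -- (Hypothetical storage account and container.)
  CREATE EXTERNAL DATA SOURCE AzureBlobStore
  WITH (
      TYPE = HADOOP,
      LOCATION = 'wasbs://logs@mystorageaccount.blob.core.windows.net'
  );

  -- Describe how the files are laid out.
  CREATE EXTERNAL FILE FORMAT TabDelimited
  WITH (
      FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '\t')
  );

  -- The external table is just metadata; the rows themselves stay out
  -- in HDInsight/blob storage.
  CREATE EXTERNAL TABLE dbo.WebLogs (
      log_date    DATETIME2,
      ip_address  VARCHAR(45),
      url         VARCHAR(400),
      status      INT
  )
  WITH (
      LOCATION = '/weblogs/',
      DATA_SOURCE = AzureBlobStore,
      FILE_FORMAT = TabDelimited
  );

  -- Plain Transact-SQL from here on.
  SELECT url, COUNT(*) AS hits
  FROM dbo.WebLogs
  GROUP BY url;

The pay-off is that last query: it's ordinary Transact-SQL, so existing client applications and reporting tools can use the external table without knowing (or caring) that the rows live in HDInsight.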


About the Author:

Adam Keats  

With over 25 years of real-world IT experience, Adam is one of New Horizons' most senior Database and Software Development trainers. Starting out as a physics teacher, Adam displayed exceptional mathematical and technical capabilities early in his career. He went on to work in a variety of programming and technical management roles within several government agencies, including the Department of Defence and Maritime Patrol. In 1998 Adam found his true calling, gaining MCT status and beginning a technical training career specialising in SQL Server administration, development, BI, and .NET development. Since then he has worked for several training organisations and found a home at New Horizons, where he is now our resident Database and Development specialist. Throughout his tenure at New Horizons, Adam has assisted over 500 students in their endeavours to improve their skills and knowledge and to achieve industry certifications.
