May 16, 2014
Hadoop. Hadoop. Hadoop. Big Data. Big Data. Big Data.
How many times have you heard the above terms in the last six months? How many times do you think one or the other, or even both, have been used by someone who has no clue as to what they mean? I reckon you’ll agree that those numbers coincide pretty well.
Unfortunately, terms like these are used quite often without the person using them knowing what they’re about. “Cloud” was a term used in this manner only a short time ago. So we’ve got a new term, or terms, to bandy about. Great! Have you ever heard of “Buzzword Bingo?” I’ve played it from time to time, it’s fun. I’m sure you’ve experienced something similar.

In the meantime though, let’s see if we can unravel some of this stuff. You know that the reason I want to do that is because Microsoft has something for us…am I that transparent? Well, yes, yes I am. Microsoft has some pretty cool “
Big Data” tools for us to use. Yes, I know this is scary, but stay with me. Hopefully, I can manage to explain at least some of this so it’s not so daunting, and believe me, some of this stuff is quite complex.
Firstly, we will need to establish just what “Big Data” is. Companies, individuals, services and even devices are generating huge amounts of data every day, and that’s a trend that is increasing sharply. Social media, photography and video capture, profile data to personalise your online experience and many others, have led to a massive increase in the amount of data available for processing. Organisations are quickly realising the value of using this “Big Data” in much of the same way as they have traditionally used their own internal data for Business Intelligence.
So generally, this “Big Data” is too large or too complex to manage and process in a traditional relational database or even a data warehouse. Systems like
SQL Server 2014 are more than capable of handling many terabytes of data. Some organisations, however, are faced with dealing with multiple petabytes of data in multiple, non-uniform, non-relational formats. Let’s just consider that for a moment. A petabyte is 1024 terabytes. If you counted all the bits in one petabyte at one bit per second, it’d take you 285
million years. That’s a LOT OF DATA and it’s only going to get bigger!
“Big Data” is typified by the so-called “Three V’s,” that is,
a data processing problem can be defined as “Big Data” if the data meets one or more of the following classifications:
- Volume – A huge volume of data must be processed, typically hundreds of terabytes or more.
- Variety – The data is unstructured, or consists of a mix of structured and unstructured data in many formats.
- Velocity – New data is generated at frequent intervals, often as a constant stream of data values.
Now, some examples of typical “Big Data” problems might include:
- Analysing web server logs for high-traffic websites
- Extracting data from social media streams to enable “sentiment analysis”
- Processing high volumes of data generated by sensors or devices to detect anomalies
So, where does
Hadoop fit in?
Hadoop is an open source solution for processing “Big Data” in a distributed cluster of servers. The project itself is managed by Apache and there are numerous distributions of Hadoop available from just as many vendors.
At its core, Hadoop is a cluster of servers running a distributed files system called
HDFS (Hadoop Distributed File System). Each cluster has a ‘Name Node’ that receives incoming requests and coordinates data processing, and one or more Data Nodes that process the data files stored in HDFS. Data is processed using a technique called “Map/Reduce.” Each Data Node initially processes a subset of the data to summarise it i.e. the “Map” phase. The output from each node is collated and aggregated into a single result i.e. the “Reduce” phase.
There are a number of related projects to Hadoop. You may come across one or more of the following:
- Hive – create tabular abstractions over your data in HDFS and use a SQL-like language – HiveQL
- HCatalog – abstract Hive tables from HDFS
- Pig – processing engine to express Map/Reduce as a sequence of steps using a procedural language named Pig Latin
- Oozie – framework for creating automated jobs that coordinate Map/Reduce tasks
- Mahout – a machine learning language for data mining in HDFS
Alrighty, enough yabbering, I promised you that Microsoft has something for us. Indeed it does, in fact, three things.
1. Microsoft has SQL Server Parallel Data Warehouse
This is the big end of town data warehouse. In fact, it comes only as a preinstalled and configured data warehouse
appliance. You’d buy one of these things from the major hardware vendors. These things are massive! They’re basically racks of servers. There is a control node and multiple compute/storage nodes (sound familiar?). Obviously these things can be scaled to meet your own particular requirements. This is great for your on-premises data warehouse for analysis and reporting. Fantastic!
2. Microsoft has created HDInsight
What is it? It’s a cloud based distribution of Hadoop that is available as a service in Windows Azure. Coolio! This means you can bring HUGE amounts of processing power to bear on your BIG DATA! The HDFS is actually based on Windows Azure Blob storage, and you can provision and decommission as many multi-node HDInsight clusters as you need. You only pay for the processing services you’ve actually used. Awesome!
3. Using PolyBase
If you need to use both your SQL Server Parallel Data Warehouse and Windows Azure HDInsight (who
ARE you? NSA or something?), you can use
PolyBase. PolyBase allows you to define tables in SQL Server Parallel Data Warehouse that reference data in HDInsight. You can query both using Transact-SQL syntax. Client applications need not be aware of the actual storage location of the data.
That’s quite a bit to take in. Hopefully, you didn’t TL;DR (too long; didn’t read) me at the beginning and you got this far. If you’re interested in finding out some more, there’s enough to get you started in the new SQL Server 2014 update course,
10977: Updating Your SQL Server Skills to Microsoft SQL Server 2014.
How do your Excel skills stack up?
Test Now