May 16, 2014
Hadoop. Hadoop. Hadoop. Big Data. Big Data. Big Data. How many times have you heard the above terms in the last six months? How many times do you think one or the other, or even both, have been used by someone who has no clue as to what they mean? I reckon you’ll agree that those numbers coincide pretty well. Unfortunately, terms like these are often used by people who don’t know what they’re about. “Cloud” was a term used in this manner only a short time ago. So we’ve got a new term, or terms, to bandy about. Great! Have you ever heard of “Buzzword Bingo”? I’ve played it from time to time, and it’s fun. I’m sure you’ve experienced something similar.

In the meantime though, let’s see if we can unravel some of this stuff. You know that the reason I want to do that is because Microsoft has something for us…am I that transparent? Well, yes, yes I am. Microsoft has some pretty cool “Big Data” tools for us to use. Yes, I know this is scary, but stay with me. Hopefully, I can manage to explain at least some of this so it’s not so daunting, and believe me, some of this stuff is quite complex.

Firstly, we need to establish just what “Big Data” is. Companies, individuals, services and even devices are generating huge amounts of data every day, and that trend is increasing sharply. Social media, photography and video capture, profile data used to personalise your online experience, and many other sources have led to a massive increase in the amount of data available for processing. Organisations are quickly realising the value of using this “Big Data” in much the same way as they have traditionally used their own internal data for Business Intelligence.

Generally, this “Big Data” is too large or too complex to manage and process in a traditional relational database or even a data warehouse. Systems like SQL Server 2014 are more than capable of handling many terabytes of data.
Some organisations, however, are faced with dealing with multiple petabytes of data in multiple, non-uniform, non-relational formats. Let’s just consider that for a moment. A petabyte is 1024 terabytes. If you counted all the bits in one petabyte at one bit per second, it’d take you 285 million years. That’s a LOT OF DATA and it’s only going to get bigger!

“Big Data” is typified by the so-called “Three V’s”. That is, a data processing problem can be defined as “Big Data” if the data meets one or more of the following classifications:
- Volume – A huge volume of data must be processed, typically hundreds of terabytes or more.
- Variety – The data is unstructured, or consists of a mix of structured and unstructured data in many formats.
- Velocity – New data is generated at frequent intervals, often as a constant stream of data values.
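The petabyte figure quoted earlier is easy to check for yourself. Here is a quick back-of-the-envelope sketch. It assumes binary units (so a terabyte is 1024^4 bytes) and an average year of 365.25 days; the function name is just made up for illustration.

```python
# Back-of-the-envelope check of the "285 million years" figure.
# Assumption: binary units, so 1 PB = 1024 TB = 1024**5 bytes.
BYTES_PER_TERABYTE = 1024 ** 4
BYTES_PER_PETABYTE = 1024 * BYTES_PER_TERABYTE
BITS_PER_PETABYTE = 8 * BYTES_PER_PETABYTE

# Assumption: an average year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def years_to_count_bits(bits, bits_per_second=1):
    """Return how many years it takes to count `bits` at a steady rate."""
    return bits / bits_per_second / SECONDS_PER_YEAR


years = years_to_count_bits(BITS_PER_PETABYTE)
print(f"About {years / 1e6:.0f} million years")
```

Swap in decimal units (1000 instead of 1024) and the answer drops a little, but it stays in the hundreds of millions of years either way.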
Typical examples of Big Data problems include:
- Analysing web server logs for high-traffic websites
- Extracting data from social media streams to enable “sentiment analysis”
- Processing high volumes of data generated by sensors or devices to detect anomalies
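To make the sensor scenario a bit more concrete, here is a tiny sketch of one common way to flag anomalies: measure how far each reading sits from the average, in standard deviations (a "z-score"). This is only an illustration on a handful of made-up readings held in memory. The function name, the threshold and the data are all assumptions, and a real Big Data solution would run this kind of logic across a cluster rather than on one machine.

```python
from statistics import mean, stdev


def find_anomalies(readings, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds `threshold`.

    The threshold of 2.0 is an arbitrary choice for this small sample.
    """
    if len(readings) < 2:
        return []  # not enough data to estimate a spread
    average = mean(readings)
    spread = stdev(readings)
    if spread == 0:
        return []  # every reading is identical, so nothing stands out
    return [
        (index, value)
        for index, value in enumerate(readings)
        if abs(value - average) / spread > threshold
    ]


# Made-up temperature readings with one obvious spike.
temperatures = [20.1, 19.8, 20.3, 20.0, 35.7, 19.9, 20.2, 20.1]
print(find_anomalies(temperatures))
```

Keep in mind that a single large spike also inflates the average and the spread, so with very small samples a z-score can only get so high. That is why the threshold here is fairly low.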
Tools commonly used with Hadoop include:
- Hive – create tabular abstractions over your data in HDFS and query them using a SQL-like language, HiveQL
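- HCatalog – a table and storage management layer that exposes Hive table definitions to other tools, such as Pig and Map/Reduce, so they don’t need to know how or where the data is stored in HDFS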
- Pig – processing engine to express Map/Reduce as a sequence of steps using a procedural language named Pig Latin
- Oozie – framework for creating automated jobs that coordinate Map/Reduce tasks
- Mahout – a machine learning library for data mining over data in HDFS