Big Data: How IT contractors can leverage it for their own ends

Big Data is one of the big buzzwords for 2013. It’s being described by McKinsey & Company as no less than the “next frontier for innovation, competition and productivity.”

But it’s not just on the lips of consulting giants – in fact, everyone is talking about Big Data: TV documentaries are being made about it, salaries for its practitioners are going through the roof and, as we all know, where there’s a buzz and big bucks to be made, a training course is sure to follow, writes Edward Jones of Firebrand Training.

A brief history of Big Data

The origins of Big Data go back to the ’90s and the huge IT projects in government, science, the military, and in financial services after its Big Bang. It’s in these areas that we first started to see data collected in huge quantities, and it was also in these areas that there was the money and the brains to do something with all that data, as dealing with it required fast, prohibitively expensive supercomputers and specialised software. And that’s where it would have remained had it not been for the internet and business intelligence (BI) systems.

In the early part of this century, internet businesses like Google and Amazon realised that they had a problem. What Google needed was to be able to search billions of web pages, at the same time intelligently sell adverts based on the content of the page to hundreds of thousands of businesses at the best possible rate, and then display those adverts next to the search results within tens of milliseconds. Ideally they wanted to do this in real time, and they needed to do it cost-effectively, preferably without spending huge amounts on databases and supercomputers.

At the same time as Google was hitting this problem, the emerging science of Business Analytics was running into a similar one. Big multinational businesses were starting to collect data on a huge scale: electronic tills and loyalty cards produced gigabytes of data on customers, while the web produced even more data on what customers searched for and for how long - and that was before they thought about adding data from their CRM and ERP systems. The upshot was that all this data was simply proving too much for standard BI systems.

The brains and money of Silicon Valley were put into action to solve these problems, and in 2004 Google published a paper describing a parallel processing solution known as MapReduce. With this new approach Google was able to split a huge dataset, and the computation on it, into smaller pieces running on a large array of standard and therefore cheap x86 machines - the Map step - and then gather the partial results and combine them into the final answer - the Reduce step.

The birth of Hadoop – a spark

Seeing the advantages this could bring, Doug Cutting and Mike Cafarella built an open-source implementation of Google’s MapReduce and the Google File System (GFS) in 2005, and the project moved under the Apache umbrella after Cutting joined Yahoo. Named Hadoop, in tribute to Cutting’s son's toy elephant, it was extended to become the Hadoop kernel, MapReduce and the Hadoop Distributed File System (HDFS). It was at this point that the solution started to spread like wildfire, and this is where the training starts to come in.

While Big Data is becoming much more accessible – thanks to cloud services, and particularly systems like Amazon’s EC2/S3 where the hardware and systems can be ‘built’ in minutes - it’s still a relatively complex process to port a problem to Hadoop, and it requires a skill-set that cannot be learnt from a book or a university degree. In short, Hadoop eats PhDs for breakfast, and if you want to be an expert you’re going to need to get on a training course - and there are already a reasonable number to choose from.
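To give a flavour of what porting a problem to Hadoop actually involves, here is a rough sketch of the canonical word-count job written against Hadoop’s Java MapReduce API (class names and paths here are illustrative, not taken from any particular course): the Map step emits a (word, 1) pair for every word in its slice of the input, and the Reduce step sums the pairs for each word.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: each worker turns its slice of the input into (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.length() > 0) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce step: all pairs for the same word arrive together and are summed into a total.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

    // Driver: wires the two steps together and points them at HDFS input/output paths.
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Submit it with something like hadoop jar wordcount.jar WordCount /input /output and the framework takes care of splitting the input across the cluster, shuffling the intermediate pairs and re-running failed tasks - and it’s exactly that plumbing, rather than the forty-odd lines above, that the courses below spend most of their time on.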

Big Data Courses

Most of the current crop of training courses are centred on developers - although some for administrators do exist - and look at porting data and problems to the Hadoop framework (Hadoop kernel, MapReduce and HDFS), as well as using many of the other components such as Ambari, Avro, Flume, HBase, HCatalog, Hive, Jaql, Mahout, Pig and ZooKeeper, and tools such as Oozie and Sqoop.

Hadoop-accredited courses are available from Hortonworks – a company formed by the key architects, builders and operators from the original Yahoo Hadoop software engineering group. The company offers two accreditations: the Hortonworks Certified Apache Hadoop Developer (HCAHD) and the Hortonworks Certified Apache Hadoop Administrator (HCAHA).

If you’re looking for a big name then Intel runs courses in Hadoop and offers the Intel Certification for Apache Hadoop Professional (ICAHP), which covers the Intel distribution of Hadoop and HBase - but you’ll need to go to the US to do a course. Alternatively there’s IBM’s Big Data University initiative, which offers a mix of free and paid courses on Hadoop, Amazon AWS and IBM’s InfoSphere. It has courses with tests but offers no formal certification.

Other courses are based around third-party distributions of Hadoop such as Cloudera’s, which claims to be the most widely deployed distribution of Hadoop - and its Cloudera Certified Developer for Apache Hadoop (CCDH) and Cloudera Certified Administrator for Apache Hadoop (CCAH) are industry-recognised Hadoop accreditations. As with most accredited training, the tests include real-life scenarios that are unlikely to appear in the courseware and will require an advanced level of Apache Hadoop knowledge. It’s also worth noting that while Cloudera certifications last for the lifetime of the individual, each is tied to a specific release of the Cloudera distribution and needs to be updated as the distribution is updated.

A question of speed…

At the start of this article we said that the Big Data solution everyone is looking for is one that delivers in real time, and Hadoop isn’t that solution. Hadoop is quick, but essentially it’s a batch system. If you want to analyse your data in real time then you need Storm. Developed at BackType as a way of dealing with huge amounts of searches on the fly, and released as an open-source tool by Twitter in 2011 after it acquired the company, Storm - like Hadoop - is complicated and needs a skill-set that as yet isn’t taught in any university. Unfortunately there are, as yet, no official training courses either, so the only way to learn Storm is through trial and error.
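For the trial-and-error inclined, here is a rough sketch of what a Storm topology looks like in Java, assuming the backtype.storm API as open-sourced by Twitter. The spout here is a stand-in that emits random words - a production system would read from a message queue - and the bolt keeps a running count per word as tuples stream through.

import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class StreamingWordCount {

    // Stand-in data source: a real spout would pull messages from a queue.
    public static class RandomWordSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random random = new Random();
        private final String[] words = {"big", "data", "hadoop", "storm"};

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values(words[random.nextInt(words.length)]));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Keeps a running count per word, updated as each tuple streams through.
    public static class RollingCountBolt extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<String, Integer>();

        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getStringByField("word");
            Integer current = counts.get(word);
            int count = (current == null) ? 1 : current + 1;
            counts.put(word, count);
            collector.emit(new Values(word, count));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new RandomWordSpout(), 1);
        // fieldsGrouping routes every tuple for the same word to the same bolt instance.
        builder.setBolt("counts", new RollingCountBolt(), 4)
               .fieldsGrouping("words", new Fields("word"));

        // Run in-process for testing; StormSubmitter.submitTopology() deploys to a cluster.
        new LocalCluster().submitTopology("streaming-word-count", new Config(),
                builder.createTopology());
    }
}

The contrast with the Hadoop example earlier is the point: there is no input file and no final answer, just a topology that runs indefinitely and updates its counts as each new tuple arrives.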

A final word

While there’s a shortage of Big Data experts in the West, there’s growing interest from businesses in the East, who may be looking to corner the market. A quick look at Google Trends shows that interest in Big Data is growing rapidly, and the country searching for it the most is India. It therefore seems fair to say that while the West jostles among itself for position, it is the subcontinent that is gaining ground on the ‘next frontier for innovation’.

Jun 13, 2013