Managing Hadoop projects: What you need to know to succeed
A number of software tools now make Hadoop somewhat easier to deploy, but channel partners contend there's more progress to be made.
The Apache Software Foundation's Hadoop distributed computing technology focuses on the "big data" problem: the challenge of working with -- and finding meaning in -- very large and cumbersome data sets. The open source framework handles big data tasks by breaking them up into smaller ones. In a Hadoop deployment, the data-crunching task is distributed across the multiple nodes of a computing cluster.
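The divide-and-combine idea can be illustrated with a minimal sketch in plain Python. It does not use Hadoop: local threads stand in for cluster nodes, and the function names (split_into_chunks, process_chunk, distributed_total) are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_chunks):
    """Divide the records into roughly equal slices, one per simulated node."""
    size = max(1, -(-len(records) // num_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def process_chunk(chunk):
    """The 'small task': each simulated node totals only its own slice."""
    return sum(chunk)


def distributed_total(records, num_nodes=4):
    """Fan the work out to the simulated nodes, then combine the partial results."""
    chunks = split_into_chunks(records, num_nodes)
    # Threads are an assumption of this sketch; a real cluster uses separate machines.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partial_totals = list(pool.map(process_chunk, chunks))
    return sum(partial_totals)
```

Calling distributed_total(list(range(1, 101))) splits the hundred numbers into four slices, totals each slice separately and adds the four partial totals together. A real deployment layers data placement, scheduling and fault handling on top of the same pattern.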
The emerging Hadoop channel works with core components such as the Hadoop Distributed File System and MapReduce, a system that distributes processing chores on a Hadoop cluster. Other important components of the Hadoop software stack include Apache's Hive data warehouse.
Building a Hadoop cluster is one thing, but turning it into a data analysis solution requires more work. Channel partners who want to go beyond assembling the basic plumbing of the Hadoop platform -- into Hadoop development, for example -- can tap a number of tools to get the job done. To date, solutions providers have relied mainly on open source offerings and tools from Hadoop distribution vendors such as Cloudera Inc.
But Hadoop specialists said they expect to see greater commercialization of Hadoop tools. Most tools have been geared toward down-in-the-weeds Hadoop developers, as opposed to more mainstream IT personnel. A broader commercial tool set could make projects somewhat less exotic, inspire greater confidence among IT shops, and open a bigger market for the Hadoop channel, according to industry executives.
What's on hand?
Tools for getting a Hadoop cluster up and running are fairly straightforward.
David Cole, a partner at Lunexa LLC, a boutique consultancy based in San Francisco that focuses on big data, said Cloudera Manager has become a tool of choice for running Hadoop clusters.
Cloudera provides Manager as part of its CDH Enterprise Hadoop distribution. A version of Manager that supports up to 50 nodes is included in the free edition of CDH. The complete version of Manager, which supports an unlimited number of hosts, comes with the subscription edition of CDH.
Cole said Lunexa uses Cloudera Manager on its own in-house Hadoop cluster, noting that clients use it as well.
"We found that to be the best tool out there to ... manage your clusters and figure out what jobs are running and utilization [rates]," Cole said.
Mani Chhabra, president of Cloudwick Technologies, a Hayward, Calif., company that specializes in Hadoop and big data services, said the tools for getting a bare-bones cluster operational are available from Hadoop distribution vendors, such as Hortonworks Inc. and MapR Technologies Inc., as well as Cloudera. He said those tools have become a standard, stable way to manage a cluster.
The situation becomes a bit more complicated when it comes to tools for getting data into and out of Hadoop. One option is Hive. This data warehouse infrastructure is built on Hadoop and includes Hive QL (HQL), a query language based on SQL.
Cole said many of Lunexa's enterprise customers use Hive, noting that the software works well for organizations that already possess in-house SQL expertise.
"The leap from SQL to HQL is minimal," he said.
HQL allows for sophisticated analytics, but customized Hadoop solutions call for the ability to create custom MapReduce programs using Java or Python. A few tools are available to support such Hadoop development activities.
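As a rough idea of what a custom MapReduce program in Python can look like, the sketch below counts words using a mapper and a reducer that exchange tab-separated key/value lines. That is the record convention used by Hadoop's streaming utility, which this example assumes as the way Python code would be run on a cluster. The run_locally helper is hypothetical and only imitates the framework's sort step on one machine.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record for every word in the input lines."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts for each word; records must arrive sorted by key."""
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Hypothetical stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))
```

On a cluster, the mapper and reducer would typically be separate scripts reading standard input and writing standard output, with the framework handling the sorting and the movement of records between nodes.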
Cole cited Cloudera's Crunch framework, a Java library, as one tool that makes the coding easier. Crunch, which Cloudera launched as a development project, was accepted into the Apache Incubator last June. The incubator serves as the gateway outside organizations use when they donate code to the Apache Software Foundation.
Cole said his company has used Crunch a few times, adding that it helps developers write custom MapReduce code in pipeline-like fashion. Hadoop solutions are sometimes devised as multistep pipelines. A pipeline may take a raw data set through a series of steps -- including data cleansing and aggregation -- culminating in data analysis.
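The shape of such a multistep pipeline can be sketched in plain Python. This is not Crunch's API, which is a Java library. The step names (cleanse, aggregate, analyze) and the sample record layout of region and amount are assumptions made for illustration.

```python
from collections import defaultdict


def cleanse(rows):
    """Step 1: normalize region names and drop rows that cannot be used."""
    for region, amount in rows:
        region = (region or "").strip().lower()
        try:
            value = float(amount)
        except (TypeError, ValueError):
            continue  # skip rows whose amount is missing or not numeric
        if region:
            yield region, value


def aggregate(rows):
    """Step 2: total the amounts for each region."""
    totals = defaultdict(float)
    for region, value in rows:
        totals[region] += value
    return dict(totals)


def analyze(totals):
    """Step 3: report the region with the largest total, or None if empty."""
    return max(totals, key=totals.get) if totals else None


def run_pipeline(raw_rows):
    """Chain the steps so each one consumes the previous step's output."""
    return analyze(aggregate(cleanse(raw_rows)))
```

Each step accepts the previous step's output, so a step can be replaced or reordered without touching the others. On a cluster, each stage would be planned as one or more MapReduce jobs instead of an in-memory function call.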
Crunch has sped up the Hadoop development process while helping to smooth complex data transformation tasks, according to Cole.
Lunexa also uses Karmasphere Inc.'s development environment to create custom MapReduce programs. Cole said Karmasphere differs from Crunch in that the former is a downloadable software tool, whereas Crunch is more of an API-like environment.
In another tool foray, Lunexa participated in Cloudera's Impala beta program. Impala aims to improve query performance, and Cole said his company has seen performance gains using the technology. He said that the most dramatic improvements were found in low-latency queries, which are fairly common with business intelligence tools.
Chhabra said open source tools suffice for much of the work his company needs to do. But those tools, he noted, require people with specialized knowledge to use them. Tools haven't reached the point where regular IT groups can deal with them, and that's particularly the case for Hadoop's application and security layers, he added.
"The whole ecosystem has to mature," he said. "Nothing commercial has come up in a big way yet."
Still, Chhabra said he believes that commercial vendors are moving toward filling the tool gaps. He cited examples including MicroStrategy Inc.'s link-up with Hadoop and Microsoft Corp.'s Hadoop integration, which lets users grab data from a cluster and work with it in an Excel spreadsheet. He believes that much of the essential integration will be in place by 2015.
Attunity in January released a file replication solution for Hadoop. The company's technology aims to quickly move data in and out of a cluster. Attunity doesn't require the use of other software offerings such as Hive. Matt Benati, Attunity's vice president of global marketing, said that approach simplifies matters for customers, making the company's products acceptable to a broader audience.
"We can move data directly into Hadoop and pull it out of Hadoop," Benati said. "We really don't need any other tool to do that."
As for channel activities, Attunity has a partnership with Hortonworks and also works globally with resellers that focus on business intelligence, according to Benati.
Dataguise, for its part, offers technology that protects data as it moves into Hadoop, while it is stored and when it is extracted by data analysis tools, noted Manmeet Singh, the company's CEO. Dataguise's DG for Hadoop also provides access control.
Singh said Dataguise provides security measures that Hadoop distribution vendors haven't traditionally built into their software.
"They are not even looking at security at this point," he said.
In January, Dataguise reported DG for Hadoop's certification for use with MapR's Hadoop distribution. The company announced certification for Cloudera's distribution late last year and disclosed a partnering arrangement with Hortonworks earlier in 2012.
Singh said Dataguise works with resellers, including Compuware Corp.
Cole, meanwhile, said he is waiting to see how traditional extract, transform and load (ETL) vendors will evolve in the Hadoop space. He said enterprises have invested considerable sums in ETL vendors, such as Ab Initio Software Corp., IBM Corp. (InfoSphere DataStage) and Informatica Corp.
"So, customers who have taken the plunge into Hadoop would love to leverage their existing ETL investments," Cole explained.
He said he would like to see ETL vendors release technology that would let customers create ETL workflows that generate custom MapReduce code. For now, he said, the vendors piggyback on Hive. Such arrangements are fine, he added, but they don't represent the best scenario.
"Some things are much trickier and not as efficient to do in Hive versus writing custom MapReduce," he said.
He said letting customers generate MapReduce using familiar ETL products would get them more comfortable working with Hadoop.
John Moore has written on business and technology topics for more than 25 years.