Friday, March 8, 2013

Pentaho Data Integration 4.4 and Hadoop 1.0.3

While working with a few new Hadoop-based technologies (blog posts to come later), the need arose to get Pentaho Data Integration (PDI) and its Big Data plugin (source available on GitHub) working with an Apache Hadoop 1.0.3 cluster.  Currently, PDI 4.4 only supports the following distributions (and any distributions compatible with them):


  • Apache Hadoop 0.20.x (hadoop-20)
  • Cloudera CDH3u4 (cdh3u4)
  • Cloudera CDH4 (cdh4)
  • MapR (mapr)


The values in parentheses in the list above are the folder names under the Big Data plugin's "hadoop-configurations", each of which contains JARs and other resources needed to run PDI against a particular distribution.  To select a distribution for PDI to use, you edit the plugin.properties file in the Big Data plugin's root folder and set the "active.hadoop.configuration" property to one of the folder names above.  The default setting is for Apache Hadoop 0.20.x:

active.hadoop.configuration=hadoop-20

Apache Hadoop 1.0.3 is not compatible with the Apache Hadoop 0.20.x line, and thus PDI doesn't work with 1.0.3 out-of-the-box.  So I set out to find a way to make that happen.

First, I simply copied the hadoop-20 folder to a "hadoop-103" folder in the same directory (pentaho-big-data-plugin/hadoop-configurations/).  Then I replaced the following JARs in the client/ subfolder with the versions from the Apache Hadoop 1.0.3 distribution:

commons-codec-<version>.jar
hadoop-core-<version>.jar

and I added the following JAR from the Hadoop 1.0.3 distribution to the client/ subfolder as well:

commons-configuration-<version>.jar

Then I changed the property in plugins.properties to point to my new folder:

active.hadoop.configuration=hadoop-103

Then I started PDI and was able to use steps like Hadoop Copy Files and Pentaho MapReduce (see the Wiki for How-Tos).

NOTE: I didn't try to get all functionality working or tested.  Specifically, I didn't try anything related to Hive, HBase, Sqoop, or Oozie.  For Hive, I'm hoping the PDI client will work against any Hive server running on an Apache Hadoop 0.20.x cluster, or any compatible configuration.  If I test any of these Hadoop technologies, I will update this blog post.

If you try this procedure (for 1.0.3, 1.0.x, or any other Hadoop distribution), let me know if it works for you, especially if you had to do anything I haven't listed here :)  Cheers!



1 comment:

  1. Hi Matt, your configuration works also with 1.0.4.
    Thanks,
    Davide

    ReplyDelete