Thursday, March 14, 2013

Content Metadata UDJC step (using Apache Tika)

I recently stumbled across the Apache Tika project, which is a content analysis toolkit that offers great capabilities such as extracting metadata from various documents.  Depending on the document type, various kinds of metadata are available.  Some examples of metadata include MIME type, last modified date, author, publisher, etc.

I think (more) content analysis would be a great capability to add to the Pentaho suite (especially Pentaho Data Integration), so I set out to write a quick UDJC step using Tika, followed by a sample transformation that uses the step to extract document metadata.

The first thing I noticed when I started writing the UDJC code to interface with Tika is that most of the useful code for recognizing document types and outputting to various formats is buried in private members of a class called TikaCLI.  It appears that Tika, in its current state, is meant to be used as a command-line tool that can be extended by adding your own content types, output formats, etc.  However, for this test I just wanted to be able to use Tika programmatically from my code.  Since Tika is licensed under Apache 2.0, I simply grabbed the sections of code I needed from TikaCLI and pasted them into my UDJC step.

The actual UDJC step processing basically does the following:

  1. Reads in a filename or URL and converts it (if necessary) to a URL
  2. Calls Tika methods to extract metadata from the document at the URL
  3. For each metadata item (a property/value pair), creates a new row and adds the property and value
  4. If Tika throws any errors and the step is connected to an error-handling step, sends the error row(s) there
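
Steps 1 and 3 above can be sketched in plain Java. This is only an illustration, not the code from my UDJC step: the class and method names (TikaRowSketch, toUrl, toRows) are made up for this example, and a plain Map stands in for the metadata that Tika would return, so the sketch runs without Tika on the classpath.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; names are assumptions for illustration only.
class TikaRowSketch {

    // Step 1: accept either a URL string or a local filename and return a URL.
    static URL toUrl(String fileOrUrl) throws MalformedURLException {
        try {
            URI uri = new URI(fileOrUrl);
            // A scheme longer than one character is treated as a real URL scheme;
            // a single letter is more likely a Windows drive such as "C:".
            if (uri.getScheme() != null && uri.getScheme().length() > 1) {
                return uri.toURL();
            }
        } catch (URISyntaxException e) {
            // Not a valid URI (for example, it contains spaces): treat it as a path.
        }
        return Paths.get(fileOrUrl).toUri().toURL();
    }

    // Step 3: emit one row per metadata item as {document, property, value}.
    // The Map is a stand-in for whatever metadata the extraction call produced.
    static List<String[]> toRows(String document, Map<String, String> metadata) {
        List<String[]> rows = new ArrayList<>();
        for (Map.Entry<String, String> item : metadata.entrySet()) {
            rows.add(new String[] { document, item.getKey(), item.getValue() });
        }
        return rows;
    }
}
```

In the real step, the map would be filled from Tika's Metadata object (step 2), and any exception thrown there would be routed to the error-handling hop (step 4).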

I ran the sample transformation on my Downloads directory.  Here is a snippet of the output:

If you know ahead of time which metadata properties you want, you can use a Row Denormaliser step to turn the properties into field names, with their values as the values of those fields.  This reduces the amount of output, since the denormaliser outputs one row per document, whereas the UDJC step outputs one row per metadata property per document.  For my example transformation (see above), I chose the "Content-Type" property to denormalise.  Here is the output snippet corresponding to the same run as above:
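
To make the row-count difference concrete, here is a small Java sketch of the denormalising idea. It is not how the Row Denormaliser step is implemented; the class and method names are invented for this example, and it handles a single property only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of denormalising; not PDI code.
class DenormaliseSketch {

    // Input: one row per metadata property, shaped as {document, property, value}.
    // Output: one entry per document, holding the value of the wanted property
    // (or null if the document did not have that property).
    static Map<String, String> denormalise(List<String[]> rows, String wantedProperty) {
        Map<String, String> perDocument = new LinkedHashMap<>();
        for (String[] row : rows) {
            String document = row[0];
            if (!perDocument.containsKey(document)) {
                perDocument.put(document, null); // make sure every document appears once
            }
            if (wantedProperty.equals(row[1])) {
                perDocument.put(document, row[2]);
            }
        }
        return perDocument;
    }
}
```

Three input rows for two documents collapse to two output entries here, which is the same reduction the denormaliser gives you in the transformation.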

Tika does a lot more than just metadata extraction.  It can extract text from document formats such as Microsoft Word and PDF, and it can even guess the language (English or French, for example) of the content.  Adding these features to PDI would be a great thing, and if I ever get the time, I will create a proper "Analyze Content" step, using as many of Tika's features as I can pack in :)  We could even integrate the Automatic Documentation Output functionality by adding content recognizers and such for PDI artifacts like jobs and transformations.

The sample transformation is on GitHub here.  As always, I welcome your questions, comments, and suggestions. If you try this out, let me know how it works for you. Cheers!

Friday, March 8, 2013

Pentaho Data Integration 4.4 and Hadoop 1.0.3

While working with a few new Hadoop-based technologies (blog posts to come later), the need arose to get Pentaho Data Integration (PDI) and its Big Data plugin (source available on GitHub) working with an Apache Hadoop 1.0.3 cluster.  Currently, PDI 4.4 only supports the following distributions (and any distributions compatible with them):

  • Apache Hadoop 0.20.x (hadoop-20)
  • Cloudera CDH3u4 (cdh3u4)
  • Cloudera CDH4 (cdh4)
  • MapR (mapr)

The values in parentheses in the list above are the folder names under the Big Data plugin's "hadoop-configurations" folder, each of which contains JARs and other resources needed to run PDI against a particular distribution.  To select a distribution for PDI to use, you edit the file in the Big Data plugin's root folder and set the "active.hadoop.configuration" property to one of the folder names above.  The default setting is for Apache Hadoop 0.20.x:

active.hadoop.configuration=hadoop-20

Apache Hadoop 1.0.3 is not compatible with the Apache Hadoop 0.20.x line, and thus PDI doesn't work with 1.0.3 out of the box.  So I set out to find a way to make that happen.

First, I simply copied the hadoop-20 folder to a "hadoop-103" folder in the same directory (pentaho-big-data-plugin/hadoop-configurations/).  Then I replaced the following JARs in the client/ subfolder with the versions from the Apache Hadoop 1.0.3 distribution:


and I added the following JAR from the Hadoop 1.0.3 distribution to the client/ subfolder as well:


Then I changed the "active.hadoop.configuration" property to point to my new folder:

active.hadoop.configuration=hadoop-103

Then I started PDI and was able to use steps like Hadoop Copy Files and Pentaho MapReduce (see the Wiki for How-Tos).
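
If you would rather script the folder copy and the property change than do them by hand, something like the following Java sketch could work. I did these steps manually, so treat this as an untested-against-PDI illustration: the class and method names are invented, the path of the properties file is passed in by the caller, and the JAR replacement step is not covered. Note that java.util.Properties rewrites the whole file on store, so any comments in it are lost.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Properties;
import java.util.stream.Stream;

// Hypothetical helper; names and structure are assumptions for illustration.
class HadoopConfigSketch {

    // Recursively copy a configuration folder (e.g. hadoop-20 to hadoop-103).
    static void copyTree(Path source, Path target) throws IOException {
        try (Stream<Path> paths = Files.walk(source)) {
            // Files.walk visits a directory before its contents, so parent
            // directories always exist by the time a file is copied.
            for (Path path : (Iterable<Path>) paths::iterator) {
                Path destination = target.resolve(source.relativize(path).toString());
                if (Files.isDirectory(path)) {
                    Files.createDirectories(destination);
                } else {
                    Files.copy(path, destination, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    // Set active.hadoop.configuration in the given properties file.
    static void setActiveConfiguration(Path propertiesFile, String folderName) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(propertiesFile)) {
            props.load(in);
        }
        props.setProperty("active.hadoop.configuration", folderName);
        try (OutputStream out = Files.newOutputStream(propertiesFile)) {
            props.store(out, null);
        }
    }
}
```

After running both methods, you would still need to swap in the Hadoop 1.0.3 JARs in the new folder's client/ subfolder as described above.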

NOTE: I didn't try to get all functionality working or tested.  Specifically, I didn't try anything related to Hive, HBase, Sqoop, or Oozie.  For Hive, I'm hoping the PDI client will work against any Hive server running on an Apache Hadoop 0.20.x cluster, or any compatible configuration.  If I test any of these Hadoop technologies, I will update this blog post.

If you try this procedure (for 1.0.3, 1.0.x, or any other Hadoop distribution), let me know if it works for you, especially if you had to do anything I haven't listed here :)  Cheers!

Sunday, January 13, 2013

New PDI/Kettle project structure

In case you haven't heard, the Kettle project in Subversion has been restructured to be cleaner and to use Apache Ivy for dependency management.  This has been a long time coming, and PDI/Kettle is now more consistent with other Pentaho projects.  The "cut-over" from the old project to the new is scheduled to occur on Monday, January 14, 2013.

If you currently have changes in a working copy of Kettle trunk, you should not commit them directly into the new structure, because the file locations have changed.  For example, all the Kettle modules' source code used to reside in folders named src-<module_name>, such as src-core or src-db.  The modules have been reorganized such that you can check out and build individual modules if you choose.  So now each module has its own folder under the root, such as core/ and db/.  Inside these folders are src/ folders, which contain the same files and package structure as the old Kettle project.  So the files that used to be in src-core/ are now in core/src.

Other structural changes may impact your working copy as well. For example, the old "ui/" folder is now located at "assembly/package-res/".  This is because "ui" is a Kettle module, so the "ui/" folder now holds that module, with the contents of the old "src-ui/" under "ui/src/".  The Ant build scripts have been updated to reflect this, and there is now a "create-dot-classpath" Ant target that will generate a ".classpath" and "<project_name>.launch" file to get you up and going in Eclipse.

For more information, consult the README.txt file in the root folder, as well as the readme.txt file in the plugins/ folder.

I wrote a quick Groovy script that maps any changed files in your current working copy to their new locations, so you will know where to commit the changes. I tried to extend it to create diff files to be used as patches, but I could never get it to work very well, probably because I'm using Cygwin on Windows with its Subversion command-line client.  It shouldn't be too hard for Linux users familiar with Groovy to extend the following script to automate the patch creation process.

The script is very simple and looks like this:

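// Maps old top-level source folders to their new locations.
// Only the mappings mentioned in this post are listed here; add others as needed.
def module_map = ['src-core':'core/src',
                  'src-db':'db/src',
                  'src-ui':'ui/src',
                  'src-plugins':'plugins']

'svn status'.execute().text.eachLine { line ->
    if (!line.trim()) return                 // skip blank lines
    def svn_op = line.charAt(0)              // first status column (M, A, D, ?, ...)
    def old_fname = line.substring(1).trim()
    def path_segments = old_fname.split('/')
    def old_module = path_segments[0]
    def new_module = module_map.get(old_module)
    // drop(1) is safe even when the path has only one segment
    def file_mapping = new_module ? "$old_fname -> ${new_module}/${path_segments.drop(1).join('/')}" : old_fname
    println "${svn_op}\t${file_mapping}"
}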

I also put the script on Gist here.

The module_map is the key to the location of the restructured files.  The script simply calls "svn status" and, for each changed file in your working copy, uses the module_map to show the new location of the file.  Note that some files will not be mapped to a new location; this is either because the location hasn't changed, or because I forgot a mapping :-P

To use it, simply create a script called merge_helper.groovy (or whatever you want to call it) with the above contents and place it in your working copy of the old project structure.  Run the script with the command "groovy merge_helper.groovy" and it should show you output that will look something like this:

M       .classpath
M       src-plugins/market/src/org/pentaho/di/core/market/ -> plugins/market/src/org/pentaho/di/core/market/
M       src-plugins/market/src/org/pentaho/di/core/market/messages/ -> plugins/market/src/org/pentaho/di/core/market/messages/
M       src-plugins/market/src/org/pentaho/di/ui/spoon/dialog/ -> plugins/market/src/org/pentaho/di/ui/spoon/dialog/
M       src-ui/org/pentaho/di/ui/spoon/job/ -> ui/src/org/pentaho/di/ui/spoon/job/

If you have changes to ".classpath", be warned that there is no longer a version-controlled file called ".classpath".  If you have new JARs or source folders to commit, please consult the README files for guidance on how to update the appropriate files. You can also comment on this blog post with any questions about the migration process.

We hope you will find the new project structure easier to use, and the use of Apache Ivy will allow us to avoid many of the headaches that come with upgrading third-party JARs, especially to ensure that Pentaho products (which depend on each other) are using the same (or compatible) versions of their dependencies.