Saturday, September 20, 2014

PDI plugins and Dependency Hell

I've written quite a few plugins for Pentaho Data Integration, some are "finished" in terms of being in the PDI Marketplace, and some are still works in progress, Proofs of Concept, etc.  The usual pattern for my plugins is to integrate some 3rd party open-source tech with PDI. For one-JAR solutions, this is pretty straightforward and rarely causes problems.  For larger projects, however, this can be quite painful.

When a plugin is loaded into PDI, a new self-first ClassLoader is created, the JARs at the root of the plugin folder are scanned for plugin type annotations and plugin.xml files, and then the JARs in the lib/ folder of the plugin are added to the ClassLoader. This allows plugins to (sometimes) override the versions of JARs that are provided with PDI, and to provide their own JARs in a sandbox of sorts, so as not to infiltrate or influence other plugins.

However there are a few JARs that always seem to cause problems, where a plugin's version of the JAR will not play well with the version provided by PDI. Slf4J is one of these, as are the Jersey JARs. A lot of it has to do with which PDI classes are trying to use which of its own loaded classes while operating using a plugin which might have the same class in its own classloader. In this case, there are two separate versions of the class in the JVM, one loaded by the App classloader, and one by the plugin's classloader. As long as it's only the plugin referencing the class, there should be no problem; however, if the "duplicate" class is passed to a PDI-loaded class, then a ClassCastException can occur, as even though they appear to be the same class (same fully-qualified class name, e.g.), they are different due to the different classloaders.

I've run into this a few times in my projects, so I finally wised up and wrote a small Groovy script to look for common JARs between two paths. This script ignores the versions on the JARs (assuming they're named in "normal" style (name-version.jar). It doesn't work for JARs with classifiers (yet), it just does a quick substring looking for the last dash in the name.  For each JAR "common" to the two paths, it then prints the versions in each path.  This allows me to use transitive dependencies when building my plugin package, but then I can inspect which JARs might cause problems as they will already be provided by the PDI runtime when the plugin is deployed.

Here's the script code (Gist here):

commonLibs = [:]
leftLibs = []
rightLibs = []
if(args?.length > 1) {
    new File(args[0]).eachFileRecurse {
     fileName = (it.name - '.jar')
     fileNameNoVersion = fileName[0..fileName.lastIndexOf('-')-1]
     leftLibs += fileNameNoVersion
     commonLibs.putAt(fileNameNoVersion, commonLibs[fileNameNoVersion] ? (commonLibs[fileNameNoVersion] + [1:fileName]) : [1:fileName])
    }
    new File(args[1]).eachFileRecurse {
     fileName = (it.name - '.jar')
     fileNameNoVersion = fileName[0..fileName.lastIndexOf('-')-1]
     rightLibs += fileNameNoVersion
     commonLibs.putAt(fileNameNoVersion,  commonLibs[fileNameNoVersion] ? (commonLibs[fileNameNoVersion] + [2:fileName]) : [2:fileName])
    }
    leftLibs.intersect(rightLibs).each { lib ->
      println "${commonLibs[lib]}"
    }
}
else {
  // print usage
  println 'Usage: jardiff  \nDisplays the JARs common to both paths (ignoring version) and the versions of those JARs in each path.'
}


When I have more time and interest, I will switch the lastIndexOf() to a regex instead, to make it more robust, and I'm sure I can make things more "Groovy". However I just needed something to get me unstuck so I can start integrating more cool tech into PDI. Stay tuned to this blog and the Marketplace, hopefully I'll have something new soon :)

Cheers!

No comments:

Post a Comment