Wednesday, October 10, 2012

Getting Groovy with PDI

I'm a huge fan of Pentaho Data Integration and also of the Groovy programming language.  Since PDI is written in Java, and Groovy and Java are so tightly integrated, it seems only natural to bring some Groovy-ness into PDI :)

Some Groovy support already exists in the form of the Script step in the Experimental category.  This step allows any language that supports JSR-223 to be used in scripting a PDI step.  Also I believe there is some Groovy support in the Data Mining product in the Pentaho suite.
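
For anyone who hasn't used JSR-223 directly, here is a minimal sketch in plain Java of what "any language that supports JSR-223" boils down to: look up an engine by name, bind a variable, evaluate a script. The class and method names (Jsr223Sketch, evalWith) are made up for illustration, this is not the Script step's code, and the "groovy" engine is only found if a Groovy JAR that registers one is on the classpath.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import java.util.Optional;

class Jsr223Sketch {

    // Looks up a JSR-223 engine by name, binds one variable, and evaluates the script.
    // Returns Optional.empty() if no such engine is registered (or the script yields null).
    static Optional<Object> evalWith(String engineName, String script,
                                     String varName, Object value) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(engineName);
        if (engine == null) {
            return Optional.empty();
        }
        engine.put(varName, value); // make the variable visible to the script
        try {
            return Optional.ofNullable(engine.eval(script));
        } catch (ScriptException e) {
            throw new IllegalStateException("Script failed: " + script, e);
        }
    }

    public static void main(String[] args) {
        // Assumes a Groovy JSR-223 engine is available; otherwise prints the fallback text.
        Optional<Object> result =
                evalWith("groovy", "greeting.toUpperCase()", "greeting", "hello");
        System.out.println(result.map(Object::toString)
                .orElse("no 'groovy' engine on the classpath"));
    }
}
```

Handling the missing-engine case explicitly matters, since getEngineByName() returns null rather than throwing when the language isn't installed.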

What I wanted was to offer Spoon users a Groovy console where they could interact with the PDI environment.  It turns out that Groovy's groovy.ui.Console class provides this capability.  So I created a Spoon plugin with a Tools menu overlay offering the Groovy console:



Before the console opens (the first launch takes a while, as it loads all the classes it needs, including Swing), the plugin adds bindings to the singletons in the PDI environment:

- spoon: this variable is bound to Spoon.getInstance(), and thus offers the full Spoon API such as executeTransformation(), getActiveDatabases(), etc.

- pluginRegistry: this variable is bound to PluginRegistry.getInstance(), and offers the Plugin Registry API, such as findPluginWithName(), etc.

- kettleVFS: this variable is bound to KettleVFS.getInstance(), and offers such methods as getTextFileContent(), etc.

- slaveConnectionManager: this variable is bound to SlaveConnectionManager.getInstance(), and offers such methods as createHttpClient(), etc.

In addition to the above singletons, I also added the following variables that are resolved when the console comes up:

- trans: This variable is resolved to Spoon.getInstance().getActiveTransformation()

- defaultVarMap: This variable is resolved to KettleVariablesList.getInstance().getDefaultValueMap()

- defaultVarDescMap: This variable is resolved to KettleVariablesList.getInstance().getDescriptionMap()

- transGraph: This variable is resolved to Spoon.getInstance().getActiveTransGraph()
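
One consequence of resolving these when the console comes up is that they are snapshots: switch to a different transformation afterwards and the variable still points at the old one. Here is a minimal sketch of that difference in plain Java, so it runs outside Spoon. The Workbench class is a hypothetical stand-in for the Spoon singleton, and none of this is the plugin's actual code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

class BindingSketch {

    // Hypothetical stand-in for a singleton whose "active" item changes over time.
    static class Workbench {
        private String active = "transformation-1";
        String getActive() { return active; }
        void setActive(String name) { active = name; }
    }

    // Builds a variable table the way a console might: one value resolved right now,
    // and one supplier that asks the workbench again on every call.
    static Map<String, Object> bind(Workbench wb) {
        Map<String, Object> vars = new HashMap<>();
        vars.put("trans", wb.getActive());                         // snapshot
        vars.put("activeTrans", (Supplier<String>) wb::getActive); // live lookup
        return vars;
    }

    public static void main(String[] args) {
        Workbench wb = new Workbench();
        Map<String, Object> vars = bind(wb);
        wb.setActive("transformation-2"); // user switches tabs after the console opened

        @SuppressWarnings("unchecked")
        Supplier<String> live = (Supplier<String>) vars.get("activeTrans");
        System.out.println("trans       = " + vars.get("trans")); // transformation-1
        System.out.println("activeTrans = " + live.get());        // transformation-2
    }
}
```

The snapshot is cheaper and simpler to type, while the live lookup is the safer choice whenever the active tab may have changed.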

Then I added a few helper methods:

- methods(Object o): This method returns o.getClass().getDeclaredMethods()

- printMethods(Object o): This method will print a line with information for each of the methods obtained from calling methods() on the specified object

- props(Object o): This is an alias for the properties() method

- properties(Object o): This method returns o.getClass().getDeclaredFields()

- printProperties(Object o): This method will print a line with information for each of the fields obtained from calling properties() on the specified object

- activeTrans(): This method calls Spoon.getInstance().getActiveTransformation(), and is thus more up-to-date than using the "trans" variable.

- activeTransGraph(): This method calls Spoon.getInstance().getActiveTransGraph(), and is thus more up-to-date than using the "transGraph" variable.

- database(String dbName): This method returns the DatabaseMeta object for the database connection with the specified name. It searches the active databases (retrieved from Spoon.getInstance().getActiveDatabases()).

- step(String stepName): This method returns the StepMeta object for the step in the active transformation with the specified name.

- createdb(Map args): This method takes named parameters (passed to Java as a Map) of name, dbType, dbName, host, port, user, and password, and creates and returns a DatabaseMeta object.
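
The inspection helpers are thin wrappers over Java reflection. As a rough illustration, here is a plain-Java sketch of methods(), properties() and the two print helpers as described above. The InspectSketch and Sample classes are hypothetical, and the exact output format of the plugin's print helpers is an assumption (toGenericString() is used here).

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

class InspectSketch {

    // Methods declared directly on the object's class (inherited ones are not included).
    static Method[] methods(Object o) {
        return o.getClass().getDeclaredMethods();
    }

    // Fields declared directly on the object's class, whatever their visibility.
    static Field[] properties(Object o) {
        return o.getClass().getDeclaredFields();
    }

    // Prints one line per declared method, including modifiers and signature.
    static void printMethods(Object o) {
        for (Method m : methods(o)) {
            System.out.println(m.toGenericString());
        }
    }

    // Prints one line per declared field, including modifiers and type.
    static void printProperties(Object o) {
        for (Field f : properties(o)) {
            System.out.println(f.toGenericString());
        }
    }

    // Hypothetical sample class, just to have something to inspect.
    static class Sample {
        private int count;
        String name;
        void reset() { count = 0; }
    }

    public static void main(String[] args) {
        Sample s = new Sample();
        printMethods(s);
        printProperties(s);
    }
}
```

Because getDeclaredMethods() and getDeclaredFields() only look at the object's own class, members inherited from a superclass won't show up; to see those you would have to walk up getSuperclass() as well.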

Once all the bindings are done, the console is displayed:


This gives you a Groovy way to inspect the PDI environment, and since almost all of the API is exposed in one way or another, you can also edit metadata.  For example, say I'd like to run the above transformation against two different database connections.  Without this plugin, you'd run the transformation manually against one database connection, edit the steps to switch the connection, then run the transformation again.  With the Groovy Console you can automate this:


Right now I'm having PermGen memory issues when running transformations, so until I get that figured out the console is probably best used for looking at and editing metadata rather than executing "power methods" like executeTransformation().

Besides swapping database connections, you can also create a new one using the createdb() helper method. Here's an example of this being done, as well as testing the connection and adding it to the active transformation:


Using the API, you should also be able to create new steps, transformations, etc.  I'd be interested to hear any uses you come up with, so please comment here if and how you've used this plugin.

The code and downloads are up on GitHub; just unzip the folder into plugins/spoon.  Cheers!

UPDATE: It appears that in PDI 4.3.0 there is an older version of the Groovy JAR in libext/reporting, which causes the console plugin not to work. You can try replacing that JAR with the one included in the plugin, but I'm not sure if the reporting stuff will continue to work.  In the upcoming release, the version of the Groovy JAR has been updated to 1.8.0.
