Fun with Pentaho Data Integration: SuperScript PDI plugin

As readers of my blog know, I'm a huge fan of scripting languages on the JVM (especially Groovy), and of course I'm a huge fan of Pentaho Data Integration :) While using the (experimental) Script step to do various things, I saw a few places where a script step could be improved for easier use and more powerful features. Specifically I wanted:

A drop-down UI for selecting the scripting engine
Support for non-compilable JSR-223 scripting engines (such as AppleScript)
The ability to use the script's returned value as an output field
The ability to use the script step as an input step (no trigger needed)

To that end, I set out to refactor the existing Script step, and I'm happy to announce that the SuperScript step is now available in the PDI Marketplace:

As you can see from the screenshot above, I get a list of the output fields, which are the input fields plus the fields I specified in the table at the bottom of the dialog. Also notice that I did not define the variable/field "c" in the script, but I set "Script result?" to Y for "c", and thus "c" will contain the result of the script evaluation (in this case, b * rowNumber).

The following variables are available to each script:

step - a reference to the SuperScript step object
stepName - the name of the SuperScript step
transName - the name of the transformation
row - the current row's data
lastRow - the previous row's data
rowMeta - the metadata about the rows (e.g. field types)
rowNumber - the current row's number (starts with 1 like the rest of PDI)
SKIP_TRANSFORMATION
ABORT_TRANSFORMATION
ERROR_TRANSFORMATION
CONTINUE_TRANSFORMATION

These last four work the same way as they do in the Script step. Users of the Script (or Modified JavaScript) step will notice that I removed the leading underscores from some of the variables. This is to support script engines that don't allow leading underscores in variable names.

A noticeable addition is the "lastRow" variable. It will contain null (or be undefined) for the first row, but will contain the previous row's data for all subsequent rows. This opens the door for more powerful processing, such as filling empty fields with the previous row's value, changing script behavior based on whether a field value has changed since the last row, etc. UPDATE: Here is a screenshot of an example script that will fill the field (if null) with the previous field's value (if not null):
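As a rough illustration of the same fill-down idea (this is not the script from the screenshot), here is a sketch in plain Java. The class and method names (FillDown, fillDown) are made up for the example, rows are modeled as plain Object[] arrays, and it assumes that the already-filled row is what the next row sees as lastRow, so a run of nulls is filled all the way down. In SuperScript itself you would express the same check against the "row" and "lastRow" variables in your scripting language of choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FillDown {
    /**
     * Returns copies of the rows in which a null value at fieldIndex is
     * replaced by the previous row's value (if that one is not null).
     */
    static List<Object[]> fillDown(List<Object[]> rows, int fieldIndex) {
        List<Object[]> out = new ArrayList<>();
        Object[] lastRow = null; // there is no previous row before the first one
        for (Object[] row : rows) {
            Object[] current = Arrays.copyOf(row, row.length);
            if (current[fieldIndex] == null && lastRow != null && lastRow[fieldIndex] != null) {
                current[fieldIndex] = lastRow[fieldIndex];
            }
            out.add(current);
            lastRow = current; // assumption: the filled row is the next row's lastRow
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {"a", 1L});
        rows.add(new Object[] {null, 2L});
        rows.add(new Object[] {null, 3L});
        rows.add(new Object[] {"b", 4L});
        for (Object[] r : fillDown(rows, 0)) {
            System.out.println(Arrays.toString(r));
        }
    }
}
```

If lastRow instead held the untouched incoming row, only the first null after a non-null value would be filled, so it is worth checking which behavior you get before relying on it.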

Other helpful API calls include the following:

step.getTrans() (or just step.trans in Groovy) - gets a reference to the Transformation object

step.getTrans().findStepInterface("another step") - Gets a reference to a different step

step.putRow(rowMeta, rowData) - Adds a row to the output stream

As I mentioned before, the existing Script step must be "primed" with at least one incoming row before it will run. Usually this is done with a Generate Rows step that puts out 1 or more rows. Sometimes the script itself will generate the rows, so I wanted SuperScript to run at least once, whether or not there is an incoming row.
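The "run at least once" behavior can be sketched roughly like this in Java. This is only an illustration of the idea, not the plugin's actual code: the class RunAtLeastOnce and the method process are hypothetical names, the script is modeled as a Consumer, and the sketch assumes the script simply sees a null row when there is no input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

class RunAtLeastOnce {
    /**
     * Calls the script once per incoming row, or exactly once with a null
     * row when there are no incoming rows. Returns the number of calls made.
     */
    static int process(Iterator<Object[]> input, Consumer<Object[]> script) {
        int calls = 0;
        while (input.hasNext()) {
            script.accept(input.next());
            calls++;
        }
        if (calls == 0) {
            script.accept(null); // no input rows: still run the script once
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        List<Object[]> noRows = new ArrayList<>();
        int calls = process(noRows.iterator(),
                row -> System.out.println("script ran, row = " + Arrays.toString(row)));
        System.out.println("calls = " + calls);
    }
}
```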

Perhaps the most fun and powerful addition is the ability of SuperScript to run scripts with any JSR-223 script engine. The existing Script step requires that the script engine produce CompiledScript(s), which of course is the fastest option but is not always available. To that end, SuperScript will attempt to compile the script first, and if it cannot, it will fall back to evaluating (i.e. interpreting) the script(s). This opens the door for a lot of new scripting languages, such as Jython, AppleScript, and Renjin (an R ScriptEngine for the JVM).
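Here is a minimal sketch of the compile-first, interpret-as-fallback idea using only the standard javax.script API. It is not lifted from the plugin source. The class CompileOrInterpret, the method run, and the toy LookupEngine (a non-compilable engine that just returns the value of the variable named by the script) are all made up so the example is self-contained.

```java
import java.io.Reader;
import javax.script.AbstractScriptEngine;
import javax.script.Bindings;
import javax.script.Compilable;
import javax.script.CompiledScript;
import javax.script.ScriptContext;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineFactory;
import javax.script.ScriptException;
import javax.script.SimpleBindings;

class CompileOrInterpret {
    /** Compiles the script if the engine can, otherwise interprets it. */
    static Object run(ScriptEngine engine, String script, Bindings bindings)
            throws ScriptException {
        CompiledScript compiled = null;
        if (engine instanceof Compilable) {
            try {
                compiled = ((Compilable) engine).compile(script);
            } catch (ScriptException | UnsupportedOperationException e) {
                compiled = null; // compilation not possible: interpret instead
            }
        }
        return compiled != null ? compiled.eval(bindings) : engine.eval(script, bindings);
    }

    /** Hypothetical toy engine that does not implement Compilable. */
    static class LookupEngine extends AbstractScriptEngine {
        @Override
        public Object eval(String script, ScriptContext context) {
            // The "script" is just a variable name to look up in the bindings
            return context.getAttribute(script.trim());
        }

        @Override
        public Object eval(Reader reader, ScriptContext context) {
            throw new UnsupportedOperationException("not needed for this sketch");
        }

        @Override
        public Bindings createBindings() {
            return new SimpleBindings();
        }

        @Override
        public ScriptEngineFactory getFactory() {
            return null; // a real engine would return its factory here
        }
    }

    public static void main(String[] args) throws ScriptException {
        Bindings bindings = new SimpleBindings();
        bindings.put("rowNumber", 1L);
        System.out.println(run(new LookupEngine(), "rowNumber", bindings));
    }
}
```

With a real engine you would get the ScriptEngine from a ScriptEngineManager (for example via getEngineByName) and would probably compile once and reuse the CompiledScript for every row, rather than compiling on each call as this sketch does.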

To illustrate this, and to have "Fun with Pentaho Data Integration", I created a transformation with two SuperScripts running AppleScript, one just after a Generate Rows step, and one just before the last step. The first AppleScript starts iTunes, and the last one quits it. For long-running transformations, this pattern can be used to provide some musical entertainment while you wait for the transformation to finish :)

NOTE: The AppleScript script engine is, afaik, only available on the Mac, where it comes with the JDK.

I hope you find this plugin helpful, and I'd love to know how/if you are using it, and also how to improve it. One improvement I hope to add is the ability to choose different plugin(s) to link to, in order to use their capabilities. This would likely be done by creating a chained self-first classloader, and could be useful for things like Big Data capabilities from inside the SuperScript step.
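For anyone unfamiliar with the term, a "self-first" (or child-first) classloader looks in its own JARs before asking its parent, which is the reverse of Java's default delegation. Below is a rough sketch of just that part, with no chaining across plugins. The class name SelfFirstClassLoader is hypothetical and nothing like this exists in the plugin yet.

```java
import java.net.URL;
import java.net.URLClassLoader;

class SelfFirstClassLoader extends URLClassLoader {
    SelfFirstClassLoader(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                try {
                    c = findClass(name); // look in this loader's own URLs first
                } catch (ClassNotFoundException e) {
                    c = super.loadClass(name, false); // then use normal delegation
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // With no URLs of its own, everything falls through to the parent
        ClassLoader loader = new SelfFirstClassLoader(new URL[0], ClassLoader.getSystemClassLoader());
        System.out.println(loader.loadClass("java.lang.String") == String.class);
    }
}
```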

The code is Apache 2.0-licensed and available on GitHub at https://github.com/mattyb149/pdi-scriptengine-plugin

As always, I welcome all comments, questions, and suggestions. Until next time, have fun with Pentaho Data Integration :)

Cheers!

Tuesday, December 30, 2014
