Fun with Pentaho Data Integration: 2012

Tuesday, December 4, 2012

GroovyConsoleSpoonPlugin with JSR-223 Support

I recently gave a presentation of my GroovyConsoleSpoonPlugin (see earlier posts) to the Pentaho crew, and I got a lot of great feedback on it. Specifically, Pentaho Architect Nick Baker suggested it would be good to have the Groovy-ized API available as a step, so we could leverage it during transformation execution. For the presentation I had made changes to the Kettle code to allow the adding of RowListeners and TransListeners, so the Groovy Console plugin could interact with a running transformation. However, his suggestion makes a lot of sense, as I can keep the plugin a proper plugin, with no changes to Kettle code needed.

I thought about creating a Groovy scripting step and adding my staging code to make scripting easier. However that involves a lot of setup and boilerplate code that already exists in other steps (such as the JavaScript and Script steps). I still may do that, in order to leverage the full power of the Groovy language, but in the meantime it occurred to me that I could just create a script engine wrapper (using the javax.script.ScriptEngine interface, etc.) and use the experimental Script step as the vehicle.

So to that end I added GroovySpoonScriptEngine and GroovySpoonScriptEngineFactory classes, which wrap the existing plugin code inside a JSR-223 compliant scripting engine. Then the Script step can execute Groovy code (with my staging code already injected) during a transformation.

To get Spoon to recognize my scripting engine, I had to add my plugin directory to the list of libraries when launching Spoon. This is done in launcher/launcher.properties like so:

After starting Spoon, I created a transformation with a Generate Rows (for one row) with a field called "fld1" with a value of "hello". Then I wired it to a Script step, which is wired to a Dummy step.

The key to selecting a scripting engine in the Script step is to name the step with an extension that corresponds to the engine name. So for the Groovy scripting engine the step name ends in ".groovy", and for my GroovySpoonScriptEngine the step name must end in ".groovyspoon". My sample transformation looks like this:

Inside the Script step I put some very simple code showing the use of the row data in GStrings, output variables, local variables (which are not returned to the step processor), and the Groovy-ized API available through the plugin:

The script step does have a bit of a limitation (due to its generic nature) that the output fields must be specified in the Fields table at the bottom of the dialog. This is in contrast, for example, to the User Defined Java Class (UDJC) step, which can add values to the row on-the-fly (see previous posts).

Previewing the Dummy step gives the following result:

So now the plugin supports the Groovy-ized API in the Groovy Console, the command line, and the Script step. With these three entry points, hopefully interacting with PDI will become alot easier and even more powerful!

I hope to get time to document the Groovy-ized API on the project's wiki. In the meantime, take a look in the code at staging.groovy, that contains most of the methods, properties, etc. available. In addition, of course, is the full Kettle API and all Groovy language features, so the sky's the limit when it comes to what can be done :)

The project is on GitHub under GroovyConsoleSpoonPlugin. As always, I welcome all comments, questions, and suggestions. Cheers!

Wednesday, November 7, 2012

UDJC to Verify Transformations

The "Verify Transformation" capability of Pentaho Data Integration (aka Kettle) is very handy for spotting issues with your transformations before running them. As a sanity check or as an auditing feature, I thought it would be nice to verify all transformations in the repository using a Kettle transformation.

To do this, I wrote a User Defined Java Class (UDJC) step to call the TransMeta.checkSteps() API method, just as the Spoon GUI's "Verify transformation" button does. However, instead of displaying a dialog, I put the same information out onto the stream.

In order to get access to the repository's transformations, I started with a Get Repository Names step. Since Kettle jobs don't (currently) have the same verification functionality, I set the filter to only return transformations:

NOTE: The UDJC step will accept all incoming rows (Jobs, Transformations, etc.) but will only process those whose "meta" object is of type TransMeta.

To get access to the TransMeta object for each transformation, I used the Auto Documentation Output step, with the output type set to METADATA. This puts (among other things) a field on the stream containing the TransMeta object associated with each transformation:

I wired the autodoc step to my UDJC step and output the same fields as are available in the GUI version:

If this is determined to be a useful step, I may turn it into a proper step plugin, to remove the need for the "show_successful" field, and to instead provide a dialog box to let the user choose which fields (and their names) to put out on the stream. UDJC steps are just an easy way for me to get basic functionality out there and try to get early feedback.

I created a dummy repository on GitHub so I could have a Downloads area where I will start storing sample transformations that contain UDJC steps, etc. This is slightly easier than putting the UDJC code on Gist or Pastebin, especially in this case since there are multiple steps involved. The direct link to the above transformation is here.

If you give this a try or otherwise have comments, I'm eager to hear them :)

Cheers!

Thursday, October 18, 2012

Groovy Console Spoon Plugin Update

I've been working on the Groovy Console plugin for Spoon, and I seem to have been able to sidestep the PermGen issues (at least the ones I was having during my last post). Also I added some more functionality, such as the ability to get JobMeta and JobGraph information using job, jobGraph, activeJob(), and activeJobGraph().

I really wanted to have the ability to do "automated" testing from this console, so I added a runTrans() helper method. If you pass in a TransMeta object, it will run that transformation; otherwise, it will try to run the active transformation.

However, in implementing the runTrans() method I ran into a couple of issues:

1) The execution settings dialog pops up if you run Spoon.executeTransformation() with no arguments.

2) If you are looping and trying to run the transformation once during each loop, the executions will step on each other as they are executed asynchronously.

To fix 1, I faked a call to executeTransformation() using a replay date of the current time, which prevents the execution settings dialog from being displayed. As far as getting the settings configured, that will be a little trickier. So runTrans() at the time of this writing works on transformations with no execution-specific settings, or on transformations that have already been run with execution-specific settings. This is because the transformation is effectively replayed.

To fix 2, I put in a hack after the call to executeTransformation() to sleep for 10 msec while the trans is NOT running, then to sleep for 100 msec while the trans IS running. This is a classic no-no for concurrency but it's in there to get things working in the happy-path case. This turns runTrans() into a synchronous call, which is fine for my purposes in the console.

Using the database() and runTrans() methods, I was able to automate the testing of a single transformation using two separate database connections:

One thing to note is that I'm changing the step metadata using "back-door" API calls, so the transformation graph is not marked as "dirty". If for some reason you need to save the changes you should be able to call some form of Spoon.save API method.

The updated code and downloads are on GitHub. Have fun!

Wednesday, October 10, 2012

Getting Groovy with PDI

I'm a huge fan of Pentaho Data Integration and also of the Groovy programming language. Since PDI is written in Java and Groovy and Java are so tightly integrated, it seems only natural to bring some Groovy-ness into PDI :)

Some Groovy support already exists in the form of the Script step in the Experimental category. This step allows any language that supports JSR-223 to be used in scripting a PDI step. Also I believe there is some Groovy support in the Data Mining product in the Pentaho suite.

What I wanted was to offer a Groovy console to the Spoon user where he/she could interact with the PDI environment. It turns out that Groovy's groovy.ui.Console class provides this capability. So I created a Spoon plugin with a Tools menu overlay offering the Groovy console:

And before it opens (the first time takes a while as it loads all the classes it needs including Swing), it adds bindings to the singletons in the PDI environment:

- spoon: this variable is bound to Spoon.getInstance(), and thus offers the full Spoon API such as executeTransformation(), getActiveDatabases(), etc.

- pluginRegistry: this variable is bound to PluginRegistry.getInstance(), and offers the Plugin Registry API, such as findPluginWithName(), etc.

- kettleVFS: this variable is bound to KettleVFS.getInstance(), and offers such methods as getTextFileContent(), etc.

- slaveConnectionManager: this variable is bound to SlaveConnectionManager.getInstance(), and offers such methods as createHttpClient

In addition to the above singletons, I also added the following variables that are resolved when the console comes up:

- trans: This variable is resolved to Spoon.getInstance().getActiveTransformation()

- defaultVarMap: This variable is resolved to KettleVariablesList.getInstance().getDefaultValueMap()

- defaultVarDescMap: This variable is resolved to KettleVariablesList.getInstance().getDescriptionMap()

- transGraph: This variable is resolved to Spoon.getInstance().getActiveTransGraph());

Then I added a few helper methods:

- methods(Object o): This method returns o.getClass().getDeclaredMethods()

- printMethods(Object o): This method will print a line with information for each of the methods obtained from calling methods() on the specified object

- props(Object o): This is an alias for the properties() method

- properties(Object o): This method returns o.getClass().getDeclaredFields()

- printProperties: This method will print a line with information for each of the fields obtained from calling properties() on the specified object

- activeTrans: This method calls Spoon.getInstance().getActiveTransformation(), and is thus more up-to-date than using the "trans" variable.

- activeTransGraph: This method calls Spoon.getInstance().getActiveTransGraph(), and is thus more up-to-date than using the "transGraph" variable.

- database(String dbName): This method returns the DatabaseMeta object for the database connection with the specified name. It searches the active databases (retrieved from Spoon.getInstance().getActiveDatabases())

- step(String stepName): This method returns the StepMeta object for the step in the active transformation with the specified name.

- createdb(Map args): This method takes named parameters (passed to Java as a Map) of name, dbType, dbName, host, port, user, and password, and creates and returns a DatabaseMeta object

Once all the bindings are done, the console is displayed:

This allows a Groovy way of doing inspection of the PDI environment, but since almost all of the API is exposed in one way or another, you can also edit metadata. For example, in the above transformation I'd like to run against two different database connections. Without this plugin you run the transformation manually against one database connection, then edit the steps to switch the connection, then run the transformation again. With the Groovy Console you can automate this:

Right now I'm having PermGen memory issues when running transformations, so until I get that figured out the console is probably best used for looking at and editing metadata rather than executing "power methods" like executeTransformation().

Besides swapping database connections, you can also create a new one using the createdb() helper method. Here's an example of this being done, as well as testing the connection and adding it to the active transformation:

Using the API, you should also be able to create new steps, transformations, etc. I'd be interested to hear any uses you come up with, so please comment here if and how you've used this plugin.

The code and downloads are up on GitHub, just unzip the folder into plugins/spoon. Cheers!

UPDATE: It appears that in PDI 4.3.0 there is an older version of the Groovy JAR in libext/reporting, which causes the console plugin not to work. You can try replacing that JAR with the one included in the plugin, but I'm not sure if the reporting stuff will continue to work. In the upcoming release, the version of the Groovy JAR has been updated to 1.8.0.

Sunday, September 30, 2012

Bucket Partitioner plugin

Lately I've been fooling around with the various PDI/Kettle plugin types while reading Matt Casters et al. book "Pentaho Kettle Solutions", and I reached the section on partitioners, which intrigued me. Around the same time, someone had asked a StackOverflow and/or Pentaho Forums question on how to get at certain data from a "flat" XML file, meaning the data he was interested in was not contained in a parent tag, rather it was a certain number of fields AFTER a particular tag.

The solution I proposed was to use the row number to create a sort of "bucket", where the first N rows would be in the first bucket, the next N in the second bucket, and so on. Then it occurred to me that this is just a different form of the Mod Partitioner, except you use the quotient of the division (over the number of partitions) rather than the remainder.

This seemed like an easy thing to implement since all the infrastructure code was already done by the ModPartitioner classes. So I copied the source files into a new project, using a @PartitionerPlugin annotation on my BucketPartitioner class (the ModPartitioner, being part of Kettle core, has its plugin attributes located in the kettle-partition-plugins.xml file).

All I had to change was to add a "/ nrPartitions" expression to the getPartition() code :)

Anyway, the code is up on GitHub (with the plugin JAR), and when I ran it against the sample customers-100.txt CSV file into partitioned Text File Output steps:

I got the results I desired for a partition size of 4, with the first 4 rows (starting with row number 0, so the first "group" only had 3 rows in the bucket versus the usual 4) in file-A.txt, the second 4 rows in file-B.txt, etc. I left the (% nrPartitions) expression in (as suggested by the authors) so that row 16 would be in bucket A and so on.

Now that I got my feet wet with partitioner plugins, I'm thinking about the following partitioners:

- A Number Range partitioner (much like -- and re-using code from -- the Number Range step)
- A String Difference partitioner (same algorithms available in Fuzzy Match & Calculator steps)
- An XPath-based partitioner?
- A dynamic partitioner of some kind

As always, I welcome all comments, questions, and suggestions :) Cheers!

Thursday, September 20, 2012

Creating a reproduction transformation using UDJC

I got some pretty good feedback about my previous UDJC step that creates a Data Grid model of a Kettle stream, but I noticed right away that the process for using it is a bit long. You have to paste in the step, wire it into the trans, run the trans, remove the steps before the UDJC, then wire the Data Grid in place of the UDJC, delete the UDJC, and save the trans with a different name.

The Kettle API provides all of the above operations, so I thought I should upgrade the step to create a new transformation with the Data Grid step as the source, connected to the rest of the transformation (after the UDJC step). Along the way I also found some issues with threading and metadata in the previous UDJC step, so I fixed them in the new one.

The code for the new "Create Reproduction Transformation" UDJC step is located on Gist here. You "install" it the same way as the previous version, either by pasting the code into a new UDJC step, or by pasting it in from a transformation that already contains the filled-in UDJC step. Then you wire it up in the stream:

Then run the transformation. The transformation runs as if the step weren't there (it passes through all data), but a new transformation is created, containing the Data Grid step hooked up to the "rest of the transformation":

Looking at the Data Grid step, you can see the data from the stream:

One thing to remember is that the step the Data Grid is wired to might have referenced the previous step(s). Because the metadata is different for each step, I couldn't interrogate the step to see if it referenced previous steps, and thus I couldn't change the setting. This means if the step after the Data Grid references previous steps (like Merge Rows for example), you will have to edit it to reference the Data Grid instead.

Anyway, I hope this helps! If you try it out, please let me know how/if it works, if it's useful, and if you'd like to see any other features. Cheers!

Tuesday, September 18, 2012

Create a Data Grid from a PDI step's output

Sometimes when folks write Jira cases against PDI, they post sample transformations that reproduce the problem, which can be very helpful. However, often the transformation references a database that PDI developers can't access, which renders the transformation not-so-helpful for debugging and reproduction purposes. We usually ask that the case writer create a Data Grid step or something that will recreate the offending data without requiring a database connection. This can get complex and may be beyond the skills of some users.

To that end, I wrote a User Defined Java Class (UDJC) that will read in rows from a step and create a Data Grid step in the transformation that contains the meta and data necessary to recreate the problem. The user can paste the UDJC step into their transformation, insert it after the step they wish to model, then run the transformation. At the end, a Data Grid step will be added to the transformation, and the user can substitute it into the transformation in place of the step as a row provider.

As an example, let's use the Get Step Plugins transformation from my previous post:

I can create a UDJC step and paste in the code (actually, I keep a transformation handy that contains only this step so I can easily cut-n-paste into transformations), then add it to the stream after Get Step Plugins:

After running or previewing the transformation, the Data Grid Step is created:

Then I insert the Data Grid after the Create Data Grid UDJC and remove the earlier steps:

Previewing the Input Steps step, I get the same results as I did with the Get Step Plugins source!

I added Undo support so you can use Ctrl-Z to undo the addition of the Data Grid step if you wish. The code for the UDJC is located on Gist here. I've already got ideas on how to make the "creating a reproduction transformation" process even easier, but I think this is a decent first step towards that. Let me know what you think!

Friday, September 14, 2012

UDJC as Input step

Most times I see examples (including my own) of the User Defined Java Class (UDJC) PDI step, I see a Generate Rows step in front of it, to act as a "trigger" to get things moving, as UDJC is a Scripting step, not a proper Input step. However, this is not a forced characteristic, meaning you can write a UDJC step that acts as an Input generator step!

The key to doing this is to not read in any rows using getRow(), and thus not to check for a null row. You can assume that processRow will be called once, and you ensure it will ONLY be called once by calling setOutputDone() at the end of processRow(), followed by returning false. The "Input Generator" version of the UDJC step I wrote to get all PDI Step Plugins (from my previous post) is on Gist here. Here's a screenshot of an example using the Input version to filter on Input plugins:

Look ma, no Generate Rows! :)

Get Step Plugins UDJC

I like the Step Information dialog in the Help menu of Spoon, it lists all the information about the loaded step plugins. It occurred to me that it might be nice to know this info while running a job or transformation, and while looking through the code, it turned out to be pretty easy.

To do this, I created a User-Defined Java Class (UDJC) step with the usual Generate Rows step as a trigger:

The code for the UDJC step is on Gist here. The results are similar to the Step Information dialog:

This code can easily be adapted to fetch all plugins, or even to specify the type(s) on the incoming stream.

Thursday, September 13, 2012

Parent-Child lookup

I'm writing this post in response to this question at StackOverflow (and the Pentaho forum post here), unfortunately as I am a new member at Stack Overflow they won't let me post images :-/

The question was:

I have a source table (from a MS DB) that contains a field "ParentID", which references the PK of "ID"

The target Postgres table is filled (new "ID" PK inserted from sequence) and the "ParentID" is set to null. The stream has a the new PK inserted within it.

So now at this stage i have a valid stream of

"ID","ParentID", "NewTargetID"
100 NULL 1
101 100 2

How do i do a "Stream Lookup" (or maybe something else) and lookup the values to create a new stream field to give me a new field of "NewParentID"

"ID","ParentID","NewTargetID","NewParentID"
100 NULL 1 NULL -< (How do i generate this ?)
101 100 2 1 -< (How do i generate this ?)

Thanks

David

My response is:

One way is to have two copies of the data, so you can use one as the lookup stream and one as the data stream:

The "Lookup by parent" step looks like this:

I ran this with a Data Grid step containing the values from the OP, and got these results:

I'll try to get this example transformation up on Box or our Wiki.

Friday, September 7, 2012

PDI Pseudo-step: Google Docs Fusion Table Input

As far as I can tell, the current Google Docs plugin for PDI works on Spreadsheets and perhaps other Google Docs data sources, but not Fusion Tables (which at the time of this writing are currently in beta). I whipped up a quick User-Defined Java Class (UDJC) step, the code is on Gist here.

A couple of notes:

The output fields (that I'm grabbing from the Fusion Table) are hardcoded in there right now
I didn't use the OAuth method of access, the username and password (along with table id) should be passed as input fields to the UDJC step.
The table id is available in the URL when you open the table in Google Docs.
You need the Google Docs plugin installed in PDI for this step to work.
Someday I'd like to integrate this with the real PDI Google Docs plugin :)

ANTLR Recognizer step plugin (v1)

While reading "Pentaho Kettle Solutions" Chapter 20 ("Handling Complex Data Formats"), Matt Casters et al pointed out that some data is semi-structured or unstructured, and that the rules for defining the structure "may be very complex and hard to formulate". They go on to offer regular expressions as a solution, but note that some regular expressions (such as the one for RFC 2822 email addresses) are "quite convoluted".

Fortunately there are better ways to define grammars for complicated data formats, and my favorite way is ANTLR. Grammars are specified in a BNF-like format, and ANTLR will generate Java source code for the lexer and parser(s). There is also a slick IDE for ANTLR grammars called ANTLRWorks, both tools are released under the BSD license.

Given the user-friendliness and Java-friendliness of these tools, I thought to make a validator step in Pentaho Data Integration where you can specify the ANTLR grammar file and it will generate the Java source for the grammar, then compile it and load it for use in validating content provided in the PDI stream:

Originally I wanted to use Janino to compile the Java source, but this posed two problems related to how ANTLR generates Java source. First, ANTLR inserts @SuppressWarnings annotations, which are not supported by Janino. I had code to remove these lines, but then I ran into a Janino bug that prevents the compilation of ANTLR-generated Java source files. For this reason I switched to javac for compilation. However, this requires that you have the JDK installed and the javac command in your path.

Once the plugin is installed (by extracting the ANTLRRecognizer folder into the plugins/steps folder under the PDI location, you can add the step from the Transform group. Double-clicking on the step brings up the configuration dialog, which looks like the following:

Here I have chosen to validate CSV files by using the grammar here, providing the content on the stream, the entry rule name, and the result fieldname. Currently the result field will contain a 'Y' or 'N' character whether the content was valid or not, respectively.

Unfortunately ANTLR does not throw exceptions as default error handling. For that reason, each grammar needs to override the reportError method to throw an AntlrParseException object. The code to be added to the grammar looks like:

@header {
import org.pentaho.di.trans.steps.antlrrecognizer.AntlrParseException;
}

@members {

     @Override

     public void reportError(RecognitionException e) {

          throw new AntlrParseException(e.getMessage());

}

}

The AntlrParseException class is provided with the ANTLRRecognizer.jar in the plugin folder. This exception will be caught by the ANTLR Recognizer step and the result field will contain a 'N'.

Notes

The Java source and class files are only generated when the grammar file has been modified since the last time the source and bytecode were generated. See the Future Work section for planned enhancements.

Many line-based grammars have a rule to parse a line and a rule that parses every line in the input. Using the Rule name field, the ANTLR Recognizer step can either validate the whole input (if you are loading content from a file, for example), or line-by-line (if the lines are rows on the stream). If the grammar has both rules, it does not need to be regenerated to use either rule.

Future Work

Having a validator is all well and good, but if you need to get at specific data in complex structures, you still need regular expressions with grouping or some complicated solution. I'm planning on specifying parser patterns and/or providing helper classes such that the ANTLR Recognizer can put recognized rules out on the stream. This would work similarly to the XPath functionality in XML steps, although any hierarchy would have to be baked into the grammar, as a rule matcher doesn't necessarily know what parent rule it is a part of. I may suggest or insist that this is the case, in order for a future version of the step to be more flexible.

If/When Janino 2.6.2 is released, I will try to switch to Janino in order to remove the JDK dependency.

It should be allowed to select a grammar that has already been compiled, rather than the grammar definition from a file. The step could search the usual PDI directories looking for JARs that contain classes that extend the ANTLR lexer and parser classes. The user should be able to select either a filename of an ANTLR grammar file (*.g) or from a dropdown of recognized grammars, and possibly even from a filename pointing at a grammar JAR that is not in the searched PDI directories.

Downloads

The project, including source and the plugin package (ready to drop into your PDI instance) are located on GitHub here.

Thursday, September 6, 2012

Getting Connection information on the stream

Sometimes I need access to the database connection information inside a PDI transformation. I can use a Java Script scripting step to fetch the metadata for all database connections and put them on the stream. Since the Java Script step is not an input step, I'll need a Generate Rows step to act as a trigger. It will have a Limit of 1 and no fields:

The JavaScript step looks like the following:

The code is on PasteBin here.

NOTE: Make sure the Compatibility mode checkbox is not checked (see above image).

This can be extended to fetch other information such as username, tablespace, etc. Alternatively, you can hop to Roland's transformation to grab all JDBC metadata.

Hello!

Hi there, I'm Matt Burgess, a Senior Java Developer for Pentaho Corporation. I mainly work on the Pentaho Data Integration (PDI) product, also known by its community name Kettle.

Sometimes I have a few side projects, pro tips, and cool features to showcase, so I created this blog to share my ideas, examples, etc. I welcome all feedback, suggestions, improvements, etc. so don't be shy :) Enjoy!

Fun with Pentaho Data Integration