Fun with Pentaho Data Integration: September 2012

Sunday, September 30, 2012

Bucket Partitioner plugin

Lately I've been fooling around with the various PDI/Kettle plugin types while reading Matt Casters et al. book "Pentaho Kettle Solutions", and I reached the section on partitioners, which intrigued me. Around the same time, someone had asked a StackOverflow and/or Pentaho Forums question on how to get at certain data from a "flat" XML file, meaning the data he was interested in was not contained in a parent tag, rather it was a certain number of fields AFTER a particular tag.

The solution I proposed was to use the row number to create a sort of "bucket", where the first N rows would be in the first bucket, the next N in the second bucket, and so on. Then it occurred to me that this is just a different form of the Mod Partitioner, except you use the quotient of the division (over the number of partitions) rather than the remainder.

This seemed like an easy thing to implement since all the infrastructure code was already done by the ModPartitioner classes. So I copied the source files into a new project, using a @PartitionerPlugin annotation on my BucketPartitioner class (the ModPartitioner, being part of Kettle core, has its plugin attributes located in the kettle-partition-plugins.xml file).

All I had to change was to add a "/ nrPartitions" expression to the getPartition() code :)

Anyway, the code is up on GitHub (with the plugin JAR), and when I ran it against the sample customers-100.txt CSV file into partitioned Text File Output steps:

I got the results I desired for a partition size of 4, with the first 4 rows (starting with row number 0, so the first "group" only had 3 rows in the bucket versus the usual 4) in file-A.txt, the second 4 rows in file-B.txt, etc. I left the (% nrPartitions) expression in (as suggested by the authors) so that row 16 would be in bucket A and so on.

Now that I got my feet wet with partitioner plugins, I'm thinking about the following partitioners:

- A Number Range partitioner (much like -- and re-using code from -- the Number Range step)
- A String Difference partitioner (same algorithms available in Fuzzy Match & Calculator steps)
- An XPath-based partitioner?
- A dynamic partitioner of some kind

As always, I welcome all comments, questions, and suggestions :) Cheers!

Thursday, September 20, 2012

Creating a reproduction transformation using UDJC

I got some pretty good feedback about my previous UDJC step that creates a Data Grid model of a Kettle stream, but I noticed right away that the process for using it is a bit long. You have to paste in the step, wire it into the trans, run the trans, remove the steps before the UDJC, then wire the Data Grid in place of the UDJC, delete the UDJC, and save the trans with a different name.

The Kettle API provides all of the above operations, so I thought I should upgrade the step to create a new transformation with the Data Grid step as the source, connected to the rest of the transformation (after the UDJC step). Along the way I also found some issues with threading and metadata in the previous UDJC step, so I fixed them in the new one.

The code for the new "Create Reproduction Transformation" UDJC step is located on Gist here. You "install" it the same way as the previous version, either by pasting the code into a new UDJC step, or by pasting it in from a transformation that already contains the filled-in UDJC step. Then you wire it up in the stream:

Then run the transformation. The transformation runs as if the step weren't there (it passes through all data), but a new transformation is created, containing the Data Grid step hooked up to the "rest of the transformation":

Looking at the Data Grid step, you can see the data from the stream:

One thing to remember is that the step the Data Grid is wired to might have referenced the previous step(s). Because the metadata is different for each step, I couldn't interrogate the step to see if it referenced previous steps, and thus I couldn't change the setting. This means if the step after the Data Grid references previous steps (like Merge Rows for example), you will have to edit it to reference the Data Grid instead.

Anyway, I hope this helps! If you try it out, please let me know how/if it works, if it's useful, and if you'd like to see any other features. Cheers!

Tuesday, September 18, 2012

Create a Data Grid from a PDI step's output

Sometimes when folks write Jira cases against PDI, they post sample transformations that reproduce the problem, which can be very helpful. However, often the transformation references a database that PDI developers can't access, which renders the transformation not-so-helpful for debugging and reproduction purposes. We usually ask that the case writer create a Data Grid step or something that will recreate the offending data without requiring a database connection. This can get complex and may be beyond the skills of some users.

To that end, I wrote a User Defined Java Class (UDJC) that will read in rows from a step and create a Data Grid step in the transformation that contains the meta and data necessary to recreate the problem. The user can paste the UDJC step into their transformation, insert it after the step they wish to model, then run the transformation. At the end, a Data Grid step will be added to the transformation, and the user can substitute it into the transformation in place of the step as a row provider.

As an example, let's use the Get Step Plugins transformation from my previous post:

I can create a UDJC step and paste in the code (actually, I keep a transformation handy that contains only this step so I can easily cut-n-paste into transformations), then add it to the stream after Get Step Plugins:

After running or previewing the transformation, the Data Grid Step is created:

Then I insert the Data Grid after the Create Data Grid UDJC and remove the earlier steps:

Previewing the Input Steps step, I get the same results as I did with the Get Step Plugins source!

I added Undo support so you can use Ctrl-Z to undo the addition of the Data Grid step if you wish. The code for the UDJC is located on Gist here. I've already got ideas on how to make the "creating a reproduction transformation" process even easier, but I think this is a decent first step towards that. Let me know what you think!

Friday, September 14, 2012

UDJC as Input step

Most times I see examples (including my own) of the User Defined Java Class (UDJC) PDI step, I see a Generate Rows step in front of it, to act as a "trigger" to get things moving, as UDJC is a Scripting step, not a proper Input step. However, this is not a forced characteristic, meaning you can write a UDJC step that acts as an Input generator step!

The key to doing this is to not read in any rows using getRow(), and thus not to check for a null row. You can assume that processRow will be called once, and you ensure it will ONLY be called once by calling setOutputDone() at the end of processRow(), followed by returning false. The "Input Generator" version of the UDJC step I wrote to get all PDI Step Plugins (from my previous post) is on Gist here. Here's a screenshot of an example using the Input version to filter on Input plugins:

Look ma, no Generate Rows! :)

Get Step Plugins UDJC

I like the Step Information dialog in the Help menu of Spoon, it lists all the information about the loaded step plugins. It occurred to me that it might be nice to know this info while running a job or transformation, and while looking through the code, it turned out to be pretty easy.

To do this, I created a User-Defined Java Class (UDJC) step with the usual Generate Rows step as a trigger:

The code for the UDJC step is on Gist here. The results are similar to the Step Information dialog:

This code can easily be adapted to fetch all plugins, or even to specify the type(s) on the incoming stream.

Thursday, September 13, 2012

Parent-Child lookup

I'm writing this post in response to this question at StackOverflow (and the Pentaho forum post here), unfortunately as I am a new member at Stack Overflow they won't let me post images :-/

The question was:

I have a source table (from a MS DB) that contains a field "ParentID", which references the PK of "ID"

The target Postgres table is filled (new "ID" PK inserted from sequence) and the "ParentID" is set to null. The stream has a the new PK inserted within it.

So now at this stage i have a valid stream of

"ID","ParentID", "NewTargetID"
100 NULL 1
101 100 2

How do i do a "Stream Lookup" (or maybe something else) and lookup the values to create a new stream field to give me a new field of "NewParentID"

"ID","ParentID","NewTargetID","NewParentID"
100 NULL 1 NULL -< (How do i generate this ?)
101 100 2 1 -< (How do i generate this ?)

Thanks

David

My response is:

One way is to have two copies of the data, so you can use one as the lookup stream and one as the data stream:

The "Lookup by parent" step looks like this:

I ran this with a Data Grid step containing the values from the OP, and got these results:

I'll try to get this example transformation up on Box or our Wiki.

Friday, September 7, 2012

PDI Pseudo-step: Google Docs Fusion Table Input

As far as I can tell, the current Google Docs plugin for PDI works on Spreadsheets and perhaps other Google Docs data sources, but not Fusion Tables (which at the time of this writing are currently in beta). I whipped up a quick User-Defined Java Class (UDJC) step, the code is on Gist here.

A couple of notes:

The output fields (that I'm grabbing from the Fusion Table) are hardcoded in there right now
I didn't use the OAuth method of access, the username and password (along with table id) should be passed as input fields to the UDJC step.
The table id is available in the URL when you open the table in Google Docs.
You need the Google Docs plugin installed in PDI for this step to work.
Someday I'd like to integrate this with the real PDI Google Docs plugin :)

ANTLR Recognizer step plugin (v1)

While reading "Pentaho Kettle Solutions" Chapter 20 ("Handling Complex Data Formats"), Matt Casters et al pointed out that some data is semi-structured or unstructured, and that the rules for defining the structure "may be very complex and hard to formulate". They go on to offer regular expressions as a solution, but note that some regular expressions (such as the one for RFC 2822 email addresses) are "quite convoluted".

Fortunately there are better ways to define grammars for complicated data formats, and my favorite way is ANTLR. Grammars are specified in a BNF-like format, and ANTLR will generate Java source code for the lexer and parser(s). There is also a slick IDE for ANTLR grammars called ANTLRWorks, both tools are released under the BSD license.

Given the user-friendliness and Java-friendliness of these tools, I thought to make a validator step in Pentaho Data Integration where you can specify the ANTLR grammar file and it will generate the Java source for the grammar, then compile it and load it for use in validating content provided in the PDI stream:

Originally I wanted to use Janino to compile the Java source, but this posed two problems related to how ANTLR generates Java source. First, ANTLR inserts @SuppressWarnings annotations, which are not supported by Janino. I had code to remove these lines, but then I ran into a Janino bug that prevents the compilation of ANTLR-generated Java source files. For this reason I switched to javac for compilation. However, this requires that you have the JDK installed and the javac command in your path.

Once the plugin is installed (by extracting the ANTLRRecognizer folder into the plugins/steps folder under the PDI location, you can add the step from the Transform group. Double-clicking on the step brings up the configuration dialog, which looks like the following:

Here I have chosen to validate CSV files by using the grammar here, providing the content on the stream, the entry rule name, and the result fieldname. Currently the result field will contain a 'Y' or 'N' character whether the content was valid or not, respectively.

Unfortunately ANTLR does not throw exceptions as default error handling. For that reason, each grammar needs to override the reportError method to throw an AntlrParseException object. The code to be added to the grammar looks like:

@header {
import org.pentaho.di.trans.steps.antlrrecognizer.AntlrParseException;
}

@members {

     @Override

     public void reportError(RecognitionException e) {

          throw new AntlrParseException(e.getMessage());

}

}

The AntlrParseException class is provided with the ANTLRRecognizer.jar in the plugin folder. This exception will be caught by the ANTLR Recognizer step and the result field will contain a 'N'.

Notes

The Java source and class files are only generated when the grammar file has been modified since the last time the source and bytecode were generated. See the Future Work section for planned enhancements.

Many line-based grammars have a rule to parse a line and a rule that parses every line in the input. Using the Rule name field, the ANTLR Recognizer step can either validate the whole input (if you are loading content from a file, for example), or line-by-line (if the lines are rows on the stream). If the grammar has both rules, it does not need to be regenerated to use either rule.

Future Work

Having a validator is all well and good, but if you need to get at specific data in complex structures, you still need regular expressions with grouping or some complicated solution. I'm planning on specifying parser patterns and/or providing helper classes such that the ANTLR Recognizer can put recognized rules out on the stream. This would work similarly to the XPath functionality in XML steps, although any hierarchy would have to be baked into the grammar, as a rule matcher doesn't necessarily know what parent rule it is a part of. I may suggest or insist that this is the case, in order for a future version of the step to be more flexible.

If/When Janino 2.6.2 is released, I will try to switch to Janino in order to remove the JDK dependency.

It should be allowed to select a grammar that has already been compiled, rather than the grammar definition from a file. The step could search the usual PDI directories looking for JARs that contain classes that extend the ANTLR lexer and parser classes. The user should be able to select either a filename of an ANTLR grammar file (*.g) or from a dropdown of recognized grammars, and possibly even from a filename pointing at a grammar JAR that is not in the searched PDI directories.

Downloads

The project, including source and the plugin package (ready to drop into your PDI instance) are located on GitHub here.

Thursday, September 6, 2012

Getting Connection information on the stream

Sometimes I need access to the database connection information inside a PDI transformation. I can use a Java Script scripting step to fetch the metadata for all database connections and put them on the stream. Since the Java Script step is not an input step, I'll need a Generate Rows step to act as a trigger. It will have a Limit of 1 and no fields:

The JavaScript step looks like the following:

The code is on PasteBin here.

NOTE: Make sure the Compatibility mode checkbox is not checked (see above image).

This can be extended to fetch other information such as username, tablespace, etc. Alternatively, you can hop to Roland's transformation to grab all JDBC metadata.

Hello!

Hi there, I'm Matt Burgess, a Senior Java Developer for Pentaho Corporation. I mainly work on the Pentaho Data Integration (PDI) product, also known by its community name Kettle.

Sometimes I have a few side projects, pro tips, and cool features to showcase, so I created this blog to share my ideas, examples, etc. I welcome all feedback, suggestions, improvements, etc. so don't be shy :) Enjoy!

Fun with Pentaho Data Integration