Fun with Pentaho Data Integration: 2014

Tuesday, December 30, 2014

SuperScript PDI plugin

As readers of my blog know, I'm a huge fan of scripting languages on the JVM (especially Groovy), and of course I'm a huge fan of Pentaho Data Integration :) While using the (experimental) Script step to do various things, I saw a few places where a script step could be improved for easier use and more powerful features. Specifically I wanted:

A drop-down UI for selecting the scripting engine
Allow non-compilable JSR-223 scripting (such as AppleScript)
Enable the use of the script's returned value as an output field
Enable the use of the script step as an input step (doesn't need a trigger)

To that end, I set out to refactor the existing Script step, and I'm happy to announce that the SuperScript step is now available in the PDI Marketplace:

As you can see from the screenshot above, I get a list of the output fields, which are the input fields plus the fields I specified in the table at the bottom of the dialog. Also notice that I did not define the variable/field "c" in the script, but I set "Script result?" to Y for "c", and thus "c" will contain the result of the script evaluation (in this case, b * rowNumber).

The following variables are available to each script:

step - A reference to the SuperScript step object
stepName - the name of the SuperScript step
transName - the name of the transformation
row - the current row's data
lastRow - the last row's data
rowMeta - the metadata about the rows (field types, e.g.)
rowNumber (starts with 1 like the rest of PDI)
SKIP_TRANSFORMATION
ABORT_TRANSFORMATION
ERROR_TRANSFORMATION
CONTINUE_TRANSFORMATION

These last 4 function the same way as they do in the Script step. Users of the Script (or Modified JavaScript) step will notice that I removed leading underscores from some of the variables, this is to support script engines that don't support leading underscores in variable names.

A noticeable addition is the "lastRow" variable, this will contain null (or be undefined) for the first row but will contain the previous row's data for all subsequent rows. This opens the door for more powerful processing, such as filling empty fields with the previous row's value, changing script behavior based on if a field value has changed since the last row, etc. UPDATE: Here is a screenshot of an example script that will fill the field (if null) with the previous field's value (if not null):

Other helpful API calls include the following:

step.getTrans() (or just step.trans in Groovy) - gets a reference to the Transformation object

step.getTrans().findStepInterface("another step") - Gets a reference to a different step

step.putRow(rowMeta, rowData) - Adds a row to the output stream

As I mentioned before, the existing Script step must be "primed" so that it will run at least once. Usually this is done with a Generate Rows step that puts out 1 or more rows. Sometimes the script itself will generate rows, so I wanted SuperScript to run at least once, whether there was an incoming row or not:

Perhaps the most fun and powerful addition is the ability of SuperScript to execute any JSR-223 Script Engine. The existing Script step requires that the Script Engine produce CompiledScript(s), which of course is the fastest but not always available. To that end, SuperScript will attempt to compile the script first, and if it cannot, it will fall back to evaluating (i.e. interpreting) the script(s). This opens the door for a lot of new scripting languages, such as Jython, AppleScript, and Renjin (an R ScriptEngine for the JVM).

To illustrate this, and to have "Fun with Pentaho Data Integration", I created a transformation with two SuperScripts running AppleScript, one just after a Generate Rows step, and one just before the last step. The first AppleScript starts iTunes, and the last one quits it. For long-running transformations, this pattern can be used to provide some musical entertainment while you wait for the transformation to finish :)

NOTE: The AppleScript script engine is afaik only available on the Mac and comes with its JDK.

I hope you find this plugin helpful, and I'd love to know how/if you are using it, and also how to improve it. One improvement I hope to add is the ability to choose different plugin(s) to link to, in order to use their capabilities. This would likely be done by creating a chained self-first classloader, and could be useful for things like Big Data capabilities from inside the SuperScript step.

The code is Apache 2.0-licensed and available on GitHub at https://github.com/mattyb149/pdi-scriptengine-plugin

As always, I welcome all comments, questions, and suggestions. Until next time, have fun with Pentaho Data Integration :)

Cheers!

Monday, December 8, 2014

How sorted (or sordid) is your data?

I've spent quite a bit of time looking at Pentaho Data Integration (aka Kettle) and trying to make it do things with external technologies and idioms, anywhere from Groovy, Drill, memcached, Redis, Hazelcast, and even Markov Chains. Recently though, I've been started to focus on the data coming in and out of PDI, and what I could learn from it (#datadiscovery). I'll be spending a lot more time with R and Drill soon, but as a small example of data discovery, I thought I'd look at "how sorted" data is.

Basically I wanted to know for an input stream (consisting of CSV files or database tables or whatever), is the stream close to being in a sorted state or not? I am currently looking into approximate and probabilistic methods (like Longest Increasing Sequence and an interesting "multiplayer" version here), but this post is about a brute-force method of finding the variance of distance between an element in a stream and where it would be if the stream were sorted.

Specifically, I looked at the rank (aka row number) of the incoming data as the row number of the raw input, then in parallel I sorted on the desired columns and ranked the sorted rows. I was looking for the distance between each row's value(s) and how far the rows were from their sorted position(s). My research (read: Google search and Wikipedia) brought me to the Spearman rank correlation coefficient.

For this I would need to sort the rows, then find the delta between the position of each desired column in the original rowset and the sorted rowset, then find the statistical dependence of the ranked values. There are more sophisticated techniques to determine the relationships between ranked items, but this one suited my purpose :)

Once I found the algorithm I was looking for, I set out to create an example using only PDI steps, with the following caveats:

1) No scripting steps: Of course you can do whatever you like with the Scripting steps, but if you don't know those programming languages, you're left with the rest of what PDI offers. Luckily the choices are plentiful and powerful.

2) No SQL steps: Most databases probably offer the kind of expressive power you'd need to write a "Spearman rho" function inline, and to be honest that's probably the best option performance-wise; but I was looking to create a data-agnostic, language-agnostic way to calculate the "sortedness" of a data set in PDI, as this could be used in a blending or refinery situation.

I decided to use the "customer-100.txt" sample file in PDI, and sorted on full name, in order to determine "how sorted" the customer data was in terms of customers' names. I designed the following transformation:

This transformation is on Gist and Box. The results:

The absolute value of the Spearman rho for customer-100.txt (when sorted on name) is 0.001272127. I used absolute value because I didn't care whether the stream was close to being sorted or reverse-sorted; if you care about that in your usage, then leave out the ABS(rho) calculation in the "Spearman's rho" step above.

Being so close to zero, we can determine that the data is not very well sorted, as a result of the Spearman rho telling us that there is no tendency for the raw data and the sorted data to follow any sort of trend of monotonicity (ascending or descending). If the values were to get closer to 1 (or to -1 if not using absolute value), then the stream would be closer to its sorted state and thus "more sorted". I set up a rudimentary Value Mapper step ("How sorted is the data?" above) to indicate whether the data was well-sorted or not. If you disable the sort path and enable the direct path around it, then the two rowsets will match and you will get 1.0 as the Spearman rho.

This might not be very useful to the PDI user-at-large, but I learned alot while working through it so I thought I'd share it here. Stay tuned for more Fun with Pentaho Data Integration ;)

Monday, November 10, 2014

ZooKeeper Input and Output steps in PDI

While working with Apache Drill and PDI (see previous posts), I found myself needing to read and write values to and from Drill's ZooKeeper instance. Since ZooKeeper can be (and is) used for many other applications besides Drill, I thought I'd write some simple ZooKeeper steps for PDI, namely ZooKeeper Input and ZooKeeper Output. Also I thought it would be nice to be able to view and edit values in my ZooKeeper instance while designing transformations, so I integrated a cool UI called Zooviewer into Spoon.

The ZooKeeper Input step takes paths to ZooKeeper values, and fills those values into the field names you select in the dialog:

The above screenshot shows the ZooKeeper Input dialog, notice that all PDI types are supported, as long as they can represent their values as a byte array.

The ZooKeeper Output step also takes field names and paths, and will recursively create the paths if you check the "Create path(s)?" checkbox.

As of version 1.1 of the plugin (now in the PDI Marketplace), the ZooKeeper Output step also supports variable and field substitution for the Path values, in some pretty cool ways:

1) In the Path column of the Output Fields table, you can use a variable/parameter, such as ${pathParam}

2) You can also use the field-substitution notation, which will inject values from the given field as values for the path(s). This is a little-known feature of PDI and as far as I know has only been implemented in the Mongo plugin and (now) the ZooKeeper Output step. To use this, suppose you have a bunch of key/value pairs on the PDI stream going into the ZooKeeper Output step, where the key is the path where you want the value stored in ZooKeeper. Then you'd set the Path value in the ZooKeeper Output dialog to ?{key}. Notice the question mark instead of the dollar-sign, this indicates a field substitution versus a variable substitution.

3) The ZooKeeper Output step will perform another variable substitution on the field value, in case your field values contain variables. Below is a screenshot showing this use case:

Notice the ZooKeeper Output dialog uses the ?{key} notation to get its path values from the key field in the stream. The key field values include a variable ${pathParam}, which is filled in at runtime (see "Execute a transformation" dialog under Parameters). Running this will create three paths and store 3 values.

To view and edit the values in your ZooKeeper instance from PDI, select Manage ZooKeeper from the Tools drop-down menu. This will bring up a view window and an edit window, where you can create and delete child nodes, change values, etc. Here is a view of my ZooKeeper instance after running the test transformations I showed above:

I'm interested to see if folks find this plugin useful and if so, how they are using it. As always, I welcome all questions, comments, suggestions, and contributions.

Cheers!

Tuesday, October 28, 2014

Scripting Extension Points in PDI

PDI Extension points are an awesome feature added to PDI 5.0 (and updated throughout 5.x) that allow you to hook into the operational aspects of your ETL processes to provide finer-grained control, optimizations, additional auditing/logging, or whatever your heart desires. Extension points abound in the system now, from places such as Transformation Start/Finish, Job Entry start/finish, Mouse Down/Double-click, Carte startup/shutdown, Database connect/disconnect, and the list goes on (see the above link for the current list).

Writing an extension point plugin is already pretty darn easy, a basic template might look like this:

@ExtensionPoint(
  id = "CarteShutdownScript",
  extensionPointId = "CarteShutdown",
  description = "Executes script(s) when a Carte instance is shut down"
)
public class CarteShutdownExecutor implements ExtensionPointInterface {

  public CarteShutdownExecutor() {
    // Do init stuff here
  }

  @Override
  public void callExtensionPoint( LogChannelInterface log, Object o ) throws KettleException {
    WebServer server = (WebServer)o;

    // Do what you want here
  }
}

However, unless you have a reusable project template for your IDE (which is not a bad idea by the way), then building and deploying an extension point plugin may be more difficult than writing one. Also, any code changes require a re-compile and re-deployment. I've written a few of these and wished the whole process, although pretty easy, would be even easier.

With that in mind, I set out to write extension point plugins that take care of the boilerplate code, while still allowing the full expressive power of general-purpose scripting languages. Also I didn't want to pick a particular scripting language (although I prefer Groovy :), so instead I decided to allow any language that provides a JSR-223 compliant script engine. Rhino (JavaScript) and Groovy come with PDI out-of-the-box (and so does AppleScript on a Mac), but I tested a Jython script as well.

To use this capability, go to the PDI Marketplace and install the PDI Extension Point Scripting plugin. This will put the plugin in data-integration/plugins/pdi-script-extension-points. In that folder you will find two examples, TransformationStart.groovy and TransformationFinish.js. If you run a transformation you should see two additional log messages every time you run a transformation. One script is executed with the Groovy script engine, and one with the Rhino engine, respectively.

The convention for these scripts is as follows:

1) The script must be located in the plugins/pdi-script-extension-points folder
2) The name of the script must start with the extension point ID you wish to implement. After that ID you can put whatever you like, my plugin just does a startsWith() to see if it recognizes the ID.
3) The extension of the script must be recognized by a JSR-223 script engine in your classpath. So you can use .js for JavaScript files and .groovy for Groovy files, and if you've added something like Jython you can use .py as the extension
4) Besides any variable bindings provided by the engine, two more are added for your use, namely the two provided by the callExtensionPoint method:

_log_: a reference to the LogChannelInterface object, good for additional logging
_object_: a reference to the context object (Trans for TransformationStart, e.g.) The list is available on the wiki page. With dynamically typed languages you likely don't need to cast the _object_ value to the type listed on the wiki page.

The scripts are reloaded every time the extension point is invoked, so you can make updates to your script, re-run your transformation, and the updates will get pulled in. This allows for the use (and development of) scripting during design time that will be applied at run time.

The included examples are trivial, here's a slightly more involved script called StepAfterInitialize_each.groovy that adds a RowListener to each row for each step:

import org.pentaho.di.trans.step.*

_log_.logBasic "${_object_.combi.stepname} after init..."

rowAdapter = [
  rowReadEvent : { rowMeta, row -> 
     _log_.logBasic "Row size: ${row?.length}"
  }
] as RowAdapter

_object_.combi.step.addRowListener(rowAdapter)

Trying this with the "Delay row - Basic example" sample transformation, I get 10 lines of "Row size: 14". The transformation has two steps, so I might think I should get 20 lines (10 per step) of output, but a RowListener is called when a step reads a row, not outputs a row, so the Generate Rows step does not invoke the RowListeners.

I hope this is a helpful addition to the PDI ecosystem, and if so I'd love to hear about how you've used it, what kinds of crazy things you've tried, and especially how this can be improved. The code is open-source (Apache 2.0 license) on GitHub.

Cheers!

Friday, October 17, 2014

Flatten JSON to key-value pairs in PDI

I've heard a number of comments regarding JSON and PDI, most of them having to do with difficulties parsing nested documents, using JSONPath, etc. Personally, I've had a JSON doc I'd like to fetch fields from but I didn't want to try to figure out the JSONPath or document structure, I just wanted to get right to the values.

To achieve this (and to prove my point from a previous post on using Groovy Grape in PDI), I wrote a Groovy script to flatten a JSON document into key/value pairs in PDI. I needed the following elements:

1) An Apache Ivy installation/JAR. In a previous post I added the Ivy folder to the launcher.properties; for this I just dropped the single JAR into data-integration/lib before starting Spoon.

2) A JSON document read into the PDI stream and passed to a Script step

To use Groovy with the (still experimental) Script step, ensure the step name ends in ".groovy". This indicates to the Script step to find a JSR-223 Scripting Engine with the name "groovy". Since PDI comes with the Groovy script engine, you can use this out-of-the-box. To use other script engines, simply add the JAR(s) to the classpath (I usually drop them in data-integration/lib).

NOTE: The Script step does not find the Groovy script engine with Java 7 on Mac. This is documented in PDI-13074. You can use Java 6 but that is not officially supported by Pentaho for PDI 5.0+

For the Groovy script, I decided to use Jackson Databind to parse the JSON, using an example I found on Stack Overflow (here). Databind is not included with PDI, so I used the technique from my PRD post to @Grab it:

@Grab(group='com.fasterxml.jackson.core', module='jackson-databind', version='2.3.3')

Then I needed the ability to add more than one output row per input row. The script step was designed to operate on a row of data and add fields to that row by setting variables in the script (and specifying those variables/fields in the step dialog). Since I needed one input row to generate multiple output rows (one per JSON scalar object), I created the output row I wanted by explicitly adding the two fields I intended to add:

outputRowMeta = _step_.getInputRowMeta().clone();
_step_.stepMeta.stepMetaInterface.getFields( outputRowMeta, _step_.getStepname(), null, null, _step_, _step_.repository, _step_.metaStore );
outputRowMeta.addValueMeta(new ValueMetaString("key"))
outputRowMeta.addValueMeta(new ValueMetaString("value"))
outputRow = RowDataUtil.resizeArray( row, outputRowMeta.size()+2 )

Note that the Script step does this for you if you have 1 output row for every input row. As a result, I added all but the last JSON scalar, then let the Script step do the last one for me:

int outputIndex = rowMeta.size()
int count = 1
int numProps = map.size()
key = null
value = null
map.each {kv ->
  if(count < numProps) {
    keyIndex = outputIndex
    valueIndex = outputIndex+1
    if(keyIndex >= 0 && valueIndex >= 0) {
      outputRow[keyIndex] = kv.key
      outputRow[valueIndex] = kv.value
    }
    _step_.putRow(outputRowMeta, outputRow)
  }
  else {
    key = kv.key
    value = kv.value
  }
  count++
}

The entire script is a Gist located here, and here's a screenshot of the step dialog:

You can see where I added ".groovy" to the step name, as well as specifying the output fields in the table below (and using them in the else loop above).

I ran the step against the following JSON doc:

{
   "Port":
   {
       "@alias": "defaultHttp",
       "Enabled": "true",
       "Number": "10092",
       "Protocol": "http",
       "KeepAliveTimeout": "20000",
       "ThreadPool":
       {
           "@enabled": "false",
           "Max": "150",
           "ThreadPriority": "5"
       },
       "ExtendedProperties":
       {
           "Property":
           [                         
               {
                   "@name": "connectionTimeout",
                   "$": "20000"
               }
           ]
       }
   }
}

And got the following results:

Perhaps this will be helpful for you, either by using the script to flatten JSON, or as an example of using Groovy in the Script step, and/or using @Grab to get dependencies on-the-fly in PDI.

Cheers!

Wednesday, October 1, 2014

List Zookeeper Nodes and Data with Groovy

Here's a quick Groovy script to recursively list Zookeeper nodes (and optionally, data), also on Gist here. What does this have to do with PDI, you may ask? Stay tuned ;)

@Grab('org.apache.zookeeper:zookeeper:3.4.6')

import org.apache.zookeeper.*
import org.apache.zookeeper.data.*
import org.apache.zookeeper.AsyncCallback.StatCallback
import static org.apache.zookeeper.ZooKeeper.States.*

final int TIMEOUT_MSEC = 5000
final int RETRY_MSEC = 100

def num_retries = 0
def print_data = args?.length > 1 ? Boolean.valueOf(args[1]) : false
def path = args?.length > 0 ? [args[0]] : ['/']
noOpWatcher =  { event -> } as Watcher

listKids = { parentList, level ->
  if(parentList != null) {
    parentList.each { parent ->
      parentPath = parent?.startsWith('/') ? parent : ('/'+parent)
      level.times() {print '  '}
      println parentPath
      dataStat = new Stat()
      try {
        bytes = zk.getData(parentPath, true, dataStat)
        if(dataStat?.dataLength > 0 && bytes && print_data) {
          level.times() {print '  '}
          println new String(bytes)
        }
      }
      catch(e) {}
      try {
        kids = zk.getChildren(parentPath, true)
        if(kids && kids.size() > 0) {
          listKids(kids.collect{parentPath+(parentPath.endsWith('/') ? '' : '/') +it}, level+1)
        }
      }
      catch(e) {}
    }
  }
}
  
zk = new ZooKeeper('localhost:2181', TIMEOUT_MSEC , noOpWatcher)
while( zk.state != CONNECTED && num_retries < (TIMEOUT_MSEC / RETRY_MSEC) ) {
  Thread.sleep(RETRY_MSEC)
  num_retries++
}
if(zk.state != CONNECTED) {
  println "No can do bro, after $TIMEOUT_MSEC ms the status is ${zk.state}"
  //System.exit(1)
}
else {
  listKids(path, 0)
}

zk.close()

Monday, September 22, 2014

Groovy Datasources with Pentaho Report Designer

Ok, so this blog is called "Fun with Pentaho Data Integration", but I recently fielded a question about using scriptable data sources in Pentaho Report Designer (PRD), and rather than start a whole new blog, I thought I'd just post it here. The techniques are generally usable for all Pentaho design tools that support scripting, including PDI :)

I had the following goals with this POC:

1) Instead of just using Groovy to build a TypedTableModel with static values, I wanted to get the data from an external source. I chose a Twitter search with the #Pentaho hashtag.

2) I want to showcase the ability to use the Groovy Grapes/Grab functionality, to minimize/remove the number of JARs needed in the PRD classpath. I was able to get it to zero, with one caveat (see below)

3) I wanted to show off Groovy's idioms and syntactic sugar to show how easy it is to generate a scripted data source for PRD. I got it down to 17 SLOC :)

Goals #1 and #2 led me to Twitter4J, a Java client for the Twitter 1.1 API. In order to use it in a Groovy script, I needed Twitter4J and its dependencies. The most straighforward way to do this is to use Groovy's Grab annotation:

@Grab(group="org.twitter4j", module="twitter4j-core", version="4.0.2")

However, PRD doesn't include Ivy (or at least the parts of Ivy) that Grab needs to resolve dependencies. Rather than add the JAR(s) to PRD, I just installed Ivy separately and added the folder to the "libraries" property of the launcher.properties file:

libraries=lib:lib/jdbc:lib/bigdata:docs:../../apache-ivy-2.4.0-rc1

I think there's only one JAR of interest so it might be just as easy to toss it in report-designer/lib, but I like to keep a clean shop where possible :)

NOTE to Mac users: I don't know how to do this with the Mac application distribution, it's probably not difficult but I usually just download the regular distribution and run the shell script to start PRD.

Once this was done, I started PRD, created a new report, and created a new Advanced->Scripting data source:

I selected "groovy" as the engine, created a new query called "Get Pentaho Tweets", and entered the following script (gist here):

@Grab(group="org.twitter4j", module="twitter4j-core", version="4.0.2")
import twitter4j.*
import twitter4j.api.*
import twitter4j.conf.*
import org.pentaho.reporting.engine.classic.core.util.TypedTableModel;

colNames = ['Screen Name', 'Tweet Date', 'Text'] as String[]
colTypes = [String, Date, String] as Class[]
model = new TypedTableModel(colNames, colTypes);

ConfigurationBuilder cb = new ConfigurationBuilder();
cb.setDebugEnabled(true)
  .setOAuthConsumerKey(System.getenv('TWITTER_CONSUMER_TOKEN'))
  .setOAuthConsumerSecret(System.getenv('TWITTER_CONSUMER_TOKEN_SECRET'))
  .setOAuthAccessToken(System.getenv('TWITTER_ACCESS_TOKEN'))
  .setOAuthAccessTokenSecret(System.getenv('TWITTER_ACCESS_TOKEN_SECRET'));
TwitterFactory tf = new TwitterFactory(cb.build());
Twitter twitter = tf.getInstance();

def query = new Query('#pentaho')
def result     = twitter.search(query)
result?.tweets.each { tweet ->
  model.addRow( [tweet.user.screenName, tweet.createdAt, tweet.text] as Object[] )
}
return model;

Note that I am reading in token info from the environment, you can also change this to use System properties or just hard code them in, although I (and the folks at Twitter4J) highly recommend against it. Perhaps the best approach is to require some/all of the tokens as report parameters. I didn't try this but I suspect it's quite possible with a scripted datasource.

For goal #3, I think both the creation of the model and the population of the model with data speak to how easy and succinct it is. Take a look at the BeanShell sample (which *almost compiles to Groovy) for the Sample report with a Scripted Datasource to see how much more verbose it is. I use the "as" keyword/idiom quite often in Groovy, usually when interacting with a Java API.

Anyway, that's all there is to it! The script creates a 3-column table containing the last 100 tweets mentioning "#pentaho", with the columns being the tweeter's screen name, the date the tweet was created, and the text of the tweet itself. Be careful not to exceed your rate limit; I didn't put in a try/catch for the search call, but you can do that in order to return an empty table or something as you like.

As I said before, this approach should be very similar for the Script step in PDI, although you won't be populating a model; rather, you'll specify the fields in the step dialog and fill them with the script. For consistency and maintainability, you could always use the PDI transformation as the datasource in your PRD report, but that's not what I was trying to do for this post :)

Cheers!

Saturday, September 20, 2014

Using Apache Drill with PDI

One of the non-Pentaho side projects I've become interested in is Apache Drill, I like all the different aspects of it and hope to contribute in some meaningful way shortly :) As a first step, without touching any code, I thought I'd see if I could configure PDI and Drill to play nicely together. My Proof-Of-Concept was a single Table Input step in PDI, using a Generic driver to point at my local Drill instance. I was able to query the datasources as described in the Drill wiki.

At first I wanted to be able to start Drill in embedded mode, as my end goal was to be able to provide a PDI plugin so "Apache Drill" would show up on the Database types list in the Database Connection dialog. However I ran into a bunch of classloading issues (see previous post), so I thought I'd try a different approach that worked much better. Here's my procedure:

1) Install Drill and Zookeeper.
2) Start Zookeeper (I used Standalone mode)
3) Start Drill in Distributed mode (I used runbit). By default this should look for a localhost:2181 ZK
4) Add a relative path pointing at the <path_to_drill>/jars/jdbc-driver folder to the "libraries" property in PDI's data-integration/launcher/launcher.properties file. For my install, I used ../../../../apache-drill-0.5.0-incubating/jars/jdbc-driver
4) Open PDI and create a new Transformation with a Table Input step.
5) In the Table Input step dialog, create a new Connection with the following properties:

Click the Test button, it may take a while the first time but it should come back with Success. Don't click the Explore button or try to look for tables or columns, there is an Unsupported Operation in a few of the JDBC calls. The reason I've found is usually that a Drill class has subclassed an Avatica class and hasn't overridden the methods that PDI is calling. This is common with newer technologies that provide JDBC drivers; the JDBC API is huge so many young drivers don't implement all the methods.

6) Click OK to exit the Database Connection dialog
7) Enter your SQL query in the Table Input step
8) Click Preview or run the transformation

I was able to run queries like the ones in the wiki (see link above) as well as things like:

SELECT GENDER, AVG(salary) AS AVG_SALARY FROM cp.`employee.json` GROUP BY GENDER

I haven't tried this with anything but Table Input. In my experience if I am getting UnsupportedOperationExceptions in the Database Connection Dialog, I won't get very far with other Pentaho tools. This is due to the lack of implemented methods in the driver. As a Pentaho employee, I've done a few things to "fix" these on the fly. I looked into doing the same for Drill while trying to create a plugin for it, but I spent too many hours in Dependency Hell and eventually gave up (so far) trying to embed a Drill client/instance in a PDI plugin. Now that I have it working with Drill in Distributed mode, perhaps I will give it another try. If I can get around the classloading / service locator problems with Jersey, perhaps I'll achieve my end goal :)

Cheers!

PDI plugins and Dependency Hell

I've written quite a few plugins for Pentaho Data Integration, some are "finished" in terms of being in the PDI Marketplace, and some are still works in progress, Proofs of Concept, etc. The usual pattern for my plugins is to integrate some 3rd party open-source tech with PDI. For one-JAR solutions, this is pretty straightforward and rarely causes problems. For larger projects, however, this can be quite painful.

When a plugin is loaded into PDI, a new self-first ClassLoader is created, the JARs at the root of the plugin folder are scanned for plugin type annotations and plugin.xml files, and then the JARs in the lib/ folder of the plugin are added to the ClassLoader. This allows plugins to (sometimes) override the versions of JARs that are provided with PDI, and to provide their own JARs in a sandbox of sorts, so as not to infiltrate or influence other plugins.

However there are a few JARs that always seem to cause problems, where a plugin's version of the JAR will not play well with the version provided by PDI. Slf4J is one of these, as are the Jersey JARs. A lot of it has to do with which PDI classes are trying to use which of its own loaded classes while operating using a plugin which might have the same class in its own classloader. In this case, there are two separate versions of the class in the JVM, one loaded by the App classloader, and one by the plugin's classloader. As long as it's only the plugin referencing the class, there should be no problem; however, if the "duplicate" class is passed to a PDI-loaded class, then a ClassCastException can occur, as even though they appear to be the same class (same fully-qualified class name, e.g.), they are different due to the different classloaders.

I've run into this a few times in my projects, so I finally wised up and wrote a small Groovy script to look for common JARs between two paths. This script ignores the versions on the JARs (assuming they're named in "normal" style (name-version.jar). It doesn't work for JARs with classifiers (yet), it just does a quick substring looking for the last dash in the name. For each JAR "common" to the two paths, it then prints the versions in each path. This allows me to use transitive dependencies when building my plugin package, but then I can inspect which JARs might cause problems as they will already be provided by the PDI runtime when the plugin is deployed.

Here's the script code (Gist here):

commonLibs = [:]
leftLibs = []
rightLibs = []
if(args?.length > 1) {
    new File(args[0]).eachFileRecurse {
     fileName = (it.name - '.jar')
     fileNameNoVersion = fileName[0..fileName.lastIndexOf('-')-1]
     leftLibs += fileNameNoVersion
     commonLibs.putAt(fileNameNoVersion, commonLibs[fileNameNoVersion] ? (commonLibs[fileNameNoVersion] + [1:fileName]) : [1:fileName])
    }
    new File(args[1]).eachFileRecurse {
     fileName = (it.name - '.jar')
     fileNameNoVersion = fileName[0..fileName.lastIndexOf('-')-1]
     rightLibs += fileNameNoVersion
     commonLibs.putAt(fileNameNoVersion,  commonLibs[fileNameNoVersion] ? (commonLibs[fileNameNoVersion] + [2:fileName]) : [2:fileName])
    }
    leftLibs.intersect(rightLibs).each { lib ->
      println "${commonLibs[lib]}"
    }
}
else {
  // print usage
  println 'Usage: jardiff  \nDisplays the JARs common to both paths (ignoring version) and the versions of those JARs in each path.'
}

When I have more time and interest, I will switch the lastIndexOf() to a regex instead, to make it more robust, and I'm sure I can make things more "Groovy". However I just needed something to get me unstuck so I can start integrating more cool tech into PDI. Stay tuned to this blog and the Marketplace, hopefully I'll have something new soon :)

Cheers!

Monday, April 28, 2014

PDI Plugin POJOs

I've been trying to figure out ways to make it dead-simple to create new plugins for Kettle / Pentaho Data Integration, and as a result I've got some GitHub projects using various approaches:

- pentaho-plugin-skeletons: Skeleton projects with the build artifacts and classes in-place, with heredoc explaining how to accomplish certain tasks. So far I only have the pdi-step-plugin-skeleton with a Gradle build and the heredoc is fairly lacking as I haven't had the time to fill in sample snippets. I also want to add a .travis.yml for folks that would like to use Travis for their CI needs.

- GroovyConsoleSpoonPlugin: This project serves three purposes: first, it can build a Spoon plugin that allows the user to bring up a Groovy Console within Spoon, which can access the entire environment and manipulate it in fairly cool ways (see my previous posts). Second, it can start a standalone Groovy Console that brings in Kettle dependencies in order to prototype features without a full running PDI environment. Third, it offers an internal DSL (based on Groovy of course) to make exploration of the PDI environment as simple as possible. This includes overloaded operators, additions to the Kettle API/SDK, and all sorts of helper methods/classes to make life easier. At some point I will likely move the DSL out of this project to a standalone project, but that won't be anytime soon I'm afraid.

- pdi-pojo: This project is the real subject of this blog post, and it aims to allow the PDI plugin developer to create a single class that extends some pdi-pojo class which provides all the boilerplate and "normal" handling of PDI plugin issues.

The pdi-pojo approach is to provide superclasses for common plugins (such as StepPluginPOJO) which provide delegates, default implementations, etc. for said plugins, thereby reducing the boilerplate code needed to get up and running with a new PDI plugin. This is accomplished by annotating fields in the subclass to indicate how they should be handled by the framework.

Here is an example (taken from the sample code in the project itself) for a StepPluginPOJO subclass that declares its fields as public members (you can also make them protected/private if there are bean getter/setter methods):

@Step( id = "TestStepPluginPOJO",
image = "test-step-plugin-pojo.png",
name = "StepPluginPOJO Test",
description = "Tests the StepPluginPOJO",
categoryDescription = "Experimental" )
public class TestStepPluginPOJO extends StepPluginPOJO {

@UI( label = "Enter value" )
public String testString;

@ExcludeMeta
public String testExcludeString;

public int testInt;

public boolean testBool;

@UI( hint = "Checkbox" )
public boolean testBoolAsText;

@UI( label = "Cool bool", value = "true" )
public boolean testBoolWithLabel;

@UI( label = "Start date", hint = "Date" )
public Date startDate;

@UI( label = "End TOD", hint = "Time" )
public Date endTime;

@NewField
public String status;

The public member variables will be processed on initialization of the plugin to determine which are metadata fields, which need UI representation in the plugin's dialog, etc. By default, all public member variables are treated as metadata fields; to exclude a variable you use the ExcludeMeta annotation (see above).

Perhaps the most helpful of the annotations is the UI annotation, as this will indicate to the framework how to display and handle the graphical user element(s) associated with the member. If a UI annotation (or a hint as to how it should be displayed) is absent, the framework will choose a default representation. For example, a boolean member will be (by default) displayed as a checkbox, but a String is displayed as a text field (with Kettle variables allowed within).

For the members above, the following dialog is displayed:

Note how the "value" attribute of the UI annotation will set the default value (or a default is determined based on type).

These members are just for testing the various UI components and annotations; the only one I'm using in the code the is the "status" field, which is annotated as a NewField, which means the framework will add the metadata to the outgoing row, so all you have to do is find the index of that field by name using indexOf(), then storing your value into the output row at that index (see code below).

Although almost all of the boilerplate for a step plugin is taken care of by StepPluginPOJO, one method remains abstract: processRow(). This is ensure that your subclass is actually doing something :)

For the example, I'm basically creating a simplified version of the Add Constants step, by adding a field called "status" to the row and setting its value to the value specified in the dialog box. Here is the body of the processRow() method, followed by a screen shot of the test transformation:

@Override
public boolean processRow( StepMetaInterface smi, StepDataInterface sdi ) throws KettleException {
Object[] r = getRow(); // get row, set busy!
// no more input to be expected...
if ( r == null ) {
setOutputDone();
return false;
}

if ( first ) {
first = false;
outputRowMeta = getInputRowMeta().clone();
getFields( outputRowMeta, getStepname(), null, null, this, repository, getMetaStore() );
}

// Allocate room on the row for the new fields
Object[] newRow = RowDataUtil.resizeArray( r, r.length + getNewFields().size() );

// Do processing here, add new field(s)
int newFieldIndex = outputRowMeta.indexOfValue( "status" );
if ( newFieldIndex == -1 ) {
throw new KettleException( "Couldn't find field 'status' in output row!" );
}
newRow[newFieldIndex] = status;

putRow( outputRowMeta, newRow ); // copy row to possible alternate rowset(s).

if ( checkFeedback( getLinesRead() ) ) {
if ( isBasic() ) {
logBasic( "Lines read" + getLinesRead() );
}
}

return true;
}

In this case I set the status variable to "Not sure", and my Generate Rows step created 10 rows with a String field named "x" containing the value "Hello". Here are the results, as expected:

I wondered what the performance impact would be for creating a step plugin with StepPluginPOJO versus writing the plugin by hand. I tried to keep the processRow() as simple as possible while trying to match it up to an existing step (hence the Add Constants example). I ran both the above transformation and its counterpart (with Add Constants replacing my test step) with a million rows, and for the most part I got the same performance. The biggest difference was a single run where the Add Constants version ran a million rows in 0.6 seconds and the TestStepPluginPOJO ran in 0.7 seconds.

Because most of the reflection is done at initialization and/or UI time, the performance should be close to writing the same plugin by hand. Of course, if performance is your goal you should write the plugin by hand to fine-tune all aspects of its execution. This project is here for rapidly prototyping plugins.

Having said that, you can use pdi-pojo and incrementally move away from the provided code by overriding whichever methods you choose. For example, if the auto-generated UI is too ugly or primitive for you, you can override the getDialogClassName() method and return one you implement by hand. If you want your own initialization code, override the init() method, and so on.

So there's lots more work to do, but I figure I have a good enough start to blog about this and to welcome any contributions to the project. If you get a chance to try it out, please let me know how it works for you! I will try to get the JAR onto Bintray or Sonatype or something like that shortly, but in the meantime just fork and clone my repo, run "gradle clean plugin", and then you can drop the JAR into whatever project you want (or publish it to your local Maven repo or whatever).

Cheers!

Thursday, March 6, 2014

Gradle Spoon Console Plugin

As my blog followers know, I've been trying to get a Groovy Console into Pentaho Data Integration's (PDI's) Spoon UI for quite a while now. I haven't put it into the Marketplace as we're wrestling with Groovy versions and I wanted to offer something that worked out-of-the-box versus something that needed extra setup/config. For PDI 5.1 this should be a moot point...

...then I realized, I've been trying to bring the Groovy Console to PDI. Why not bring PDI to the Groovy Console? This idea was borne from a blog post by Mike Hugo with a tip of the hat to carlosgsouza who created a Gradle plugin from it.

The idea of the post (and thus mine) is that you can leverage Gradle to fetch your dependencies and keep track of your needed classpath, then start a Groovy Console with that classpath passed in. Thus you can access your project's dependencies (APIs, etc.) via the console, allowing for prototyping, rapid development, etc. of features.

For PDI, my GroovyConsoleSpoonPlugin project already had the necessary requirements: it had PDI/Kettle dependencies, it leveraged the Groovy Console, and it used Groovy to provide an internal Domain-Specific Language (DSL) for interacting with the PDI environment. After fixing a few bugs in my project related to working with the Groovy Console outside of Spoon, I'm happy to announce that the first version of the Gradle Console for PDI is ready to roll :)

To use, get Gradle and then just clone my GitHub project and run "gradle console" (or if you don't want/have Gradle, go to your clone and use "gradlew console"). You will be presented with a Groovy Console with PDI core dependencies, steps, etc. available, as well as my first attempt at an internal DSL for building transformations and jobs (see earlier blog posts). To change the PDI version, just edit the build.gradle file looking for the "project.ext.kettle_dependency_revision" property (yes, I know that's deprecated ;)

Here's an example of creating and running a (VERY!) basic transformation. I hope to implement a Builder soon to fill in metadata, etc. :

I'm hoping to make this DSL (and the whole experience) as easy yet powerful as possible, please let me know your ideas!

PDI memcached plugins

I've definitely been neglecting this blog :) but I've recently put a couple of plugins into the PDI marketplace to read and write from memcached instances. I hope this leads to more key/value store support (I'm working on Hazelcast plugins right now) for PDI.

Anyway the Memcached Input/Output plugins allow for the reading and writing of key/value pairs to memcached instances. The Input step requires incoming key names, and the type of the values that will come out:

The Output step requires and value field names (it interrogates the types from the row metadata):

This is just an initial offering, eventually I'd like to add other features (I'm always open to suggestions for improvement!). But I figured I should get a foot in the door for key/value store support :) I also have been working on a Map (key/value) type for PDI, that should be out in 5.1 or before if I can get some supporting changes into the 5.0.x code line.

Please feel free to install this from the PDI Marketplace and let me know how it works for you and/or how it could improve :) I only tested this on PDI 5.0, so I'm not sure it will work on PDI 4.x but I don't see any reason it shouldn't.

Cheers!

Saturday, March 1, 2014

Groovy Memcached "client"

Ok so this post is not PDI related (yet, stay tuned :) but in my search for easy memcached client UIs I came up fairly shorthanded unless I wanted to buy something, write a Java client, or install a package. All I needed to do for my test was to be able to set and get a couple of values and I didn't really want to start a new project, build a Java app, etc.

There's a very easy way to do this with Groovy Grapes and the spymemcached client (and the Groovy Console as a "UI"). Just @Grab the spymemcached JAR, connect a client, and off you go:

The script is as simple as this:

@Grab('net.spy:spymemcached:2.10.5')

import net.spy.memcached.MemcachedClient

def memcachedClient = new MemcachedClient( new InetSocketAddress('127.0.0.1', 11211 ) );

memcachedClient.set('myKey',3600, "Hello world!")

memcachedClient.set('intKey',3600, 45)

println "myKey = ${memcachedClient.get('myKey')}"

println "intKey = ${memcachedClient.get('intKey')}"

It's pretty easy to change this to a command-line interface (CLI) to take parameters for keys, expiration times, etc. Note I'm using the synchronous gets/sets but this is just for testing.

Anyway this came about while writing my Memcached Input/Output step plugins for PDI, keep your eye on the Marketplace, I hope to have them released this week. Cheers!

Fun with Pentaho Data Integration