

Pentaho Kettle with HPCC plugins

Questions and comments related to the Pentaho Kettle and Spoon plugins

Tue Feb 05, 2013 11:28 am

Hi,

I'm using Pentaho Kettle with the HPCC plugins, which provide most of the job entries I need: SPRAY, DATASET, OUTPUT, PROJECT, DEDUP, GENERIC CODE, and many more.

I have sprayed two physical files from my local machine to the HPCC system using the SPRAY option in Pentaho Kettle, and that works fine. Suppose I want to write a function using Pentaho Kettle; can I do that? I tried writing a function using the GENERIC CODE option, where my whole logic resides, but I'm unable to run the job because it does not load in the Pentaho Kettle IDE.
sapthashree
 

Tue Feb 05, 2013 2:39 pm

When you entered the code in the Generic Code job entry, are you saying it didn't execute, or that it didn't save the code?

If it is an execution problem, check your workunit in the ECL Watch interface (http://[clusterip]:8010), see whether it includes your custom code, and post any error listed there.

As for adding custom features, please check the GitHub account: you can download the source and extend it as needed, or, if there are particular needs, post them here. There's no guarantee that someone working on the plugins will get to them, but it helps us know what features people are looking for.

The two related github projects are:
https://github.com/hpcc-systems/java-ecl-api
https://github.com/hpcc-systems/spoon-plugins
joe.chambers
 

Wed Feb 06, 2013 6:06 am

Hi,
When I enter the code in a GENERIC CODE job entry it saves the job, but as soon as I run the job it says "Unable to load the job from XML file [D:\HPCC\Pentaho_Kettle\spoon_projects\spoon\re_trial.kjb]" and "Error reading information from input stream".

What I have done in ECL is this: in one builder window I have written the function, with the code below.
Code:
IMPORT Std; // needed for Std.Str.ToUpperCase

EXPORT tagModule := MODULE

  dataDictRec := RECORD
    INTEGER Keynum;
    STRING Parent;
    STRING Key;
    STRING Related;
  END;
  dictInfoData := DATASET('~kettle::trial::dictdata', dataDictRec, THOR);

  strRec := RECORD
    STRING strTag;
  END;

  EXPORT STRING tagFunction(STRING strTitleToTag, STRING strLinkToTag) := FUNCTION
    titleLinkStr := Std.Str.ToUpperCase(strTitleToTag + ' ' + strLinkToTag);

    strRec FindTag(dictInfoData L) := TRANSFORM
      // Treat Related as a regex; SKIP drops rows that do not match
      SELF.strTag := IF(REGEXFIND(L.Related, titleLinkStr), L.Key, SKIP);
    END;
    foundTags := PROJECT(dictInfoData, FindTag(LEFT));
    RETURN foundTags[1].strTag;
  END;
END;


And in another builder window:

Code:
AssetMetadataRec := RECORD
  STRING AssetId;
  STRING AssetIdHash;
  STRING Title;
  STRING LastModifiedDateTime;
  STRING InsertDateTime;
  STRING assetTag;
END;
assetMetaDataDS := DATASET('~kittle::trial::metadata', AssetMetadataRec, THOR);

// Assumes tagModule is saved in the repository so it can be referenced here
AssetMetadataRec tagXfm(AssetMetadataRec L) := TRANSFORM
  SELF.assetTag := tagModule.tagFunction(L.Title, L.AssetId);
  SELF := L;
END;
taggedAssetMetaDataDS := PROJECT(assetMetaDataDS, tagXfm(LEFT));
taggedAssetMetaDataDS;


The above code works fine in ECL. I want to do the same thing in Pentaho Kettle with the HPCC plugins. To that end, I wrote the entire code above in a single GENERIC CODE job entry, and as I said, the job then fails to load in the Pentaho Spoon IDE. If instead I write only the function part in a GENERIC CODE job entry and build the rest with other job entries such as DATASET and PROJECT, the function (tagFunction) is not available to the transform (tagXfm).
How can this be done?
sapthashree
 

Fri Feb 08, 2013 4:50 pm

Can you post a small sample of the datasets you are using (a few lines of each) so I can test in more detail?

One thing to note: in the IDE you export and import modules, whereas Pentaho doesn't have this feature; everything is done in one "file".

Staying with the use of custom code in the Generic Code job entry, give this a try. Note that it compiles within Pentaho, but since I don't have your data I haven't tested actual execution. I would, however, create the non-module part using the Spoon job entries.

Code:
IMPORT Std; // needed for Std.Str.ToUpperCase

tagModule := MODULE

  dataDictRec := RECORD
    INTEGER Keynum;
    STRING Parent;
    STRING Key;
    STRING Related;
  END;
  dictInfoData := DATASET('~kettle::trial::dictdata', dataDictRec, THOR);

  strRec := RECORD
    STRING strTag;
  END;

  EXPORT STRING tagFunction(STRING strTitleToTag, STRING strLinkToTag) := FUNCTION
    titleLinkStr := Std.Str.ToUpperCase(strTitleToTag + ' ' + strLinkToTag);

    strRec FindTag(dictInfoData L) := TRANSFORM
      SELF.strTag := IF(REGEXFIND(L.Related, titleLinkStr), L.Key, SKIP);
    END;
    foundTags := PROJECT(dictInfoData, FindTag(LEFT));
    RETURN foundTags[1].strTag;
  END;
END;

AssetMetadataRec := RECORD
  STRING AssetId;
  STRING AssetIdHash;
  STRING Title;
  STRING LastModifiedDateTime;
  STRING InsertDateTime;
  STRING assetTag;
END;
assetMetaDataDS := DATASET('~kittle::trial::metadata', AssetMetadataRec, THOR);

AssetMetadataRec tagXfm(AssetMetadataRec L) := TRANSFORM
  SELF.assetTag := tagModule.tagFunction(L.Title, L.AssetId);
  SELF := L;
END;
taggedAssetMetaDataDS := PROJECT(assetMetaDataDS, tagXfm(LEFT));
taggedAssetMetaDataDS;

joe.chambers
 

Tue Feb 19, 2013 11:17 am

Hi,

Can we write Roxie queries in Pentaho Kettle? If yes, which job entry should be used to write Roxie queries and publish them?

I also have one more question: can we create dependencies between two jobs in Pentaho Kettle? That is, if I have two jobs, say job1 and job2, can I send the output of job1 as an input to job2 (i.e., if I run job2, it should internally call job1 so that both jobs are executed)?
sapthashree
 

Tue Feb 19, 2013 2:49 pm

No, we don't currently support ROXIE.

As for sub-jobs, it would actually be best to output your data on Thor as a Thor file and then, in the next job, use that data. If you try to use Kettle to move the data between the jobs, it would have to download the entire dataset and push it back to the cluster, which wouldn't be the best approach.

If you need to move the data into a non-Thor Pentaho mode, then do an OUTPUT that writes it out as a CSV, and in the next job you can pick up this CSV.
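To illustrate, here is a minimal ECL sketch of both hand-offs; the logical filenames are hypothetical placeholders, and AssetMetadataRec is the layout from your earlier post.

Code:
// --- Job 1: write the results out as a Thor logical file ---
OUTPUT(taggedAssetMetaDataDS, , '~kettle::trial::tagged', OVERWRITE);

// --- Job 2: read the file that job 1 wrote ---
taggedDS := DATASET('~kettle::trial::tagged', AssetMetadataRec, THOR);

// --- CSV variant, for picking the data up in a non-Thor Pentaho step ---
OUTPUT(taggedAssetMetaDataDS, , '~kettle::trial::tagged_csv', CSV, OVERWRITE);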

Moving data from the ECL plugins into the default Kettle plugins does need a little refining, and it is something we could implement in the future if there is enough demand. Let me know if the two solutions above will work for you.
joe.chambers
 

Tue Feb 19, 2013 2:59 pm

For ROXIE support, there are a few built-in features that would allow you to fetch data.

Take a look at http://wiki.pentaho.com/display/EAI/Web+services+lookup. The experimental plugin listed there looks like it can be pointed at the WSDL generated by ROXIE, which would allow you to call Roxie.

There are probably additional SOAP interfaces, as well as JSON plugins, that may work.
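For completeness, a published Roxie query can also be called over SOAP directly from ECL using SOAPCALL. This is just a rough sketch, not the Pentaho route; the endpoint, query name, and field names below are hypothetical.

Code:
// Layout matching the (hypothetical) query's response row
OutRec := RECORD
  STRING strTag{XPATH('strTag')};
END;

// WsEcl SOAP endpoint for a query published to a Roxie target
svcURL := 'http://[clusterip]:8002/WsEcl/soap/query/roxie/tagquery';

// Call the query, passing its (hypothetical) input parameter
results := SOAPCALL(svcURL, 'tagquery',
                    {STRING title := 'some title'},
                    DATASET(OutRec));
OUTPUT(results);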
joe.chambers
 

Wed Feb 20, 2013 10:02 am

Hi Joe,
As you said in the previous post, for sub-jobs it is best to output the data on Thor as a Thor file and then use that data in the next job. I tried exactly that: I saved the output of job1 as a Thor file and used the same Thor file in job2. But if I do that, job2 does not execute (it throws an error) unless I run job1 first, so I have to run job1 prior to job2 (i.e., I must run both jobs myself). What I want is to run only job2 and have it internally call job1. Can we establish dependencies between these two jobs? Is there any alternative way to do this?

You also mentioned the web services lookup that points at the WSDL generated by ROXIE, which would allow me to call Roxie. But I am trying to write and publish Roxie queries using Pentaho Kettle, and as you said, Pentaho Kettle doesn't support ROXIE.
sapthashree
 

Tue Apr 16, 2013 9:26 am

Hi,

I have a question regarding code generation in Pentaho Kettle.
Whenever we run a job in Pentaho Kettle, the ECL code is regenerated every time, regardless of any changes made to the job. My concern is that when I run the job a second time without any changes, it should not generate the ECL code again; code should be generated only when the job has changed (i.e., the same ECL code should not be generated again and again when there is no change/modification to the job). Is this possible?
sapthashree
 

Tue Apr 16, 2013 3:02 pm

Publishing ROXIE queries and additional ROXIE features are on our list of features we would like to implement, but we haven't gotten there yet.

Yes, the code regeneration has been considered; it actually takes very little time, but it is one item we have looked at to increase efficiency. One feature we have considered is allowing you to choose a workunit ID and resubmit it via Pentaho. There are some limits within the Spoon paradigm that we haven't yet addressed that prevent us from avoiding repetitive code generation.

You can call sub-jobs using Spoon. For more complex jobs, I usually break the work into smaller jobs and then have one job that just calls the smaller jobs.
joe.chambers
 
