

Use PIPE to unzip file.

Comments and questions related to the Enterprise Control Language

Thu Jun 05, 2014 5:10 pm

I have the following code that should use pipe to unzip a .gz file:

Code:
rec := RECORD
   DATA D;
END;
filename:='~file::10.239.40.5::var::lib::^H^P^C^C^Systems::dropzone::head100_1025b_election_retweets.csv.gz';
higgsDS :=DATASET(filename,rec,FLAT);
unzipped_higgsDS := PIPE(higgsDS,'gzip',OUTPUT(CSV));


I'm unsure whether my command, 'gzip', is in the correct form. I also don't know whether gzip is available on the cluster I'm using, i.e. thor on the Machine Learning Dev Cluster, 10.239.40.2.
tlhumphrey2
 
Posts: 242
Joined: Mon May 07, 2012 6:23 pm

Thu Jun 05, 2014 8:25 pm

Tim,

I would use the PIPE option on the DATASET declaration (an input pipe) instead of the PIPE function (a through pipe), since the file on disk is zipped. Since the file seems to be on the dropzone, the PIPE option on DATASET should unzip it (assuming the gzip program is available to use) as it reads the file.

Your RECORD structure, however, should reflect the structure of the unzipped records, and not just define a single blob field. So I would try doing it something like this:
Code:
rec := RECORD
   //DATA D; //put your real field definitions in here
END;
filename:='~file::10.239.40.5::var::lib::^H^P^C^C^Systems::dropzone::head100_1025b_election_retweets.csv.gz';
higgsDS :=DATASET(filename,rec,PIPE('gzip',CSV));
OUTPUT(higgsDS);

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1353
Joined: Wed Oct 26, 2011 7:40 pm

Fri Jun 06, 2014 1:58 pm

Here is what my new code looks like:

Code:
rec := RECORD
  STRING field1;
  STRING field2;
  STRING field3;
  STRING field4;
END;
filename:='~file::10.239.40.5::var::lib::^H^P^C^C^Systems::dropzone::head100_1025b_election_retweets.csv.gz';
higgsDS :=DATASET(filename,rec,PIPE('gzip',CSV));
OUTPUT(higgsDS);


But, I'm getting the following error message: Error: System error: -1: Graph[1], diskread[2]: SLAVE 10.239.40.6:20100: CFileSerialStream::get read past end of stream, CFileSerialStream::get read past end of stream - handling file: //10.239.40.5:7100/var/lib/HPCCSystems/dropzone/head100_1025b_election_retweets.csv.gz

Workunit on 10.239.40.2 is W20140606-094758.

By the way, this dataset only has 100 lines (records).
tlhumphrey2
 

Fri Jun 06, 2014 2:29 pm

Tim,

This error usually indicates it's not finding any record delimiters (which it wouldn't be likely to do in a compressed binary file) so it sounds to me like the unzip is not happening.

First, you should verify that gzip is present and capable of unzipping the file.

If it is present, it's possible that it's not getting the proper command line parameters to do the unzip. I would test that first by doing a command line unzip yourself to discover what it takes to make that work, and then take what you learn from that and apply it to the PIPE option.
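
A minimal sketch of that command-line check, run on the node that holds the file. The sample file and working directory are hypothetical stand-ins for the real .csv.gz on the dropzone; only standard gzip options (-t, -d, -c) are used.

```shell
# Hypothetical working directory; the real test would point at the dropzone file.
workdir=$(mktemp -d)

# 1. Is gzip installed and on the PATH?
if command -v gzip >/dev/null 2>&1; then gzip_found=yes; else gzip_found=no; fi
echo "gzip found: $gzip_found"

# Build a small sample archive to test with (gzip replaces sample.csv with sample.csv.gz).
printf 'a,b,c\n1,2,3\n' > "$workdir/sample.csv"
gzip "$workdir/sample.csv"

# 2. Is the archive intact? -t checks it without writing any output.
gzip -t "$workdir/sample.csv.gz" && echo "archive OK"

# 3. Decompress to standard output: -d decompresses, -c writes to stdout and keeps the .gz.
unzipped=$(gzip -dc "$workdir/sample.csv.gz")
printf '%s\n' "$unzipped"
```

Whichever invocation prints the plain CSV here is the string to try in the PIPE option. Note that gzip with no options compresses its input rather than decompressing it, so a flag such as -d is presumably needed.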

HTH,

Richard
rtaylor

Tue Apr 10, 2018 5:31 pm

Can someone point us to a working example of this type of thing?

For example: a gz/tar/zip file sprayed to the cluster, then read and passed to a PIPE command that decompresses it via Linux commands?

I'm sure working examples exist; I just haven't found them.
Thanks.
jwilt
 
Posts: 50
Joined: Wed Feb 27, 2013 7:46 pm

Wed Apr 11, 2018 5:20 pm

The following isn't my original problem, but it is very similar, and it works. The code below unzips a file I have on the dropzone, and the unzipped file's content follows it. There is one oddity: I expected the first OUTPUT statement to write the contents of the unzipped file to the workunit, but it outputs only the column headers. So I added a second OUTPUT statement, which does output the contents of the unzipped file.

Code:
rec := RECORD
  UNSIGNED Elevation;
  UNSIGNED Aspect;
  UNSIGNED Slope;
END;
UnzipFilename := '~file::10.18.9.212::var::lib::^H^P^C^C^Systems::mydropzone::myfile_head3.csv'; //the uncompressed file
ZipFilename  := '/var/lib/HPCCSystems/mydropzone/myfile_head3.csv.gz'; //the zipped container
UnzipCommand := 'gunzip ' + ZipFilename; //the command line that produces the uncompressed file
unzippedDS :=DATASET(UnzipFilename,rec,PIPE(UnzipCommand,CSV));
OUTPUT(unzippedDS,NAMED('unzippedDS'));
OUTPUT(DATASET(UnzipFilename,rec,CSV(HEADING(1),SEPARATOR(','),TERMINATOR(['\n','\r\n','\n\r']))),NAMED('the_unzipped_dataset'));


Here is the unzipped file's content:
Code:
Elevation,Aspect,Slope
2596,51,3
2590,56,2
2804,139,9
2785,155,18
2595,45,2
2579,132,6
2606,45,7
2605,49,4
2617,45,9
2612,59,10
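
One possible explanation for the oddity with the first OUTPUT above (an assumption, not something confirmed in this thread): when gunzip is given a file name, it decompresses in place, replacing the .gz with the uncompressed file, and writes nothing to standard output. A pipe that reads the command's standard output would then see no records, while the second OUTPUT reads the .csv that gunzip just wrote. The sketch below shows that gunzip behavior on a hypothetical local copy of the file; the directory is a stand-in.

```shell
# Hypothetical local stand-in for the dropzone file, with the first rows from the post.
workdir=$(mktemp -d)
printf 'Elevation,Aspect,Slope\n2596,51,3\n' > "$workdir/myfile_head3.csv"
gzip "$workdir/myfile_head3.csv"        # leaves only myfile_head3.csv.gz

# Plain gunzip: decompresses in place and prints nothing on stdout.
in_place_stdout=$(gunzip "$workdir/myfile_head3.csv.gz")
echo "characters on stdout from plain gunzip: ${#in_place_stdout}"
ls "$workdir"                           # only myfile_head3.csv remains

# Recompress, then use -c: the data goes to stdout and the .gz is kept.
gzip "$workdir/myfile_head3.csv"
to_stdout=$(gunzip -c "$workdir/myfile_head3.csv.gz")
printf '%s\n' "$to_stdout"
```

If this is indeed the cause, using gunzip -c (or gzip -dc) in the command string might let the first OUTPUT return the rows directly, and the job could be rerun without the .gz having disappeared.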
tlhumphrey2
 

Sun Apr 15, 2018 11:32 pm

A couple examples of a slightly different approach...

Code:
rec := RECORD
  UNSIGNED Elevation;
  UNSIGNED Aspect;
  UNSIGNED Slope;
END;
ZipFilename := '10.18.9.212:/var/lib/HPCCSystems/mydropzone/myfile_head3.csv.gz'; //the zipped container
tempf := '/tmp/myFile' + WORKUNIT + '.gz';
UnzipCommandRaw := '' +
  'scp ' + ZipFileName + ' ' + tempf + ' 2>&1;' +
  'gzip -d --stdout ' + tempf + ' 2>&1; ' +
  'rm -rf ' + tempf + ';' +
  '';

UnzipCommand := 'bash -c \'' + UnzipCommandRaw + '\'';
unzippedDS := PIPE(UnzipCommand, rec, CSV(TERMINATOR(['\n','\n\r','\r\n']), SEPARATOR(','), QUOTE('"')));

// Save the unzippedDS...


...this one runs on the web at play.hpccsystems.com:8010:

Code:
rec := RECORD
  UNICODE YearofBirth;
  UNICODE Gender;
  UNICODE Ethnicity;
  UNICODE ChildsFirstName;
  UNICODE Count;
  UNICODE Rank;
END;

ZipFilename := '10.0.0.208:/var/lib/HPCCSystems/mydropzone/Most_Popular_Baby_Names_by_Sex_and_Mother_s_Ethnic_Group__New_York_City.csv.gz'; //the zipped container
tempf := '/tmp/myFile_' + WORKUNIT + '.gz';

// Copy the file to a local temp file, unzip it to STDOUT, remove the temp file
UnzipCommandraw := '' +
  'scp ' + ZipFilename + ' ' + tempf + ' 2>&1;' +
  'gzip -d --stdout ' + tempf + ' 2>&1;' +
  'rm -rf ' + tempf + ' 2>&1;' +
  '';

// Wrap the raw command script in a bash -c invocation
UnzipCommand := 'bash -c \'' + UnzipCommandraw + '\'';

unzippedDS := PIPE(UnzipCommand, rec,
  CSV(HEADING(1),
  TERMINATOR(['\n','\n\r','\r\n']),
  SEPARATOR(','),
  QUOTE('"')));
OUTPUT(unzippedDS,NAMED('unzippedDS'));
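
The command string these examples build can be tried outside ECL first. The sketch below runs the same three steps (copy, decompress to stdout, remove the temp file) through bash -c. To keep it runnable on one machine, cp stands in for scp, and every path and file name is hypothetical.

```shell
# Local stand-ins: cp replaces scp, and all paths below are hypothetical.
workdir=$(mktemp -d)
src="$workdir/source.csv.gz"
tempf="$workdir/myFile_W00000000-000000.gz"   # the thread builds this name from WORKUNIT
printf 'x,y\n1,2\n' | gzip -c > "$src"

# The same three steps as UnzipCommandraw, joined with ';' and run via bash -c.
# Unlike the thread's version, stderr is not merged into stdout here.
raw="cp $src $tempf; gzip -d --stdout $tempf; rm -f $tempf"
piped=$(bash -c "$raw")
printf '%s\n' "$piped"
```

Only the decompressed CSV reaches standard output, which is the stream PIPE parses into records. The thread's version appends 2>&1 to each step, so any error text from scp or gzip would land in that same stream; that can help with debugging but may show up as stray rows.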
jwilt
 

