Mon Jun 14, 2021 1:58 am
Login Register Lost Password? Contact Us


Splitting Files with ECL

Post questions or comments on how best to manage your big data problem

Thu Oct 22, 2020 10:05 pm Change Time Zone

Is there a way of splitting one large CSV file into several parts before the output?
mo0926
 
Posts: 16
Joined: Thu Jan 09, 2020 3:57 pm

Fri Oct 23, 2020 12:55 pm Change Time Zone

Hello mo0926,

The first thing that came to my mind was to use the LOOP function to iterate across the CSV dataset and perform a loopbody operation. Have you already tried that?
Maybe if you want to share more details on what you are trying to accomplish (a code example for instance), we can think on some other possibilities as well.

HTH
hwatanuki
hwatanuki
 
Posts: 28
Joined: Mon Apr 15, 2019 1:22 am

Mon Oct 26, 2020 11:57 am Change Time Zone

mo0926,
Is there a way of splitting one large CSV file into several parts before the output?
There are several fairly simple ways to do that, depending on how you want them split.

If you want all records with different values of some field(s), then simple filters for those values and a separate OUTPUT for each will do it.

If you just want to "evenly" split the number of records into separate files, then either the SAMPLE() or ENTH() function will do that, with a separate OUTPUT for each.

If what you want is a single OUTPUT but specific records "assigned" to each file part written to your nodes, then you can use the GROUP() function and OUTPUT that GROUPed recordset.

So, what exactly are you trying to accomplish?

HTH,

Richard
rtaylor
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1582
Joined: Wed Oct 26, 2011 7:40 pm

Mon Oct 26, 2020 2:16 pm Change Time Zone

Hi Richard, I was trying to evenly split one large file into two smaller files, and write them as such later. I think the option of using the enth function might me helpful here. However, the only problem I see with this solution is that I cannot be sure that some records won't be repeated in both final datasets.
mo0926
 
Posts: 16
Joined: Thu Jan 09, 2020 3:57 pm

Mon Oct 26, 2020 2:28 pm Change Time Zone

SAMPLE should be more exact for that purpose.
Tony Kirk
 
Posts: 17
Joined: Thu Jun 23, 2011 5:01 pm

Mon Oct 26, 2020 8:48 pm Change Time Zone

Thank you for your help. I found this solution below that works too:


I round in case the total number of records is an odd number.

totalrecs := round(count(address_file));

halfoftotal := totalrecs / 2;

addressHalf1 := choosen(address_file, halfoftotal );// first half records
addressHalf2 := choosen(address_file, halfoftotal, halfoftotal ); // second half
mo0926
 
Posts: 16
Joined: Thu Jan 09, 2020 3:57 pm

Tue Oct 27, 2020 12:44 pm Change Time Zone

mo0926,

I think you need to tweak that code a bit. :)
Run this example:
Code: Select all
Address_File := DATASET([{1},{2},{3},{4},{5}],{UNSIGNED1 r});
halfoftotal  := COUNT(address_file) DIV 2;

addressHalf1 := choosen(address_file, halfoftotal );// first half records
addressHalf2 := choosen(address_file, halfoftotal, halfoftotal ); // second half
addressHalf1;   // returns 1,2
addressHalf2;   // returns 2,3

//my version:
Half1 := CHOOSEN(address_file, halfoftotal );
Half2 := CHOOSEN(address_file, halfoftotal+1, halfoftotal+1 );
Half1;          // returns 1,2
Half2;          // returns 3,4,5
As you see, your code gets records 1,2 in the "first half" and 2,3 as the "second half" while my rewrite gets records 1,2 in the "first half" and 3,4,5 as the "second half." Also note that I removed your ROUND function and replaced it with the DIV operator (integer division) to handle the odd number issue.

Of course, assuming you're on a cluster with more than one node, you'll also want to add DISTRIBUTE() to each OUTPUT to disk so each file part has records (otherwise the file parts for the first half will be empty on the second half's nodes, and vice versa).

HTH,

Richard
rtaylor
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1582
Joined: Wed Oct 26, 2011 7:40 pm

Tue Oct 27, 2020 1:18 pm Change Time Zone

Nice!! Yeah, I noticed I was getting that one last record in the beginning of the next dataset. In the documentation for choosen it shows "NextFive := CHOOSEN(Person,5,6); // returns next 5 recs from Person" that number six should have meant a +1 for this. Thank you for clarifying that.
mo0926
 
Posts: 16
Joined: Thu Jan 09, 2020 3:57 pm

Tue Oct 27, 2020 2:43 pm Change Time Zone

mo0926,

And here's an even easier way to do it:
Code: Select all
Half1 := address_file[1 .. halfoftotal];
Half2 := address_file[halfoftotal+1 ..];
Half1;
Half2;


HTH,

Richard
rtaylor
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1582
Joined: Wed Oct 26, 2011 7:40 pm

Tue Oct 27, 2020 3:26 pm Change Time Zone

Thanks for the advice! I've ended up with tons of .csv files after the data gathering part of an investment simulation in this Budapest real estate was over, and had to split them to make them easier to process by different members of my team, as having to deal with a larger file that contains unneeded information was difficult and ineffective.

And now, I've managed to divide them thanks to your code, and my job has become much simpler.
McPP82
 
Posts: 6
Joined: Wed Sep 30, 2020 5:02 pm


Return to Managing Big Data

Who is online

Users browsing this forum: No registered users and 1 guest

cron