Splitting Files with ECL
Is there a way of splitting one large CSV file into several parts before the output?
- mo0926
- Posts: 18
- Joined: Thu Jan 09, 2020 3:57 pm
Hello mo0926,
The first thing that came to my mind was to use the LOOP function to iterate across the CSV dataset and perform a loopbody operation. Have you already tried that?
Maybe if you want to share more details on what you are trying to accomplish (a code example for instance), we can think on some other possibilities as well.
HTH
hwatanuki
The first thing that came to my mind was to use the LOOP function to iterate across the CSV dataset and perform a loopbody operation. Have you already tried that?
Maybe if you want to share more details on what you are trying to accomplish (a code example for instance), we can think on some other possibilities as well.
HTH
hwatanuki
- hwatanuki
- Posts: 28
- Joined: Mon Apr 15, 2019 1:22 am
mo0926,
If you want all records with different values of some field(s), then simple filters for those values and a separate OUTPUT for each will do it.
If you just want to "evenly" split the number of records into separate files, then either the SAMPLE() or ENTH() function will do that, with a separate OUTPUT for each.
If what you want is a single OUTPUT but specific records "assigned" to each file part written to your nodes, then you can use the GROUP() function and OUTPUT that GROUPed recordset.
So, what exactly are you trying to accomplish?
HTH,
Richard
There are several fairly simple ways to do that, depending on how you want them split.Is there a way of splitting one large CSV file into several parts before the output?
If you want all records with different values of some field(s), then simple filters for those values and a separate OUTPUT for each will do it.
If you just want to "evenly" split the number of records into separate files, then either the SAMPLE() or ENTH() function will do that, with a separate OUTPUT for each.
If what you want is a single OUTPUT but specific records "assigned" to each file part written to your nodes, then you can use the GROUP() function and OUTPUT that GROUPed recordset.
So, what exactly are you trying to accomplish?
HTH,
Richard
- rtaylor
- Community Advisory Board Member
- Posts: 1619
- Joined: Wed Oct 26, 2011 7:40 pm
Hi Richard, I was trying to evenly split one large file into two smaller files, and write them as such later. I think the option of using the enth function might me helpful here. However, the only problem I see with this solution is that I cannot be sure that some records won't be repeated in both final datasets.
- mo0926
- Posts: 18
- Joined: Thu Jan 09, 2020 3:57 pm
Thank you for your help. I found this solution below that works too:
I round in case the total number of records is an odd number.
totalrecs := round(count(address_file));
halfoftotal := totalrecs / 2;
addressHalf1 := choosen(address_file, halfoftotal );// first half records
addressHalf2 := choosen(address_file, halfoftotal, halfoftotal ); // second half
I round in case the total number of records is an odd number.
totalrecs := round(count(address_file));
halfoftotal := totalrecs / 2;
addressHalf1 := choosen(address_file, halfoftotal );// first half records
addressHalf2 := choosen(address_file, halfoftotal, halfoftotal ); // second half
- mo0926
- Posts: 18
- Joined: Thu Jan 09, 2020 3:57 pm
mo0926,
I think you need to tweak that code a bit.
Run this example:
Of course, assuming you're on a cluster with more than one node, you'll also want to add DISTRIBUTE() to each OUTPUT to disk so each file part has records (otherwise the file parts for the first half will be empty on the second half's nodes, and vice versa).
HTH,
Richard
I think you need to tweak that code a bit.

Run this example:
- Code: Select all
Address_File := DATASET([{1},{2},{3},{4},{5}],{UNSIGNED1 r});
halfoftotal := COUNT(address_file) DIV 2;
addressHalf1 := choosen(address_file, halfoftotal );// first half records
addressHalf2 := choosen(address_file, halfoftotal, halfoftotal ); // second half
addressHalf1; // returns 1,2
addressHalf2; // returns 2,3
//my version:
Half1 := CHOOSEN(address_file, halfoftotal );
Half2 := CHOOSEN(address_file, halfoftotal+1, halfoftotal+1 );
Half1; // returns 1,2
Half2; // returns 3,4,5
Of course, assuming you're on a cluster with more than one node, you'll also want to add DISTRIBUTE() to each OUTPUT to disk so each file part has records (otherwise the file parts for the first half will be empty on the second half's nodes, and vice versa).
HTH,
Richard
- rtaylor
- Community Advisory Board Member
- Posts: 1619
- Joined: Wed Oct 26, 2011 7:40 pm
Nice!! Yeah, I noticed I was getting that one last record in the beginning of the next dataset. In the documentation for choosen it shows "NextFive := CHOOSEN(Person,5,6); // returns next 5 recs from Person" that number six should have meant a +1 for this. Thank you for clarifying that.
- mo0926
- Posts: 18
- Joined: Thu Jan 09, 2020 3:57 pm
mo0926,
And here's an even easier way to do it:
HTH,
Richard
And here's an even easier way to do it:
- Code: Select all
Half1 := address_file[1 .. halfoftotal];
Half2 := address_file[halfoftotal+1 ..];
Half1;
Half2;
HTH,
Richard
- rtaylor
- Community Advisory Board Member
- Posts: 1619
- Joined: Wed Oct 26, 2011 7:40 pm
Thanks for the advice! I've ended up with tons of .csv files after the data gathering part of an investment simulation in this Budapest real estate was over, and had to split them to make them easier to process by different members of my team, as having to deal with a larger file that contains unneeded information was difficult and ineffective.
And now, I've managed to divide them thanks to your code, and my job has become much simpler.
And now, I've managed to divide them thanks to your code, and my job has become much simpler.
- McPP82
- Posts: 6
- Joined: Wed Sep 30, 2020 5:02 pm
10 posts
• Page 1 of 1
Who is online
Users browsing this forum: No registered users and 1 guest