Sorting a child dataset
I have a file of approximately 6.5 billion records (and growing). Each record has 70+ fields plus a child dataset (a 4-field record occurring up to 18 times).
I can PROJECT the file into a shorter layout (10 fields and the child dataset) and then DISTRIBUTE it. I can then SORT the records so they fall into groups. However, if I attempt to SORT the child dataset, I get a SYSTEM 4 failure. When I look at the graph, the source read count has exploded to well over 30 billion (adding up to 18 reads per record) before THOR aborts the job.
Say the child dataset has a layout of:
{string4 code, unsigned4 cost1_limit, unsigned4 cost2_limit}
so say I have:
PREV.RECORD ChilDS[{'AA', 50000, 100000}, {'BB', 20000, 50000} ] and a
CURR.RECORD ChilDS[{'BB', 20000, 50000}, {'AA', 50000, 100000}]
The comparison PREV.RECORD.ChilDS = CURR.RECORD.ChilDS would not test as true. Even though the two child datasets contain the same values, their order is different. If both had been SORTed on code, the comparison would have tested true.
Given the size of the source file, is there a different way to SORT the contents of the child dataset?
- John Meier
- Posts: 20
- Joined: Wed Jun 29, 2016 7:45 pm
I have found a solution: the GROUP function.
I first PROJECT the 6.5+ billion records into the smaller layout, then DISTRIBUTE the data by the attributes that cluster related records together on the same node. I then do a LOCAL SORT and a GROUP. Now I can do a LOCAL PROJECT where the TRANSFORM sorts the child dataset. It finished in 6:10.004.
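For anyone reading later, the steps above might look roughly like the sketch below. This is not the actual job code: the child layout is the one quoted in the question, but SlimRec, acct_id, eff_date, the inline test data, and the attribute names are all made up for illustration, and the real DISTRIBUTE/SORT/GROUP fields would be whatever clusters the related records.

```ecl
// Child layout as quoted in the question.
ChildRec := RECORD
  STRING4   code;
  UNSIGNED4 cost1_limit;
  UNSIGNED4 cost2_limit;
END;

// Hypothetical slimmed-down parent layout (the real one has 10 fields).
SlimRec := RECORD
  UNSIGNED8 acct_id;   // assumed clustering key
  UNSIGNED4 eff_date;  // assumed ordering field within a cluster
  DATASET(ChildRec) ChilDS;
END;

// Tiny inline stand-in for the projected 6.5+ billion-record file.
slim := DATASET([
  {1, 20160101, [{'AA', 50000, 100000}, {'BB', 20000, 50000}]},
  {1, 20160201, [{'BB', 20000, 50000}, {'AA', 50000, 100000}]}
], SlimRec);

// Put related records on the same node, then sort and group locally.
dist    := DISTRIBUTE(slim, HASH32(acct_id));
sorted  := SORT(dist, acct_id, eff_date, LOCAL);
grouped := GROUP(sorted, acct_id, LOCAL);

// Sort each record's child dataset inside the TRANSFORM.
SlimRec SortKids(SlimRec L) := TRANSFORM
  SELF.ChilDS := SORT(L.ChilDS, code);
  SELF := L;
END;

normalized := PROJECT(grouped, SortKids(LEFT), LOCAL);
OUTPUT(normalized);
```

With the child rows in a consistent order by code, the two sample records carry identical ChilDS contents, so an order-sensitive comparison between a previous and a current record should no longer fail just because of row order.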
- John Meier
- Posts: 20
- Joined: Wed Jun 29, 2016 7:45 pm
John,
Glad you found a solution.
Richard
- rtaylor
- Community Advisory Board Member
- Posts: 1619
- Joined: Wed Oct 26, 2011 7:40 pm