Sorting a child dataset

Thu Dec 30, 2021 8:46 pm

I have a file of approximately 6.5 billion records (and growing). Each record has 70+ fields plus a child dataset (a 4-field record occurring up to 18 times).
I can PROJECT the file into a shorter layout (10 fields and the child dataset) and then DISTRIBUTE it. I can then SORT the records so that related records fall into groups. However, if I attempt to SORT the child dataset, I get a SYSTEM 4 failure. When I look at the graph, the source read count has exploded to well over 30 billion (counting up to 18 child reads per record) before THOR aborts the job.

Say the child dataset has a layout of:
{string4 code, unsigned4 cost1_limit, unsigned4 cost2_limit}
so say I have:
PREV.RECORD ChilDS[{'AA', 50000, 100000}, {'BB', 20000, 50000} ] and a
CURR.RECORD ChilDS[{'BB', 20000, 50000}, {'AA', 50000, 100000}]

The comparison PREV.RECORD.ChilDS = CURR.RECORD.ChilDS would not evaluate to true. Even though the two child datasets contain the same values, their order is different. If both had been SORTed on code, the comparison would have tested true.
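
To make the ordering problem concrete, here is a minimal ECL sketch using the two example child datasets above. The attribute names (ChildRec, prevKids, currKids) are made up for illustration, and it assumes, as the comparison above does, that child datasets can be compared directly with =.

```ecl
// Child layout as given in the post
ChildRec := RECORD
  STRING4   code;
  UNSIGNED4 cost1_limit;
  UNSIGNED4 cost2_limit;
END;

// Same rows, different order (hypothetical inline data from the example)
prevKids := DATASET([{'AA', 50000, 100000}, {'BB', 20000, 50000}], ChildRec);
currKids := DATASET([{'BB', 20000, 50000}, {'AA', 50000, 100000}], ChildRec);

// Assumption: = on child datasets is order-sensitive
OUTPUT(prevKids = currKids, NAMED('EqualAsStored'));

// Sorting both sides on code first puts the rows in the same order before comparing
OUTPUT(SORT(prevKids, code) = SORT(currKids, code), NAMED('EqualAfterSort'));
```

If two rows can share the same code, sorting on all three fields would give a fully deterministic order.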

Given the size of the source file, is there a different way to SORT the contents of the child dataset without the job aborting?
John Meier
Posts: 20
Joined: Wed Jun 29, 2016 7:45 pm

Mon Jan 03, 2022 3:26 pm

I have found a solution: the GROUP function.

I first PROJECT the 6.5+ billion records into the smaller layout, then DISTRIBUTE the data by the attributes that cluster related records together on the same node. I then do a LOCAL SORT and a GROUP. Now I can do a LOCAL PROJECT where the TRANSFORM sorts the child dataset. It finished in 6:10.004.
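
The steps above could be sketched roughly as follows. This is an illustration rather than the actual job: the SlimRec layout, the acct and seq fields, the inline data, and the choice of HASH32(acct) as the distribution key are all hypothetical stand-ins for the real 10-field layout and clustering attributes.

```ecl
ChildRec := RECORD
  STRING4   code;
  UNSIGNED4 cost1_limit;
  UNSIGNED4 cost2_limit;
END;

// Hypothetical slimmed-down parent layout (stands in for the 10-field PROJECT result)
SlimRec := RECORD
  STRING10  acct;   // hypothetical clustering attribute
  UNSIGNED4 seq;    // hypothetical ordering attribute
  DATASET(ChildRec) ChilDS;
END;

// Hypothetical inline data in place of the real file
slim := DATASET([
  {'A1', 1, [{'AA', 50000, 100000}, {'BB', 20000, 50000}]},
  {'A1', 2, [{'BB', 20000, 50000}, {'AA', 50000, 100000}]}
], SlimRec);

// 1. Put related records on the same node
dist := DISTRIBUTE(slim, HASH32(acct));

// 2. Sort and group locally, with no cross-node data movement
srt := SORT(dist, acct, seq, LOCAL);
grp := GROUP(srt, acct, LOCAL);

// 3. Sort each record's child dataset inside the TRANSFORM
SlimRec SortKids(SlimRec L) := TRANSFORM
  SELF.ChilDS := SORT(L.ChilDS, code);
  SELF := L;
END;

res := PROJECT(grp, SortKids(LEFT), LOCAL);

OUTPUT(UNGROUP(res));
```

The UNGROUP at the end is only there so the output is a plain dataset; a job that goes on to compare records within each group would keep the grouping.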
John Meier

Tue Jan 04, 2022 6:42 pm

Glad you found a solution. :)

Community Advisory Board Member
Posts: 1619
Joined: Wed Oct 26, 2011 7:40 pm
