

Skew and Distribution

Questions around writing code and queries

Fri May 15, 2020 3:00 pm

Hi All,

I'm having a few issues getting the most out of our platform due to skew, and I'm hoping someone can give me a few pointers. Having read various forum posts and blogs on DISTRIBUTE and joins, I've taken one of our main graphs down from 5 minutes to 1 minute. However, the workunit analyser is telling me I still have skew issues, mainly around write times to various PERSISTs.

I have attached two screenshots of a graph where the workunit analyser reports skew in writing. What confuses me is that the left side of the graph looks well distributed, and I'm doing a lookup join from the right (I've tried a LOCAL join from the right, but this causes further skew and slows the process).

I've distributed the data on its primary key, and also on the UserId, which is actually more beneficial later on but slows the graph down significantly. I've also tried a basic distribute with no hash, which gives worse skew.

I'm confused about where I should be looking to see the skew: the main graph, or the hover data, which tells a completely different picture? Even if I understood the hover data, I'm not sure what else I could distribute the data on to get better performance.
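For reference, the three distribution attempts described above would look something like this in ECL (the dataset and field names here are illustrative, not taken from the actual graph):

```ecl
// Three ways to redistribute a dataset across the cluster.
// 'ds', 'ID', and 'UserId' are hypothetical names.
byPK   := DISTRIBUTE(ds, HASH32(ds.ID));      // unique key: near-perfect spread
byUser := DISTRIBUTE(ds, HASH32(ds.UserId));  // skews when some users have many rows,
                                              // but enables LOCAL work on UserId later
byRand := DISTRIBUTE(ds, RANDOM());           // even spread, but no useful locality
```

HASH32 on a unique key gives the most even spread; hashing a key with repeated values puts all rows for one value on a single node, which is exactly what makes later LOCAL operations possible and also what causes skew.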

Significant skew in child records causes uneven disk write time (w5:graph3:sg15:a19)

Attachments:
Screenshot 2020-05-15 at 15.39.23.png (Graph)
Screenshot 2020-05-15 at 15.45.30.png (Hover Data)
David Dasher
 
Posts: 56
Joined: Tue Feb 18, 2014 9:17 am

Mon May 18, 2020 5:02 pm

David,
Significant skew in child records causes uneven disk write time (w5:graph3:sg15:a19)
Note the word "child" in that message; that's what I think might be the problem. It suggests to me that you're working with a nested child dataset.

If that's the case, then it's very possible that the skew numbers shown in the graph are calculated using only parent record counts, while the workunit analyzer does a more complex analysis using the actual amount of data in the nested child datasets. IFF that's the case, that would explain the discrepancy between the two.

But, if you're not using nested child datasets, then we'll have to explore further. Either way, this issue is worth submitting a JIRA ticket so the developers can decide which is right(er) -- the WU Analyzer or the graph.

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1535
Joined: Wed Oct 26, 2011 7:40 pm

Mon May 18, 2020 6:56 pm

Hi Richard

I can confirm that I am not using a nested child dataset. I will submit a JIRA. One thing I have noticed is that the workunit analyser shows more suggestions when I'm doing a lookup join than it does for other join types.

Speed-wise, I have it at about 38 seconds for a LEFT ONLY lookup join with 1.8 billion rows on the left and 1.2 million on the right, on a 144-node cluster. To be fair, I'm pretty happy with that, but if I can squeeze out more, then cool.
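For context, a LEFT ONLY lookup join of that shape would look roughly like this (dataset and key names are illustrative):

```ecl
// LOOKUP copies the small right-hand dataset (1.2M rows) to every node,
// so the 1.8B-row left side never needs redistributing for the join itself.
// LEFT ONLY keeps just the left records that have no match on the right.
nonMatches := JOIN(bigLeft, smallRight,
                   LEFT.UserId = RIGHT.UserId,
                   TRANSFORM(LEFT),   // pass the left record through unchanged
                   LOOKUP, LEFT ONLY);
```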

Thanks for replying

David

Mon May 18, 2020 7:20 pm

David,
a left only lookup join
OK, that makes a difference. The skew numbers you're seeing in the graph are on the input records to the lookup join, while the analyzer warning is on the disk write (where no skew numbers are shown in the graph). So the skewed data is the result of the LEFT ONLY JOIN (as in, just the non-matches).

I don't see any way of predicting or controlling that beforehand. You could try wrapping that JOIN in a DISTRIBUTE so the PERSIST is written on the redistributed result, but if that doesn't help, I wouldn't worry too much about it.

HTH,

Richard

Mon May 18, 2020 7:35 pm

Thanks, Richard, that makes sense.

One further question from your example then if you don't mind.

To get the best (near-perfect) distribution, I'm doing a HASH32 of that table's ID before it even gets there, so it persists the distribution after the left lookup join. However, if I distribute on the ClientId (which can have many rows per client), the skew goes to about +10/-10 and performance drops to about 1 min 30 seconds instead of 38 seconds. Further down the code I'm doing some other joins and dedups which, because of the large row counts, would be better done locally on the ClientId, but then I'm always stuck with the worse distribution and slower overall performance. So it's actually been better to just do lookup joins and distribute where necessary.

Would you say you should distribute for the best distribution, or for the best logical distribution according to how you're going to use the data further down the line, even if the performance is slightly worse? Or does speed trump everything else? (In this case it does for me, but it just feels messy.)
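To make the trade-off concrete, here is a rough ECL sketch of the two options (dataset and field names are illustrative):

```ecl
// Option A: distribute on the unique ID. Near-perfect spread and fast
// PERSIST writes, but later per-client work needs a global (non-LOCAL) sort.
byID := DISTRIBUTE(ds, HASH32(ds.ID));

// Option B: distribute on ClientId. Skewed (some clients have many rows),
// but every row for a client lands on one node, so later dedups and
// joins on ClientId can run LOCAL with no re-shuffle.
byClient  := DISTRIBUTE(ds, HASH32(ds.ClientId));
perClient := DEDUP(SORT(byClient, ClientId, LOCAL), ClientId, LOCAL);
```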

Thanks

David

Tue May 19, 2020 11:42 am

David,

Speed is almost always paramount, and a ~3x difference (38 vs 90 seconds) is always worth it, so I would keep the speed in this circumstance. But if later code needs a different distribution for optimal performance, then you should do that, too.

IOW, instead of this:

JOIN(...,LOOKUP,LEFT ONLY)

try this:

DISTRIBUTE(JOIN(...,LOOKUP,LEFT ONLY),HASH32(ClientID))

so that your PERSIST writes the data in a less skewed manner that's already set up for your later operation.

HTH,

Richard

Tue May 19, 2020 12:12 pm

Thanks Richard

D

