

First load


Thu Apr 20, 2017 8:10 pm

Dear HPCC Systems Team,

I would like some advice for that very first load of sizable data into the cluster.

1. I'm talking about a single 1TB file here. Would you recommend splitting that file before using any of the methods below?

2. Is any of the following methods better when loading big files?

a. DATASET(Std.File.ExternalLogicalFileName(LandingZone_IP, SomeFilePath), MyLayout, CSV)
b. Std.File.SprayDelimited( ... )

Is one better than the other? Any other method you would recommend?
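
For concreteness, here's roughly what I mean by each -- a minimal sketch, where the layout, the landing zone IP, the 'mythor' group name, and the file names are all made up:

  IMPORT Std;

  MyLayout := RECORD
      STRING20 name;
      STRING50 address;
  END;

  // (a) Read the file in place on the landing zone -- no spray step,
  //     but every read is funneled through that single landing zone box.
  lzData := DATASET(Std.File.ExternalLogicalFileName('10.0.0.1',
                                                     '/mnt/landing/bigfile.csv'),
                    MyLayout, CSV);

  // (b) Spray the file across the Thor nodes up front, then read the
  //     distributed logical file in every job after that.
  Std.File.SprayDelimited('10.0.0.1', '/mnt/landing/bigfile.csv',
                          ,,,,                 // default record size / separators / quote
                          'mythor', '~temp::bigfile_csv',
                          ,,, TRUE);           // allowoverwrite
  sprayed := DATASET('~temp::bigfile_csv', MyLayout, CSV);

  OUTPUT(CHOOSEN(sprayed, 100));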

3. I run analysis on certain datasets, then shut down the cluster, and some time later bring up another cluster, load data, run different analysis, etc.
The problem I'm facing right now is that every time I bring up a new cluster I need to load all that data again, and it takes a lot of time. I wish I could keep some of that data somewhere and "just restore it" (as in a simple Linux cp command vs. another spray) for the next cluster. Is that possible? Any recommendations here?


Thank you for your help!
Luke.
lpezet
 
Posts: 53
Joined: Wed Sep 10, 2014 3:14 am

Fri Apr 21, 2017 1:26 pm

Luke,

If you're using AWS for your clusters, they have a feature called "snapshots" that allows you to save the data from the cluster you bring down and automatically load it into the next cluster you bring up.

Other than that, I would simply suggest working with flat files instead of CSV, since they are generally more efficient.
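
For example (the layout and file names here are just placeholders), you could pay the CSV parsing cost exactly once and write the data back out as a flat THOR file, which every later job reads directly:

  IMPORT Std;

  MyLayout := RECORD
      STRING20 name;
      STRING50 address;
  END;

  // Read the sprayed CSV version once...
  csvDS := DATASET('~temp::bigfile_csv', MyLayout, CSV);

  // ...and write it back out as a fixed-layout THOR (flat) file.
  OUTPUT(csvDS, , '~temp::bigfile_flat', OVERWRITE);

  // Every subsequent job reads the flat file -- no CSV parsing at all.
  flatDS := DATASET('~temp::bigfile_flat', MyLayout, THOR);

  OUTPUT(CHOOSEN(flatDS, 10));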

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1369
Joined: Wed Oct 26, 2011 7:40 pm

Fri Apr 21, 2017 3:11 pm

Thank you Richard.
lpezet
 
Posts: 53
Joined: Wed Sep 10, 2014 3:14 am

Fri Apr 21, 2017 5:21 pm

When spraying, I've noticed dfuserver/dafilesrv taking a while (hours, if not days) before actually sending file content to the slave nodes on the cluster.
In the dfuserver logs I see "findSplitPoint( ... )" with some percentages before it gets to the run of "Transferring part...." messages.
Can anyone share details on how dfuserver/dafilesrv work when spraying a file?
Can I spray multiple files at the same time, or would the first spray block the rest (like a Thor job, if I'm not mistaken)?

Just thinking it would help me figure out the specs for my "spraying" node ;)
lpezet
 
Posts: 53
Joined: Wed Sep 10, 2014 3:14 am

Fri Apr 21, 2017 6:00 pm

Luke,

Spray is a "dumb" operation. Its only requirement is to get the data to the nodes as fast as possible while ensuring that a single record never spans multiple nodes -- each record must be whole and complete on a single node.

Let's say you have a 3GB file being sprayed to a 3-node cluster:
  • If you're spraying a fixed-length record flat file, then spray can just do the math to determine the exact size of each "chunk" to put on each node: node 1 gets the first GB, node 2 the second, and node 3 the third.
  • However, when spraying any of the variable-length record formats (CSV, XML, JSON ...) it has to actually scan around the 1GB mark to find a record delimiter before it can determine the exact size of the "chunk" going to node 1. Then it has to repeat that for each subsequent node ...
This is part of the reason I suggested using flat files instead of CSV.
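
To make that concrete (the sizes, IP, and names below are made up): with a fixed-length layout every record boundary is known by arithmetic alone, so there's nothing like the findSplitPoint scan you saw in your dfuserver logs:

  IMPORT Std;

  // Made-up fixed-length layout: every record is exactly 70 bytes (20 + 50).
  MyLayout := RECORD
      STRING20 name;
      STRING50 address;
  END;

  // Fixed spray: each node's "chunk" boundary lands on an exact
  // multiple of 70 bytes -- computed, never searched for.
  Std.File.SprayFixed('10.0.0.1', '/mnt/landing/bigfile.flat',
                      70,                  // record size in bytes
                      'mythor', '~temp::bigfile_fixed',
                      ,,, TRUE);           // allowoverwrite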

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1369
Joined: Wed Oct 26, 2011 7:40 pm

