Performance with Dfuserver

Topics related to recommendations or questions on the design for HPCC Systems clusters

Tue Feb 18, 2014 4:57 pm

Hi -
It seems like dfuserver takes a while at the very beginning to scan large files and identify the offsets for each node. Of course, this involves I/O and network time, so some delay is expected.

I'm wondering whether searching for terminators (and separators) inside quoted strings forces dfuserver to do a full scan of the file.
If so, does dfuserver have an option to indicate that quoted terminators do not exist in the incoming file? That would let dfuserver generate offsets in a more streamlined way: for a 10-node cluster, seek to 10% of the file, find the next terminator, record that offset, and repeat for each remaining boundary.

Thanks.
jwilt
 
Posts: 50
Joined: Wed Feb 27, 2013 7:46 pm
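
For readers following along, the seek-and-scan idea in the post above can be sketched in a few lines of Python. This is only an illustration of the approach described in the question, not dfuserver's code; the names `quick_offsets` and `next_record_start` are made up for the example, and it assumes that no terminator ever appears inside a quoted field.

```python
import os


def next_record_start(f, pos, size, terminator, chunk):
    """Return the offset just past the first terminator found at or after pos."""
    f.seek(pos)
    buf = b""
    base = pos  # file offset of buf[0]
    while True:
        block = f.read(chunk)
        if not block:
            return size  # no terminator before end of file
        buf += block
        idx = buf.find(terminator)
        if idx != -1:
            return base + idx + len(terminator)
        # Keep only the tail that could hold the start of a split terminator.
        keep = min(len(terminator) - 1, len(buf))
        base += len(buf) - keep
        buf = buf[len(buf) - keep:]


def quick_offsets(path, parts, terminator=b"\n", chunk=64 * 1024):
    """Start offset of each part. ASSUMPTION: terminators are never quoted."""
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, "rb") as f:
        for i in range(1, parts):
            # Jump to the i-th fraction of the file, never behind the last boundary.
            guess = max(size * i // parts, offsets[-1])
            offsets.append(next_record_start(f, guess, size, terminator, chunk))
    return offsets
```

Each boundary costs one seek plus a short forward read, so the work grows with the number of parts rather than with the file size. If a terminator can sit inside quotes, a boundary found this way may land in the middle of a record.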

Fri Sep 19, 2014 6:02 pm

Hi,

This was implemented in early May 2014, in HPCC 5.0.

Attila
AttilaV
 
Posts: 14
Joined: Fri Sep 19, 2014 3:55 pm

Thu May 28, 2015 8:51 pm

Bumping this thread...

I see in the source code for the DFU server that there is something called "QuickPartitioner", which seems like the implementation asked about in the OP. How do I take advantage of this? Is there an argument to STD.File.SprayVariable or something? A flag to dfuplus?
alex
 
Posts: 38
Joined: Wed Feb 25, 2015 4:06 pm

Fri May 29, 2015 2:36 am

Is the fix referred to above the "quotedTerminator" option in dfuplus?

The dfuplus usage statement shows:

spray options:
...
options for csv/delimited:
...
quotedTerminator=1|0 -- optional, default is 1 (quoted terminators in rows)

Thanks again.
jwilt
 
Posts: 50
Joined: Wed Feb 27, 2013 7:46 pm
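
As a rough illustration of why allowing quoted terminators tends to require reading from the start of the file, the Python sketch below tracks quote state byte by byte. It is a toy model, not how dfuserver or dfuplus is implemented; the function name `quote_aware_record_starts` is hypothetical.

```python
def quote_aware_record_starts(data, terminator=b"\n", quote=b'"'):
    """Record start offsets, ignoring terminators that fall inside quotes."""
    starts = [0]
    in_quotes = False
    i = 0
    while i < len(data):
        if data[i:i + 1] == quote:
            # A doubled quote ("") toggles twice, leaving the state unchanged.
            in_quotes = not in_quotes
            i += 1
        elif not in_quotes and data.startswith(terminator, i):
            i += len(terminator)
            if i < len(data):
                starts.append(i)
        else:
            i += 1
    return starts


# The first record contains a newline inside a quoted field.
sample = b'1,"a\nb"\n2,"c"\n'
true_starts = quote_aware_record_starts(sample)
# A naive search that starts mid-file cannot know it is inside quotes.
naive_start = sample.find(b"\n", 3) + 1
```

Here `true_starts` contains only real record boundaries, while `naive_start` points into the middle of the first record's quoted field. Whether a given newline is quoted depends on every quote character before it, which is why a partitioner that must honor quoted terminators cannot simply seek to an arbitrary position.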

