

Spraying using DFUPlus

Comments or questions on structuring and organizing your data

Fri Aug 03, 2012 11:13 am

Hello,

I have a 3-node HPCC setup. The nodes and the processes running on each of them are shown below:

Code:
root@cloudx-767-700:~# service hpcc-init status
    mydafilesrv     ( pid    21286 ) is running...
    mydfuserver     ( pid     6558 ) is running...
    myeclagent      ( pid     6639 ) is running...
    myeclccserver   ( pid     6720 ) is running...
    myesp           ( pid     6800 ) is running...
    mysasha         ( pid     6883 ) is running...


Code:
root@cloudx-798-730:~# service hpcc-init status
    mydafilesrv     ( pid    30555 ) is running...
    myroxie         ( pid    31107 ) is running...


Code:
root@cloudx-799-731:~# sudo service hpcc-init status
    mydafilesrv     ( pid    10293 ) is running...
    mydali          ( pid    10856 ) is running...
    myeclscheduler  ( pid    10963 ) is running...
    mythor          ( pid    16028 ) is running...


I am trying to spray a file using DFUPlus on cloudx-767-700, whose IP address is 172.25.37.10:

Code:
root@cloudx-767-700:~# dfuplus action=spray srcip=172.25.37.10 srcfile=/var/lib/HPCCSystems/mydropzone/Emp.csv dstname=ankita::poc::dfuplus::sprayed dstcluster=mythor prefix=FILENAME,FILESIZE nosplit=1 server=http://172.25.37.10:8010 format=csv username=root password=newuser_123 overwrite=1 replicate=1
Checking for local Dali File Server

Spraying from /var/lib/HPCCSystems/mydropzone/Emp.csv on 172.25.37.10:7100 to ankita::poc::dfuplus::sprayed
Submitted WUID D20120803-215221
D20120803-215221 status: queued
D20120803-215221 Finished
Total time taken 1 secs


The CSV file is as follows:

Code:
Name,PsNo,BU,Designation,addr
Prachi,10602210,COE,SET,Vashi
Ankita,10602192,MFG3,SET,Powai-II


Under DFU Workunits->Browse, workunit D20120803-215221 is shown as finished.

When I desprayed the file using ECL Watch and DFU, the reconstructed file showed junk values, and it was larger than the original CSV that I uploaded.

My queries are:

1. In the above command, I have not specified a record length anywhere. Is this an issue? What if a huge data file, say 10 GB, is to be sprayed?
2. The original CSV is sprayed and desprayed correctly using ECL Watch. Am I missing any steps when using DFUPlus?
3. How does HPCC ensure that 'each single record is always whole and complete on a single node'? What if I upload a huge flat file whose structure I don't know? Or what if one 'record' runs over multiple lines of the file?

Thanks and regards!
Ankita Singla
 
Posts: 21
Joined: Tue Jul 24, 2012 7:02 am

Fri Aug 03, 2012 2:19 pm

Ankita,
1. In the above command, I have not specified record length anywhere - is this an issue? What if a huge data file, say of 10GB is to be sprayed?
Since you did not explicitly set the maxrecordsize option, it defaults to 8192. If you have records larger than 8192, then you must set the maxrecordsize option to whatever is appropriate (I have seen 10000000000 successfully used before).
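
If you are unsure whether any record exceeds that default, one way to check before spraying is to measure the longest record in the source file. Here is a minimal Python sketch, assuming records end with a newline and no quoted field contains an embedded newline. The function name and the in-memory sample are illustrative only and are not part of any HPCC tool:

```python
import io

# Sample data copied from the CSV shown in the question.
SAMPLE = (
    b"Name,PsNo,BU,Designation,addr\n"
    b"Prachi,10602210,COE,SET,Vashi\n"
    b"Ankita,10602192,MFG3,SET,Powai-II\n"
)


def longest_record_bytes(stream):
    """Return the byte length of the longest newline-terminated record.

    `stream` must be a binary file-like object. The count includes the
    line terminator, because the terminator is part of the stored record.
    """
    longest = 0
    for line in stream:  # binary streams iterate line by line
        longest = max(longest, len(line))
    return longest


if __name__ == "__main__":
    # For a real file, use: open("/path/to/file.csv", "rb")
    print(longest_record_bytes(io.BytesIO(SAMPLE)))
```

The result, plus some headroom, is a reasonable candidate value for maxrecordsize.
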
2. The original csv is getting sprayed and de-sprayed correctly using ECL Watch - am I missing any steps?
I just tried a CSV spray and despray, and my files were all correct in my test. Try it again with a different file, documenting exactly what you do each step along the way so that, if you still see anomalous results, you can accurately report your exact process (which would help in trying to figure out what could be going wrong).
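
As part of documenting each step, it can help to record the size and a checksum of both the original file and the desprayed file, so that any difference is stated precisely. Here is a minimal Python sketch; the function names are illustrative, and the paths you pass in would be whatever you used for the spray and the despray:

```python
import hashlib
import io


def fingerprint(stream, block_size=1 << 20):
    """Return (size_in_bytes, sha256_hex) for a binary file-like object."""
    size = 0
    digest = hashlib.sha256()
    # Read in blocks so that very large files do not need to fit in memory.
    for block in iter(lambda: stream.read(block_size), b""):
        digest.update(block)
        size += len(block)
    return size, digest.hexdigest()


def compare_files(original_path, desprayed_path):
    """Return the fingerprints of both files so they can be logged and compared."""
    with open(original_path, "rb") as original, open(desprayed_path, "rb") as desprayed:
        return fingerprint(original), fingerprint(desprayed)


if __name__ == "__main__":
    # In-memory demonstration; identical content gives identical fingerprints.
    print(fingerprint(io.BytesIO(b"a,b\n1,2\n")) == fingerprint(io.BytesIO(b"a,b\n1,2\n")))
```

If the sizes differ, the despray added or removed bytes; if only the checksums differ, the content was altered in place.
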
3. How does HPCC ensure that 'each single record is always whole and complete on a single node'?
In the case of CSV and XML files, by dividing the file using the record delimiters.
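
The idea can be sketched in Python. This only illustrates cutting a file at record delimiters; it is not the actual DFU server implementation, and the function name, the newline terminator, and the even-size target are assumptions made for the example:

```python
def split_on_record_boundaries(data, parts, terminator=b"\n"):
    """Cut `data` (bytes) into at most `parts` chunks of whole records.

    Each cut starts at a nominal even-size position and is moved forward
    to the end of the record that contains it, so no record is divided.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    target = max(1, len(data) // parts)  # nominal chunk size in bytes
    chunks = []
    start = 0
    while start < len(data) and len(chunks) < parts - 1:
        # Look for the first terminator at or after the nominal cut point.
        cut = data.find(terminator, start + target - 1)
        if cut == -1:
            break  # no more terminators; the rest goes into the last chunk
        end = cut + len(terminator)
        chunks.append(data[start:end])
        start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


if __name__ == "__main__":
    sample = (
        b"Name,PsNo,BU,Designation,addr\n"
        b"Prachi,10602210,COE,SET,Vashi\n"
        b"Ankita,10602192,MFG3,SET,Powai-II\n"
    )
    for chunk in split_on_record_boundaries(sample, 3):
        print(chunk)
```

Note that the chunks may be uneven, and there may be fewer chunks than requested, because whole records take priority over equal sizes.
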
What if I upload a huge flat file whose structure I don't know?
As you would need to do in any other data processing environment, you would first need to explore the file to determine its structure (or ask the data provider).
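
As a starting point for that kind of exploration, here is a minimal Python sketch that reads only the first lines of a file and counts how many fields each candidate separator would produce. The function name and the list of candidate separators are assumptions for the example, and the count is naive: it ignores quoting, so treat the result as a hint rather than a definitive answer:

```python
import io
from collections import Counter


def summarize_structure(stream, sample_lines=100, candidates=(",", "\t", "|", ";")):
    """For each candidate separator, tally the field counts seen per line.

    `stream` is a text file-like object. Only the first `sample_lines`
    lines are read, so this stays cheap even on a very large file.
    """
    field_counts = {sep: Counter() for sep in candidates}
    for number, line in enumerate(stream):
        if number >= sample_lines:
            break
        text = line.rstrip("\r\n")
        for sep in candidates:
            field_counts[sep][text.count(sep) + 1] += 1
    return field_counts


if __name__ == "__main__":
    sample = io.StringIO(
        "Name,PsNo,BU,Designation,addr\n"
        "Prachi,10602210,COE,SET,Vashi\n"
        "Ankita,10602192,MFG3,SET,Powai-II\n"
    )
    for sep, counts in summarize_structure(sample).items():
        print(repr(sep), dict(counts))
```

A separator that gives the same field count (greater than one) on every sampled line is a likely delimiter.
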
Or what if one 'record' runs over multiple lines of the file?
I believe I answered that here: http://hpccsystems.com/bb/viewtopic.php?f=8&t=473&sid=91a5b96a6eea55fbf263bf4f30a3b436

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1396
Joined: Wed Oct 26, 2011 7:40 pm

Tue Jan 27, 2015 8:36 am

One thing I noticed is that your code generates an INTEGER10 data type for your ISBN field -- you should modify it to make that a DECIMAL10 instead, since INTEGER10 is not a legal data type (the '10' portion defines the number of bytes the field occupies, not the number of digits in the number, so the range of valid values is only 1 through 8).
___________________

ahemd
ahmedvu153
 
Posts: 1
Joined: Tue Jan 27, 2015 8:34 am

