



How to optimize elapsed time to read CSV File - Amazon AWS

Questions or comments related to Cloud Computing and the HPCC Systems Instant Cloud for AWS

Sat Oct 27, 2012 4:38 pm

I am running a proof of concept on Amazon (One-Click Thor / 10 Thor nodes / no Roxie nodes) that reads CSV files of varying sizes and then writes them back to disk.

Sample code as follows:

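Code:
SomeRecordStruct := RECORD
  INTEGER ID;
  // 50 fields, each holding an XML snippet of varying size
  STRING XMLSnippet1;
  STRING XMLSnippet2;
  // ... (remaining fields elided in the original post) ...
  STRING XMLSnippet50;
END;

// Read the CSV logical file from the Thor cluster
Some_Recordset := DATASET('~thor::db::Sample.CSV', SomeRecordStruct, CSV);

// Write it back to disk as CSV, replacing any existing output file
OUTPUT(Some_Recordset, , '~thor::db::output_test.CSV', CSV, OVERWRITE);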


The logical file is evenly distributed across the 10 nodes. Typical elapsed times are:

Size of logical file / Elapsed time for dataset read:

> 60 GB / 7 minutes
> 200 GB / 20 minutes
> 320 GB / 35 minutes
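
As a rough sanity check, the timings above can be converted to throughput. This is only a sketch: it assumes 1 GB = 1000 MB and that all 10 nodes share the read evenly, and the names TimingRec, ThroughputRec, CalcThroughput and NodeCount are made up for illustration.

```ecl
// Timings reported above: logical file size (GB) and elapsed read time (minutes).
TimingRec := RECORD
  UNSIGNED4 SizeGB;
  UNSIGNED4 ElapsedMinutes;
END;

Timings := DATASET([{60, 7}, {200, 20}, {320, 35}], TimingRec);

// Assumption: 10 Thor nodes, as in the cluster described above.
NodeCount := 10;

ThroughputRec := RECORD(TimingRec)
  REAL8 ClusterMBPerSec;  // aggregate rate across the whole cluster
  REAL8 NodeMBPerSec;     // average rate per node
END;

ThroughputRec CalcThroughput(TimingRec L) := TRANSFORM
  // Assumption: 1 GB = 1000 MB. In ECL, "/" is real (not integer) division.
  SELF.ClusterMBPerSec := (L.SizeGB * 1000) / (L.ElapsedMinutes * 60);
  SELF.NodeMBPerSec := (L.SizeGB * 1000) / (L.ElapsedMinutes * 60 * NodeCount);
  SELF := L;
END;

OUTPUT(PROJECT(Timings, CalcThroughput(LEFT)));
```

Under those assumptions this works out to roughly 143, 167 and 152 MB/s across the cluster, or about 14 to 17 MB/s per node, so the read rate stays roughly flat as the file grows.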

I also ran the same test on an extra-large, EBS-optimized instance, but the results were not appreciably better. I can post them here as a follow-up.

1. Are the elapsed times above typical for reading files of these sizes?
2. Are there ways to speed this up on the AWS One-Click Thor (default large instance)?
3. Is this due to the inherent I/O and network throughput characteristics of Amazon AWS? (I have not yet run the same test on a dedicated environment for comparison.)

If the experts could shed light on the above, it would be very helpful.

Thanks
Arun
arunarav
 

Tue Oct 30, 2012 12:40 pm

Hi Arun,

I'm curious: have you looked at the timings in ECL Watch and tried to identify which process eats up the time?

If this is simply an OUTPUT of a file to the cluster, I'm not sure that anything can be done, and I think point 3 is the reason, as you stated.

I will throw this question to our "Instant Cloud" team and post back if I have additional information.

Regards,

Bob
bforeman
Community Advisory Board Member

Tue Oct 30, 2012 12:59 pm

Also, which realm are you using? US-West uses newer hardware and may produce better results.

Regards,

Bob
bforeman

Tue Oct 30, 2012 1:21 pm

Bob,

> have you looked at the timings in the ECL Watch and tried to identify the process that eats up the time?

I've attached a screenshot of ECL Watch (link below) showing that the CSV read consumes more than 8 of the roughly 11 minutes total. The CSV write takes about 3 minutes.

https://www.dropbox.com/s/tagyzpebruxay ... 20Read.png
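
One way to cross-check that split, sketched here rather than taken from the job that was actually run, is a read-only workunit that scans the same logical file without writing anything back. The abbreviated record layout and the result name RecordCount are assumptions for illustration; the full 50-field layout would be substituted in practice.

```ecl
// Abbreviated layout for illustration only; the real record has 50 XML snippet fields.
SomeRecordStruct := RECORD
  INTEGER ID;
  STRING XMLSnippet1;
  STRING XMLSnippet2;
END;

Some_Recordset := DATASET('~thor::db::Sample.CSV', SomeRecordStruct, CSV);

// COUNT has to scan the whole CSV file but produces no output file,
// so the elapsed time of this workunit approximates the read cost alone.
OUTPUT(COUNT(Some_Recordset), NAMED('RecordCount'));
```

If this workunit takes about as long as the read activity in the graph, the time is going into reading and parsing the file rather than into the write.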


> which realm are you using?

Oregon (US West)

Thanks
Arun
arunarav
 

Sat Nov 03, 2012 1:38 pm

Any advice or input on this query would be appreciated since we need to explain this behavior.
arunarav
 

