Wed Dec 08, 2021 10:00 am
Login Register Lost Password? Contact Us


Intermittent long despray operation

Questions related to node architecture, redundancy and system monitoring

Mon Apr 01, 2013 2:46 pm Change Time Zone

I have some code that desprays a small recordset in XML format to my cluster's landing zone. The recordset is very small -- the resulting XML is 30K -- but it's the result of a 180-way Thor process running across nine physical nodes. Usually, the despray process takes only a second or two to complete. Intermittently, it takes much longer (1-2 minutes).

Here is an excerpt from /var/log/HPCCSystems/mydfuserver/dfuserver.log:

Code: Select all
0000220F 2013-04-01 09:10:07 57462 57474 "Gathering 180 file sizes on 13 threads"
00002210 2013-04-01 09:10:17 57462 57474 "WARNING: Waiting for file: //10.210.40.11/var/lib/HPCCSystems/hpcc-data/thor/tmp/id_bar_presence_of_data_rep
ort._180_of_180"
00002211 2013-04-01 09:10:27 57462 57474 "WARNING: Waiting for file: //10.210.40.19/var/lib/HPCCSystems/hpcc-data/thor/tmp/id_bar_presence_of_data_rep
ort._179_of_180"
00002212 2013-04-01 09:10:37 57462 57474 "WARNING: Waiting for file: //10.210.40.18/var/lib/HPCCSystems/hpcc-data/thor/tmp/id_bar_presence_of_data_rep
ort._178_of_180"
00002213 2013-04-01 09:10:47 57462 57474 "WARNING: Waiting for file: //10.210.40.15/var/lib/HPCCSystems/hpcc-data/thor/tmp/id_bar_presence_of_data_rep
ort._166_of_180"
00002214 2013-04-01 09:10:57 57462 57474 "WARNING: Waiting for file: //10.210.40.11/var/lib/HPCCSystems/hpcc-data/thor/tmp/id_bar_presence_of_data_rep
ort._162_of_180"
00002215 2013-04-01 09:11:02 57462 57474 "Finished gathering file sizes..."

These warnings, with the associated 10-second delay, appears to be the source of the intermittent problem.

Anecdotally, it seems that the delay pops up whenever the despray operation has not run recently. If I turn around and run this process again, it will typically fly right along.

How can I debug this issue? Any recommendations on perhaps a configuration parameter I could change? I noticed in the log that there are 13 threads allocated to gathering file sizes. Can that be increased? Or is that the wrong thing to be looking at?

Thanks,

Dan
DSC
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 568
Joined: Tue Oct 18, 2011 4:45 pm

Tue Apr 09, 2013 6:59 pm Change Time Zone

Any ideas?

My earlier assertion about the "first despray of the day exhibits the problem" appears to be true. Reliably, the first execution of the code in the morning coughs up the warnings and subsequent executions display no warnings.

Dan
DSC
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 568
Joined: Tue Oct 18, 2011 4:45 pm

Wed Apr 10, 2013 7:53 am Change Time Zone

It's not possible that your disks have spun down is it?
richardkchapman
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 110
Joined: Fri Jun 17, 2011 8:59 am

Wed Apr 10, 2013 8:58 am Change Time Zone

It also might be worth checking the DaFileSrv logs on the target nodes at the corresponding times, to see if any warnings there.
jsmith
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 81
Joined: Tue Jul 19, 2011 12:58 pm

Wed Apr 10, 2013 12:58 pm Change Time Zone

Thanks, Jake and Richard. I did some checking along your suggestions.

The drives aren't configured to spin down, and I doubt they would be idle at this point even if they were so configured. The warning I saw was at the tail end of a Thor process that exercised all the nodes and that process would spin them up, I would think. If nothing else, the warning was citing an attempted read of a temporary file, so that means the file would have been flushed to disk, forcing a spin up. (Probably.)

/var/log/HPCCSystems/mydafilesrv files on the target nodes show only usual activity: Ping-style entries every 10 minutes, then what I think are normal entries showing this read activity. Specifically, here is an excerpt from the first slave node matching the activity I originally posted:

Code: Select all
00004D54 2013-04-01 09:08:09 22185 22186 "SYS: PU=  3% MU= 17% MAL=182656 MMP=0 SBK=182656 TOT=1320K RAM=8825280K SWP=19108K"
00004D55 2013-04-01 09:08:09 22185 22186 "DSK: [sda] r/s=0.0 kr/s=0.0 w/s=2.2 kw/s=63.6 bsy=0 NIC: rxp/s=180.2 rxk/s=130.4 txp/s=134.8 txk/s=19.7 CPU: usr=2 sys=0 iow=0 idle=96"
00004D56 2013-04-01 09:10:09 22185 44712 "Connect from 10.210.40.10:48526"
00004D57 2013-04-01 09:10:51 22185 44712 "Connect from 10.210.40.10:48553"
00004D58 2013-04-01 09:11:08 22185 44712 "Connect from 10.210.40.10:48635"
00004D59 2013-04-01 09:11:08 22185 44712 "Connect from 10.210.40.10:48642"
00004D5A 2013-04-01 09:11:08 22185 45680 "Connect from 10.210.40.10:48645"
00004D5B 2013-04-01 09:18:09 22185 22186 "SYS: PU=  3% MU= 19% MAL=183360 MMP=0 SBK=183360 TOT=1320K RAM=9615812K SWP=19108K"
00004D5C 2013-04-01 09:18:09 22185 22186 "DSK: [sda] r/s=0.0 kr/s=0.8 w/s=2.3 kw/s=65.0 bsy=0 NIC: rxp/s=181.3 rxk/s=130.8 txp/s=137.2 txk/s=19.9 CPU: usr=2 sys=0 iow=0 idle=96"

Looking at this got me thinking about the overall flow of getting remote file sizes. I extracted the "Connect from" log entries from all the slave nodes and sorted them according to log entry timestamp. In the process I discovered that my node-9 has a ntpd issue and its clock is off. I compensated for that in what follows:

Code: Select all
2013-04-01 09:09:47 60097 18614 "Connect from 10.210.40.10:42813"   node-9
2013-04-01 09:10:09 22185 44712 "Connect from 10.210.40.10:48526"   node-2
2013-04-01 09:10:10 60097 18614 "Connect from 10.210.40.10:42823"   node-9
2013-04-01 09:10:11 63375 25048 "Connect from 10.210.40.10:52436"   node-10
2013-04-01 09:10:16 26667 49692 "Connect from 10.210.40.10:54560"   node-8
2013-04-01 09:10:22 40007 63390 "Connect from 10.210.40.10:38660"   node-7
2013-04-01 09:10:23 16627 39653 "Connect from 10.210.40.10:36105"   node-6
2013-04-01 09:10:28 36284 59662 "Connect from 10.210.40.10:41697"   node-5
2013-04-01 09:10:32  4149 29054 "Connect from 10.210.40.10:59902"   node-4
2013-04-01 09:10:36 60097 18614 "Connect from 10.210.40.10:42925"   node-9
2013-04-01 09:10:36 60097 18614 "Connect from 10.210.40.10:42937"   node-9
2013-04-01 09:10:38 12046 34983 "Connect from 10.210.40.10:59092"   node-3
2013-04-01 09:10:41 63375 25048 "Connect from 10.210.40.10:52448"   node-10
2013-04-01 09:10:46 26667 49692 "Connect from 10.210.40.10:54572"   node-8
2013-04-01 09:10:47 36284 59662 "Connect from 10.210.40.10:41705"   node-5
2013-04-01 09:10:51 22185 44712 "Connect from 10.210.40.10:48553"   node-2
2013-04-01 09:10:53 40007 63390 "Connect from 10.210.40.10:38681"   node-7
2013-04-01 09:10:56 16627 39653 "Connect from 10.210.40.10:36126"   node-6
2013-04-01 09:10:59  4149 29054 "Connect from 10.210.40.10:59921"   node-4
2013-04-01 09:11:02 12046 34983 "Connect from 10.210.40.10:59112"   node-3
2013-04-01 09:11:08  4149 29054 "Connect from 10.210.40.10:60005"   node-4
2013-04-01 09:11:08  4149 29054 "Connect from 10.210.40.10:60017"   node-4
2013-04-01 09:11:08 12046 34983 "Connect from 10.210.40.10:59195"   node-3
2013-04-01 09:11:08 12046 34983 "Connect from 10.210.40.10:59198"   node-3
2013-04-01 09:11:08 16627 39653 "Connect from 10.210.40.10:36207"   node-6
2013-04-01 09:11:08 16627 39653 "Connect from 10.210.40.10:36210"   node-6
2013-04-01 09:11:08 22185 44712 "Connect from 10.210.40.10:48635"   node-2
2013-04-01 09:11:08 22185 44712 "Connect from 10.210.40.10:48642"   node-2
2013-04-01 09:11:08 22185 45680 "Connect from 10.210.40.10:48645"   node-2
2013-04-01 09:11:08 26667 49692 "Connect from 10.210.40.10:54676"   node-8
2013-04-01 09:11:08 26667 49692 "Connect from 10.210.40.10:54678"   node-8
2013-04-01 09:11:08 26667 49692 "Connect from 10.210.40.10:54679"   node-8
2013-04-01 09:11:08 26667 49692 "Connect from 10.210.40.10:54681"   node-8
2013-04-01 09:11:08 36284 59662 "Connect from 10.210.40.10:41799"   node-5
2013-04-01 09:11:08 36284 59662 "Connect from 10.210.40.10:41809"   node-5
2013-04-01 09:11:08 36284 59662 "Connect from 10.210.40.10:41812"   node-5
2013-04-01 09:11:08 40007 63390 "Connect from 10.210.40.10:38761"   node-7
2013-04-01 09:11:08 40007 63390 "Connect from 10.210.40.10:38780"   node-7
2013-04-01 09:11:08 40007 64354 "Connect from 10.210.40.10:38781"   node-7
2013-04-01 09:11:08 63375 25048 "Connect from 10.210.40.10:52541"   node-10
2013-04-01 09:11:08 63375 25048 "Connect from 10.210.40.10:52546"   node-10

If find this timestamp pattern interesting, if not particularly informative, when you consider that 13 threads were supposedly running simultaneously.

Cheers,

Dan
DSC
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 568
Joined: Tue Oct 18, 2011 4:45 pm

Thu Apr 11, 2013 12:58 pm Change Time Zone

Here is a slightly clearer version of the timestamp information I tried to supply yesterday. This time I also made absolutely sure that all clocks were synchronized to within a tenth of a second of each other.

First, the warnings that appear in dfuserver.log:

Code: Select all
0000727D 2013-04-11 07:33:29 50016 50028 "Gathering 180 file sizes on 13 threads"
0000727E 2013-04-11 07:33:39 50016 50028 "WARNING: Waiting for file: //10.210.40.11/var/lib/HPCCSystems/hpcc-data/thor/tmp/id_bar_records_by_industry_report._180_of_180"
0000727F 2013-04-11 07:33:49 50016 50028 "WARNING: Waiting for file: //10.210.40.19/var/lib/HPCCSystems/hpcc-data/thor/tmp/id_bar_records_by_industry_report._179_of_180"
00007280 2013-04-11 07:33:59 50016 50028 "WARNING: Waiting for file: //10.210.40.18/var/lib/HPCCSystems/hpcc-data/thor/tmp/id_bar_records_by_industry_report._178_of_180"
00007281 2013-04-11 07:34:09 50016 50028 "WARNING: Waiting for file: //10.210.40.13/var/lib/HPCCSystems/hpcc-data/thor/tmp/id_bar_records_by_industry_report._164_of_180"
00007282 2013-04-11 07:34:19 50016 50028 "WARNING: Waiting for file: //10.210.40.18/var/lib/HPCCSystems/hpcc-data/thor/tmp/id_bar_records_by_industry_report._160_of_180"
00007283 2013-04-11 07:34:29 50016 50028 "WARNING: Waiting for file: //10.210.40.12/var/lib/HPCCSystems/hpcc-data/thor/tmp/id_bar_records_by_industry_report._163_of_180"
00007284 2013-04-11 07:34:37 50016 50028 "Finished gathering file sizes..."

Next, the corresponding connection logs from the slave nodes. These are sorted by timestamps and the node's IP address is appended to each line to make it easier to match up with the warning entries.

Code: Select all
2013-04-11 07:33:33 22185 44712 "Connect from 10.210.40.10:33767" 10.210.40.11
2013-04-11 07:33:36 63375 25048 "Connect from 10.210.40.10:37677" 10.210.40.19
2013-04-11 07:33:39 60097 18614 "Connect from 10.210.40.10:56286" 10.210.40.18
2013-04-11 07:33:43 26667 49692 "Connect from 10.210.40.10:39802" 10.210.40.17
2013-04-11 07:33:47 16627 39653 "Connect from 10.210.40.10:49578" 10.210.40.15
2013-04-11 07:33:52 40007 63390 "Connect from 10.210.40.10:52135" 10.210.40.16
2013-04-11 07:33:55 36284 59662 "Connect from 10.210.40.10:55171" 10.210.40.14
2013-04-11 07:33:56  4149 29054 "Connect from 10.210.40.10:45142" 10.210.40.13
2013-04-11 07:33:57 12046 34983 "Connect from 10.210.40.10:44332" 10.210.40.12
2013-04-11 07:33:58 22185 44712 "Connect from 10.210.40.10:33779" 10.210.40.11
2013-04-11 07:34:02 63375 25048 "Connect from 10.210.40.10:37689" 10.210.40.19
2013-04-11 07:34:03 60097 18614 "Connect from 10.210.40.10:56297" 10.210.40.18
2013-04-11 07:34:09 26667 50654 "Connect from 10.210.40.10:39814" 10.210.40.17
2013-04-11 07:34:10 40007 63390 "Connect from 10.210.40.10:52145" 10.210.40.16
2013-04-11 07:34:15 16627 39653 "Connect from 10.210.40.10:49590" 10.210.40.15
2013-04-11 07:34:17 36284 59662 "Connect from 10.210.40.10:55182" 10.210.40.14
2013-04-11 07:34:21 12046 34983 "Connect from 10.210.40.10:44342" 10.210.40.12
2013-04-11 07:34:22 63375 25048 "Connect from 10.210.40.10:37697" 10.210.40.19
2013-04-11 07:34:24 40007 63390 "Connect from 10.210.40.10:52150" 10.210.40.16
2013-04-11 07:34:28 26667 49692 "Connect from 10.210.40.10:39821" 10.210.40.17
2013-04-11 07:34:32  4149 29054 "Connect from 10.210.40.10:45158" 10.210.40.13
2013-04-11 07:34:35 36284 61189 "Connect from 10.210.40.10:55189" 10.210.40.14
2013-04-11 07:34:37  4149 29054 "Connect from 10.210.40.10:45160" 10.210.40.13
2013-04-11 07:35:00  4149 29054 "Connect from 10.210.40.10:45218" 10.210.40.13
2013-04-11 07:35:00  4149 29054 "Connect from 10.210.40.10:45236" 10.210.40.13
2013-04-11 07:35:00 12046 34983 "Connect from 10.210.40.10:44408" 10.210.40.12
2013-04-11 07:35:00 12046 34983 "Connect from 10.210.40.10:44419" 10.210.40.12
2013-04-11 07:35:00 12046 34983 "Connect from 10.210.40.10:44424" 10.210.40.12
2013-04-11 07:35:00 16627 39653 "Connect from 10.210.40.10:49665" 10.210.40.15
2013-04-11 07:35:00 16627 39653 "Connect from 10.210.40.10:49672" 10.210.40.15
2013-04-11 07:35:00 16627 39653 "Connect from 10.210.40.10:49673" 10.210.40.15
2013-04-11 07:35:00 22185 44712 "Connect from 10.210.40.10:33863" 10.210.40.11
2013-04-11 07:35:00 22185 44712 "Connect from 10.210.40.10:33864" 10.210.40.11
2013-04-11 07:35:00 26667 49692 "Connect from 10.210.40.10:39886" 10.210.40.17
2013-04-11 07:35:00 26667 49692 "Connect from 10.210.40.10:39889" 10.210.40.17
2013-04-11 07:35:00 36284 59662 "Connect from 10.210.40.10:55250" 10.210.40.14
2013-04-11 07:35:00 36284 61189 "Connect from 10.210.40.10:55252" 10.210.40.14
2013-04-11 07:35:00 40007 63390 "Connect from 10.210.40.10:52212" 10.210.40.16
2013-04-11 07:35:00 40007 63390 "Connect from 10.210.40.10:52226" 10.210.40.16
2013-04-11 07:35:00 60097 18614 "Connect from 10.210.40.10:56375" 10.210.40.18
2013-04-11 07:35:00 60097 18614 "Connect from 10.210.40.10:56377" 10.210.40.18
2013-04-11 07:35:00 63375 25048 "Connect from 10.210.40.10:37766" 10.210.40.19
2013-04-11 07:35:00 63375 25048 "Connect from 10.210.40.10:37775" 10.210.40.19

And just to prove that I'm not crazy, I submitted the job again and here is the snippet from dfuserver.log:

Code: Select all
0000759A 2013-04-11 07:44:36 50016 50028 "Gathering 180 file sizes on 13 threads"
0000759B 2013-04-11 07:44:36 50016 50028 "Finished gathering file sizes..."

Just from looking at these timestamps, I would guess that the node running dfuserver is the source of the problems, whatever those are. Most telling is the delay between the "Gathering..." log entry and the first connection entry from a slave node. The dfuserver node is a decent hardware node running everything except Thor and Roxie. It's frankly bored most of the time.

Dan
DSC
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 568
Joined: Tue Oct 18, 2011 4:45 pm


Return to System Health

Who is online

Users browsing this forum: No registered users and 1 guest