Thu Oct 21, 2021 4:51 am
Login Register Lost Password? Contact Us


watchdog has lost contact with Thor slave

Comments and questions related to the Enterprise Control Language

Fri Oct 21, 2011 9:40 pm Change Time Zone

I am running the following query
and I always get the following error

Error><source>eclagent</source><code>10056</code><message>System error: 10056: Watchdog has lost contact with Thor slave: 10.127.xxx.xxx:6600

I have checked the slave and is up and running. There is enough memory and disk space

import $, std;

ResourceRecord := RECORD, MAXLENGTH(8192)
STRING ip;
INTEGER rid;
STRING dns;
INTEGER volume;
END;

STRING rr_in_dir := '~rr';
STRING rr_file1 := 'rrsets_20110801';
STRING rr_file2 := 'rrsets_20110802';
STRING rr_file3 := 'rrsets_20110803';
STRING rr_file4 := 'rrsets_20110804';
STRING rr_file5 := 'rrsets_20110805';

STRING rr_file := 'rrsets_201108';
STRING rr_out_dir := 'rr_files_processed';


// all the resource record data
rdata1 := DATASET(rr_in_dir+'::'+rr_file1, ResourceRecord,
CSV(MAXLENGTH(8192), SEPARATOR(['\t', ' '])));
rdata2 := DATASET(rr_in_dir+'::'+rr_file2, record_defs.ResourceRecord,
CSV(MAXLENGTH(8192), SEPARATOR(['\t', ' '])));
rdata3 := DATASET(rr_in_dir+'::'+rr_file3, ResourceRecord,
CSV(MAXLENGTH(8192), SEPARATOR(['\t', ' '])));
rdata4 := DATASET(rr_in_dir+'::'+rr_file4, ResourceRecord,
CSV(MAXLENGTH(8192), SEPARATOR(['\t', ' '])));
rdata5 := DATASET(rr_in_dir+'::'+rr_file5, ResourceRecord,
CSV(MAXLENGTH(8192), SEPARATOR(['\t', ' '])));

rdata:=DISTRIBUTE(rdata1+rdata2+rdata3+rdata4+rdata5);

//OUTPUT(COUNT(rdata));

rdata_clean := rdata(IsIP(rdata.ip));
OUTPUT(rdata_clean,,rr_out_dir+'::'+rr_file+'_clean');

BOOLEAN IsIP(STRING x) := FUNCTION
words:=std.Str.SplitWords(x, '.');
BOOLEAN is4words := IF(COUNT(words)!=4, false,
true);
Rec :=RECORD
STRING str;
END;
ds:=DATASET(words, Rec);
valid_words:=ds((INTEGER)str>=0 AND (INTEGER)str<=255);
result:=IF(is4words AND COUNT(valid_words)=4, true, false);
return result;
END;


END;
nvasil
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 105
Joined: Mon Oct 17, 2011 6:48 pm

Mon Oct 24, 2011 12:53 pm Change Time Zone

I can't see anything jump out in the code that would cause a delay, perhaps if you posted the THOR log we could see what is causing the delay and loss of contact.

Regards,

Bob
bforeman
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1005
Joined: Wed Jun 29, 2011 7:13 pm

Mon Oct 24, 2011 3:16 pm Change Time Zone

Where do I find the log file?
Is it in the esp server on the slave that has the problem?
nvasil
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 105
Joined: Mon Oct 17, 2011 6:48 pm

Mon Oct 24, 2011 3:27 pm Change Time Zone

Where do I find the log file?
Is it in the esp server on the slave that has the problem?

In the ECL Watch from the ECL IDE, there should be a log in the helpers section, and then the slave log would be found by opening the ECL Watch in a browser, selecting target clusters/mythor and then clicking on the disk icon to drill down to the appropriate slave log. When you open the target slave log you can download and then post it here.

Hope this helps,

Bob
bforeman
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1005
Joined: Wed Jun 29, 2011 7:13 pm

Mon Oct 24, 2011 3:37 pm Change Time Zone

I am not using the IDE, I work on ubuntu and I still haven't been able to make it work. By the way the fact that the development tools are in windows is the biggest pain for HPCC, if you are using it for deploying systems.

Eventually I went through the slave and I found this. It is complaining about a memory leak :-)

00000000 2011-10-24 14:51:35 7564 7564 Opened log file //10.222.195.159/mnt/HPCCSystems/mythor/10_24_2011_14_51_34/THORSLAVE.10.222.195.159_6600.log
00000001 2011-10-24 14:51:35 7564 7564 Build community_3.2.0-2
00000002 2011-10-24 14:51:35 7564 7564 calling initClientProcess
00000003 2011-10-24 14:51:35 7564 7564 registering 10.222.195.159:6600 - master 10.90.213.34:6500
00000004 2011-10-24 14:51:36 7564 7564 Initialization received
00000005 2011-10-24 14:51:36 7564 7564 Registration confirmation sent
00000006 2011-10-24 14:51:36 7564 7564 verifying mp connection to rest of cluster
00000007 2011-10-24 14:51:36 7564 7564 verified mp connection to rest of cluster
00000008 2011-10-24 14:51:36 7564 7564 registered 10.222.195.159:6600
00000009 2011-10-24 14:51:36 7564 7564 CThorRowManager initialized, memlimit = 2147483648
0000000A 2011-10-24 14:51:36 7564 7564 ThorSlave Version LCR - 4.0 started
0000000B 2011-10-24 14:51:36 7564 7564 Slave 10.222.195.159:6600 - thor_tmp_dir set to : /mnt/HPCCSystems/mythor/temp/
0000000C 2011-10-24 14:51:36 7564 7564 Using querySo directory: /mnt/HPCCSystems/queries/mythor
0000000D 2011-10-24 14:51:36 7564 7564 FileCache: limit = 1800, purgeN = 10
0000000E 2011-10-24 14:51:36 7564 7574 priority set id=140538337232640 policy=0 pri=0 PID=7564
0000000F 2011-10-24 14:51:36 7564 7575 Watchdog: thread running
00000010 2011-10-24 15:00:31 7564 7564 Using query: /mnt/HPCCSystems/queries/mythor/V179191376_libW20111024-150030.so
00000011 2011-10-24 15:00:31 7564 7564 New Graph started : graph1
00000012 2011-10-24 15:00:31 7564 7564 temp directory cleared
00000013 2011-10-24 15:00:31 7564 7564 Disk space: /mnt/HPCCSystems/hpcc-data/thor = 389773, /mnt/HPCCSystems/hpcc-mirror/thor = 0
00000014 2011-10-24 15:00:31 7564 7564 Key file cache size set to: 6
00000015 2011-10-24 15:00:31 7564 7564 GraphInit: W20111024-150030graph1
00000016 2011-10-24 15:00:31 7564 7564 deserializeMPTag: tag = 65542
00000017 2011-10-24 15:00:31 7564 7564 deserializeMPTag: tag = 65539
00000018 2011-10-24 15:00:31 7564 7564 deserializeMPTag: tag = 65540
00000019 2011-10-24 15:00:31 7564 7564 deserializeMPTag: tag = 65541
0000001A 2011-10-24 15:00:31 7564 7564 Add: Launching graph thread for graphId=1
0000001B 2011-10-24 15:00:31 7564 7590 Running graph [global] : <graph>
<node id="2" label="Csv Read&#10;&apos;...::rrsets_20110801&apos;">
<att name="definition" value="dedupnotos.ecl(21,1)"/>
<att name="name" value="rdata1"/>
<att name="_kind" value="99"/>
<att name="ecl" value="DATASET(&apos;~rr::rrsets_20110801&apos;, resourcerecord, CSV(maxlength(8192), separator([&apos;\t&apos;, &apos; &apos;])));&#10;"/>
<att name="recordSize" value="24..8192(536)"/>
<att name="recordCount" value="0..?[disk]"/>
<att name="_fileName" value="~rr::rrsets_20110801"/>
</node>
<node id="3" label="Csv Read&#10;&apos;...::rrsets_20110802&apos;">
<att name="definition" value="dedupnotos.ecl(23,1)"/>
<att name="name" value="rdata2"/>
<att name="_kind" value="99"/>
<att name="ecl" value="DATASET(&apos;~rr::rrsets_20110802&apos;, resourcerecord, CSV(maxlength(8192), separator([&apos;\t&apos;, &apos; &apos;])));&#10;"/>
<att name="recordSize" value="24..8192(536)"/>
<att name="recordCount" value="0..?[disk]"/>
<att name="_fileName" value="~rr::rrsets_20110802"/>
</node>
<node id="4" label="Csv Read&#10;&apos;...::rrsets_20110803&apos;">
<att name="definition" value="dedupnotos.ecl(25,1)"/>
<att name="name" value="rdata3"/>
<att name="_kind" value="99"/>
<att name="ecl" value="DATASET(&apos;~rr::rrsets_20110803&apos;, resourcerecord, CSV(maxlength(8192), separator([&apos;\t&apos;, &apos; &apos;])));&#10;"/>
<att name="recordSize" value="24..8192(536)"/>
<att name="recordCount" value="0..?[disk]"/>
<att name="_fileName" value="~rr::rrsets_20110803"/>
</node>
<node id="5" label="Csv Read&#10;&apos;...::rrsets_20110804&apos;">
<att name="definition" value="dedupnotos.ecl(27,1)"/>
<att name="name" value="rdata4"/>
<att name="_kind" value="99"/>
<att name="ecl" value="DATASET(&apos;~rr::rrsets_20110804&apos;, resourcerecord, CSV(maxlength(8192), separator([&apos;\t&apos;, &apos; &apos;])));&#10;"/>
<att name="recordSize" value="24..8192(536)"/>
<att name="recordCount" value="0..?[disk]"/>
<att name="_fileName" value="~rr::rrsets_20110804"/>
</node>
<node id="6" label="Csv Read&#10;&apos;...::rrsets_20110805&apos;">
<att name="definition" value="dedupnotos.ecl(29,1)"/>
<att name="name" value="rdata5"/>
<att name="_kind" value="99"/>
<att name="ecl" value="DATASET(&apos;~rr::rrsets_20110805&apos;, resourcerecord, CSV(maxlength(8192), separator([&apos;\t&apos;, &apos; &apos;])));&#10;"/>
<att name="recordSize" value="24..8192(536)"/>
<att name="recordCount" value="0..?[disk]"/>
<att name="_fileName" value="~rr::rrsets_20110805"/>
</node>
<node id="7" label="Funnel">
<att name="definition" value="dedupnotos.ecl(32,1)"/>
<att name="name" value="rdata"/>
<att name="_kind" value="22"/>
<att name="ecl" value="rdata1 + rdata2 + rdata3 + rdata4 + rdata5;&#10;"/>
<att name="recordSize" value="24..8192(536)"/>
<att name="recordCount" value="0..?[disk]"/>
</node>
<node id="8" label="Filter">
<att name="definition" value="dedupnotos.ecl(37,1)"/>
<att name="name" value="rdata_clean"/>
<att name="_kind" value="5"/>
<att name="ecl" value="FILTER(COUNT(std.str.dosplitwords(src, separator, std.str.calcwordsetsize(src, separator))) = ...);&#10;"/>
<att name="recordSize" value="24..8192(536)"/>
<att name="recordCount" value="0..?[disk]"/>
</node>
<node id="9" label="Disk Write&#10;&apos;rr_files_processed::rrsets_201108_clean&apos;">
<att name="definition" value="dedupnotos.ecl(1,1)"/>
<att name="name" value="dedupnotos"/>
<att name="definition" value="dedupnotos.ecl(38,1)"/>
<att name="_kind" value="2"/>
<att name="ecl" value="OUTPUT(..., , &apos;rr_files_processed::rrsets_201108_clean&apos;);&#10;"/>
<att name="recordSize" value="24..8192(536)"/>
<att name="_fileName" value="rr_files_processed::rrsets_201108_clean"/>
</node>
<att name="rootGraph" value="1"/>
<edge id="2_0" source="2" target="7"/>
<edge id="3_0" source="3" target="7">
<att name="_targetIndex" value="1"/>
</edge>
<edge id="4_0" source="4" target="7">
<att name="_targetIndex" value="2"/>
</edge>
<edge id="5_0" source="5" target="7">
<att name="_targetIndex" value="3"/>
</edge>
<edge id="6_0" source="6" target="7">
<att name="_targetIndex" value="4"/>
</edge>
<edge id="7_0" source="7" target="8"/>
<edge id="8_0" source="8" target="9"/>
</graph>
- graph(graph1, 1)
0000001C 2011-10-24 15:00:31 7564 7590 CONNECTING (id=2, idx=0) to (id=7, idx=0) - activity(funnel, 7)
0000001D 2011-10-24 15:00:31 7564 7590 CONNECTING (id=3, idx=0) to (id=7, idx=1) - activity(funnel, 7)
0000001E 2011-10-24 15:00:31 7564 7590 CONNECTING (id=4, idx=0) to (id=7, idx=2) - activity(funnel, 7)
0000001F 2011-10-24 15:00:31 7564 7590 CONNECTING (id=5, idx=0) to (id=7, idx=3) - activity(funnel, 7)
00000020 2011-10-24 15:00:31 7564 7590 CONNECTING (id=6, idx=0) to (id=7, idx=4) - activity(funnel, 7)
00000021 2011-10-24 15:00:31 7564 7590 CONNECTING (id=7, idx=0) to (id=8, idx=0) - activity(filter, 8)
00000022 2011-10-24 15:00:31 7564 7590 CONNECTING (id=8, idx=0) to (id=9, idx=0) - activity(diskwrite, 9)
00000023 2011-10-24 15:00:31 7564 7590 FUNNEL mode = PARALLEL, grouped=UNGROUPED - activity(funnel, 7)
00000024 2011-10-24 15:00:31 7564 7590 Watchdog: Start Job 1
00000025 2011-10-24 15:00:31 7564 7591 handling fname : /mnt/HPCCSystems/hpcc-data/thor/thor/rr_files_processed/rrsets_201108_clean._2_of_2 - activity(diskwrite, 9)
00000026 2011-10-24 15:00:31 7564 7591 Starting input - activity(diskwrite, 9)
00000027 2011-10-24 15:00:31 7564 7591 Starting input - activity(filter, 8)
00000028 2011-10-24 15:00:31 7564 7592 csvread[part=1]: reading physical file '/mnt/HPCCSystems/hpcc-data/thor/rr/rrsets_20110801._2_of_2' (logical file = ~rr::rrsets_20110801) - activity(csvread, 2)
00000029 2011-10-24 15:00:31 7564 7592 csvread[part=1]: Base offset to 2486348791 - activity(csvread, 2)
0000002A 2011-10-24 15:00:31 7564 7592 ITDL starting for output 0 - activity(csvread, 2)
0000002B 2011-10-24 15:00:31 7564 7594 csvread[part=1]: reading physical file '/mnt/HPCCSystems/hpcc-data/thor/rr/rrsets_20110803._2_of_2' (logical file = ~rr::rrsets_20110803) - activity(csvread, 4)
0000002C 2011-10-24 15:00:31 7564 7594 csvread[part=1]: Base offset to 2542483820 - activity(csvread, 4)
0000002D 2011-10-24 15:00:31 7564 7593 csvread[part=1]: reading physical file '/mnt/HPCCSystems/hpcc-data/thor/rr/rrsets_20110802._2_of_2' (logical file = ~rr::rrsets_20110802) - activity(csvread, 3)
0000002E 2011-10-24 15:00:31 7564 7593 csvread[part=1]: Base offset to 2480747096 - activity(csvread, 3)
0000002F 2011-10-24 15:00:31 7564 7591 ITDL starting for output 0 - activity(funnel, 7)
00000030 2011-10-24 15:00:31 7564 7591 ITDL starting for output 0 - activity(filter, 8)
00000031 2011-10-24 15:00:31 7564 7595 csvread[part=1]: reading physical file '/mnt/HPCCSystems/hpcc-data/thor/rr/rrsets_20110804._2_of_2' (logical file = ~rr::rrsets_20110804) - activity(csvread, 5)
00000032 2011-10-24 15:00:31 7564 7594 ITDL starting for output 0 - activity(csvread, 4)
00000034 2011-10-24 15:00:31 7564 7593 ITDL starting for output 0 - activity(csvread, 3)
00000035 2011-10-24 15:00:31 7564 7591 Writing to file: /mnt/HPCCSystems/hpcc-data/thor/thor/rr_files_processed/thtmp7564_1__partial.tmp - activity(diskwrite, 9)
00000036 2011-10-24 15:00:31 7564 7591 Created output stream for /mnt/HPCCSystems/hpcc-data/thor/thor/rr_files_processed/rrsets_201108_clean._2_of_2 - activity(diskwrite, 9)
00000037 2011-10-24 15:00:31 7564 7591 Ungrouped - activity(diskwrite, 9)
00000033 2011-10-24 15:00:31 7564 7595 csvread[part=1]: Base offset to 2478564725 - activity(csvread, 5)
00000038 2011-10-24 15:00:31 7564 7595 ITDL starting for output 0 - activity(csvread, 5)
00000039 2011-10-24 15:00:31 7564 7592 Record size (max) = 8192 - activity(csvread, 2)
0000003A 2011-10-24 15:00:31 7564 7594 Record size (max) = 8192 - activity(csvread, 4)
0000003B 2011-10-24 15:00:31 7564 7591 Record size (max) = 8192 - activity(funnel, 7)
0000003C 2011-10-24 15:00:31 7564 7591 Record size (max) = 8192 - activity(filter, 8)
0000003D 2011-10-24 15:00:31 7564 7596 csvread[part=1]: reading physical file '/mnt/HPCCSystems/hpcc-data/thor/rr/rrsets_20110805._2_of_2' (logical file = ~rr::rrsets_20110805) - activity(csvread, 6)
0000003E 2011-10-24 15:00:31 7564 7596 csvread[part=1]: Base offset to 2395982404 - activity(csvread, 6)
0000003F 2011-10-24 15:00:31 7564 7596 ITDL starting for output 0 - activity(csvread, 6)
00000040 2011-10-24 15:00:31 7564 7593 Record size (max) = 8192 - activity(csvread, 3)
00000041 2011-10-24 15:00:31 7564 7596 Record size (max) = 8192 - activity(csvread, 6)
00000042 2011-10-24 15:00:31 7564 7595 Record size (max) = 8192 - activity(csvread, 5)
00000043 2011-10-24 15:01:31 7564 7589 SYS: PU= 30% MU= 6% MAL=6980688 MMP=6602752 SBK=377936 TOT=7016K RAM=1001084K SWP=0K
00000044 2011-10-24 15:02:31 7564 7589 SYS: PU= 30% MU= 6% MAL=6997216 MMP=6602752 SBK=394464 TOT=7016K RAM=1020716K SWP=0K
00000045 2011-10-24 15:02:31 7564 7589 DSK: [sdb] r/s=0.1 kr/s=0.4 w/s=343.4 kw/s=14684.0 bsy=27 NIC: rxp/s=0.1 rxk/s=0.0 txp/s=0.1 txk/s=0.0 CPU: usr=14 sys=8 iow=8 idle=69
00000046 2011-10-24 15:03:31 7564 7589 SYS: PU= 30% MU= 6% MAL=6997216 MMP=6602752 SBK=394464 TOT=7016K RAM=1038516K SWP=0K
00000047 2011-10-24 15:03:31 7564 7589 DSK: [sdb] r/s=0.1 kr/s=0.4 w/s=208.4 kw/s=8738.5 bsy=17 NIC: rxp/s=0.1 rxk/s=0.0 txp/s=0.1 txk/s=0.0 CPU: usr=16 sys=9 iow=3 idle=69
00000048 2011-10-24 15:04:31 7564 7589 SYS: PU= 28% MU= 6% MAL=6997216 MMP=6602752 SBK=394464 TOT=7016K RAM=1032512K SWP=0K
00000049 2011-10-24 15:04:31 7564 7589 DSK: [sdb] r/s=19.3 kr/s=809.5 w/s=346.2 kw/s=14829.5 bsy=30 NIC: rxp/s=0.1 rxk/s=0.0 txp/s=0.1 txk/s=0.0 CPU: usr=14 sys=8 iow=5 idle=71
0000004A 2011-10-24 15:05:31 7564 7589 SYS: PU= 29% MU= 6% MAL=6997216 MMP=6602752 SBK=394464 TOT=7016K RAM=1028748K SWP=0K
0000004B 2011-10-24 15:05:31 7564 7589 DSK: [sdb] r/s=61.3 kr/s=2586.5 w/s=248.1 kw/s=10461.6 bsy=25 NIC: rxp/s=0.1 rxk/s=0.0 txp/s=0.2 txk/s=0.0 CPU: usr=14 sys=9 iow=5 idle=70
0000004C 2011-10-24 15:06:31 7564 7589 SYS: PU= 32% MU= 6% MAL=6997216 MMP=6602752 SBK=394464 TOT=7016K RAM=1020944K SWP=0K
0000004D 2011-10-24 15:06:31 7564 7589 DSK: [sdb] r/s=114.9 kr/s=4848.1 w/s=219.9 kw/s=9301.1 bsy=26 NIC: rxp/s=0.0 rxk/s=0.0 txp/s=0.1 txk/s=0.0 CPU: usr=15 sys=9 iow=8 idle=67
0000004E 2011-10-24 15:06:38 7564 7564 GraphAbort: W20111024-150030graph1
0000004F 2011-10-24 15:06:39 7564 7564 Abort condition set - activity(diskwrite, 9)
00000050 2011-10-24 15:06:39 7564 7564 Abort condition set - activity(filter, 8)
00000051 2011-10-24 15:06:39 7564 7564 Abort condition set - activity(funnel, 7)
00000052 2011-10-24 15:06:39 7564 7564 Abort condition set - activity(csvread, 2)
00000053 2011-10-24 15:06:39 7564 7564 Abort condition set - activity(csvread, 6)
00000054 2011-10-24 15:06:39 7564 7564 Abort condition set - activity(csvread, 5)
00000055 2011-10-24 15:06:39 7564 7564 Abort condition set - activity(csvread, 4)
00000056 2011-10-24 15:06:39 7564 7564 Abort condition set - activity(csvread, 3)
00000057 2011-10-24 15:06:39 7564 7591 Wrote 46387785 records, crc=0x195616B6 - activity(diskwrite, 9)
00000058 2011-10-24 15:06:39 7564 7591 Stopping input for - activity(diskwrite, 9)
00000059 2011-10-24 15:06:39 7564 7591 Stopping input for - activity(filter, 8)
0000005A 2011-10-24 15:06:39 7564 7594 ITDL output 0 stopped, count was 12513581 - activity(csvread, 4)
0000005C 2011-10-24 15:06:39 7564 7593 ITDL output 0 stopped, count was 12653054 - activity(csvread, 3)
0000005D 2011-10-24 15:06:39 7564 7592 ITDL output 0 stopped, count was 12227349 - activity(csvread, 2)
0000005E 2011-10-24 15:06:39 7564 7596 ITDL output 0 stopped, count was 12385313 - activity(csvread, 6)
0000005F 2011-10-24 15:06:39 7564 7596 funnel(4): Read 12385313 records - activity(funnel, 7) [ecl=rdata1 + rdata2 + rdata3 + rdata4 + rdata5;]
00000060 2011-10-24 15:06:39 7564 7593 funnel(1): Read 12653054 records - activity(funnel, 7) [ecl=rdata1 + rdata2 + rdata3 + rdata4 + rdata5;]
0000005B 2011-10-24 15:06:39 7564 7595 ITDL output 0 stopped, count was 12501224 - activity(csvread, 5)
00000061 2011-10-24 15:06:39 7564 7592 funnel(0): Read 12227349 records - activity(funnel, 7) [ecl=rdata1 + rdata2 + rdata3 + rdata4 + rdata5;]
00000062 2011-10-24 15:06:39 7564 7595 funnel(3): Read 12501224 records - activity(funnel, 7) [ecl=rdata1 + rdata2 + rdata3 + rdata4 + rdata5;]
00000063 2011-10-24 15:06:39 7564 7594 funnel(2): Read 12513581 records - activity(funnel, 7) [ecl=rdata1 + rdata2 + rdata3 + rdata4 + rdata5;]
00000064 2011-10-24 15:06:39 7564 7591 ITDL output 0 stopped, count was 62259797 - activity(funnel, 7)
00000065 2011-10-24 15:06:39 7564 7591 ITDL output 0 stopped, count was 46387785 - activity(filter, 8)
00000066 2011-10-24 15:06:39 7564 7590 Graph wait cancelled, aborted=true - graph(graph1, 1)
00000067 2011-10-24 15:06:39 7564 7590 End of sub-graph - graph(graph1, 1)
00000068 2011-10-24 15:06:39 7564 7590 Watchdog: Stop Job 1
00000069 2011-10-24 15:06:39 7564 7590 Socket statistics : connects=1
connecttime=472us
failedconnects=0
failedconnecttime=0us
reads=46
readtime=785us
readsize=142343 bytes
writes=87
writetime=2944us
writesize=7605 bytes
activesockets=5
numblockrecvs=0
numblocksends=0
blockrecvsize=0
blocksendsize=0
blockrecvtime=0
blocksendtime=0
longestblocksend=0
longestblocksize=0
- graph(graph1, 1)
0000006A 2011-10-24 15:06:39 7564 7590 Graph Done - graph(graph1, 1)
0000006B 2011-10-24 15:06:39 7564 7590 PU= 28% MU= 6% MAL=2500112 MMP=2113536 SBK=386576 TOT=2632K RAM=1015248K SWP=0K DSK: [sdb] r/s=136.3 kr/s=5749.8 w/s=6.6 kw/s=26.3 bsy=6 NIC: rxp/s=4.4 rxk/s=0.6 txp/s=4.9 txk/s=0.7 CPU: usr=17 sys=10 iow=1 idle=71 - graph(graph1, 1)
0000006C 2011-10-24 15:06:39 7564 7590 CGraphExecutor running=0, waitingToRun=0, dependentsWaiting=0
0000006D 2011-10-24 15:06:39 7564 7564 QueryDone, removing W20111024-150030graph1 from jobs
0000006E 2011-10-24 15:06:39 7564 7564 Job ended : graph1
0000006F 2011-10-24 15:06:39 7564 7564 destroying ProcessSlaveActivity - activity(diskwrite, 9)
00000070 2011-10-24 15:06:39 7564 7564 ProcessSlaveActivity : joining process thread - activity(diskwrite, 9)
00000071 2011-10-24 15:06:39 7564 7564 AFTER ProcessSlaveActivity : joining process thread - activity(diskwrite, 9)
00000072 2011-10-24 15:06:39 7564 7564 DESTROYED - activity(diskwrite, 9)
00000073 2011-10-24 15:06:39 7564 7564 DESTROYED - activity(filter, 8)
00000074 2011-10-24 15:06:39 7564 7564 DESTROYED - activity(csvread, 2)
00000075 2011-10-24 15:06:39 7564 7564 DESTROYED - activity(csvread, 3)
00000076 2011-10-24 15:06:39 7564 7564 DESTROYED - activity(csvread, 4)
00000077 2011-10-24 15:06:39 7564 7564 DESTROYED - activity(csvread, 5)
00000078 2011-10-24 15:06:39 7564 7564 DESTROYED - activity(csvread, 6)
00000079 2011-10-24 15:06:39 7564 7564 DESTROYED - activity(funnel, 7)
0000007A 2011-10-24 15:06:39 7564 7564 CJobBase resetting memory manager
0000007B 2011-10-24 15:06:39 7564 7564 JMALLOC LEAKCHECKING: 1 leaks, total memory 82
0000007C 2011-10-24 15:06:39 7564 7564 JMALLOC OSBLOCKS: 4, total memory 4194304
0000007D 2011-10-24 15:06:39 7564 7564 CThorRowManager initialized, memlimit = 2147483648
0000007E 2011-10-24 15:06:39 7564 7564 QueryDone, removed W20111024-150030graph1 from jobs
0000007F 2011-10-24 15:06:40 7564 7564 Stopped watchdog
nvasil
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 105
Joined: Mon Oct 17, 2011 6:48 pm

Mon Oct 24, 2011 3:53 pm Change Time Zone

Hmmm, I was hoping the log would show the posted error (Watchdog has lost contact with Thor slave) but I don't see it there. Wondering if you posted the correct log. I have forwarded this to the development team for review.
bforeman
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1005
Joined: Wed Jun 29, 2011 7:13 pm

Mon Oct 24, 2011 4:09 pm Change Time Zone

This is the log of the slave who failed, there is clearly a JMALLOC error. Here is the log from the master.

00000000 2011-10-24 15:25:06 31976 31976 Opened log file //10.90.213.34/mnt/HPCCSystems/mythor/10_24_2011_15_25_05/THORM
ASTER.log
00000001 2011-10-24 15:25:06 31976 31976 Build community_3.2.0-2
00000002 2011-10-24 15:25:06 31976 31976 calling initClientProcess Port 6500
00000003 2011-10-24 15:25:06 31976 31976 Checking cluster replicate nodes
00000004 2011-10-24 15:25:06 31976 31976 Cluster replicate nodes check completed in 2ms
00000005 2011-10-24 15:25:06 31976 31976 Starting watchdog
00000006 2011-10-24 15:25:06 31976 31990 Started watchdog
00000007 2011-10-24 15:25:06 31976 31976 ThorMaster version 4.0, Started on 10.90.213.34:6500
00000008 2011-10-24 15:25:06 31976 31976 CThorRowManager initialized, memlimit = 2147483648
00000009 2011-10-24 15:25:06 31976 31976 Thor name = mythor, queue = mythor.thor, nodeGroup = mythor
0000000A 2011-10-24 15:25:06 31976 31976 Creating sentinel file thor.sentinel for rerun from script
0000000B 2011-10-24 15:25:06 31976 31976 Waiting for 2 slaves to register
0000000C 2011-10-24 15:25:06 31976 31976 Verifying connection to slave 2
0000000D 2011-10-24 15:25:06 31976 31976 verified connection with 10.222.195.159:6600
0000000E 2011-10-24 15:25:06 31976 31976 Verifying connection to slave 1
0000000F 2011-10-24 15:25:06 31976 31976 verified connection with 10.93.95.117:6600
00000010 2011-10-24 15:25:06 31976 31976 Slaves connected, initializing..
00000011 2011-10-24 15:25:06 31976 31976 Initialization sent to slave group
00000012 2011-10-24 15:25:06 31976 31976 Registration confirmation from 10.93.95.117:6600
00000013 2011-10-24 15:25:06 31976 31976 Slave 1 (10.93.95.117:6600) registered
00000014 2011-10-24 15:25:06 31976 31976 Registration confirmation from 10.222.195.159:6600
00000015 2011-10-24 15:25:06 31976 31976 Slave 2 (10.222.195.159:6600) registered
00000016 2011-10-24 15:25:06 31976 31976 Slaves initialized
00000017 2011-10-24 15:25:06 31976 31976 verifying mp connection to rest of cluster
00000018 2011-10-24 15:25:06 31976 31976 verified mp connection to rest of cluster
00000019 2011-10-24 15:25:06 31976 31976 ,Progress,Thor,Startup,mythor,mythor,mythor.thor,//10.90.213.34/mnt/HPCCSystems
/mythor/10_24_2011_15_25_05/THORMASTER.log
0000001A 2011-10-24 15:25:06 31976 31976 Listening for graph
0000001B 2011-10-24 15:25:06 31976 31976 ThorLCR(10.90.213.34:6500) available, waiting on queue thor.thor
0000001C 2011-10-24 15:26:06 31976 31989 SYS: PU= 0% MU= 4% MAL=220800 MMP=0 SBK=220800 TOT=264K RAM=673472K SWP=0K
0000001D 2011-10-24 15:27:06 31976 31989 SYS: PU= 0% MU= 4% MAL=237328 MMP=0 SBK=237328 TOT=264K RAM=670156K SWP=0K
0000001E 2011-10-24 15:27:06 31976 31989 DSK: [sdb] r/s=0.0 kr/s=0.0 w/s=2.4 kw/s=9.7 bsy=0 NIC: rxp/s=18.4 rxk/s=2.5 tx
p/s=16.2 txk/s=2.5 CPU: usr=0 sys=0 iow=0 idle=99
0000001F 2011-10-24 15:28:06 31976 31989 SYS: PU= 0% MU= 4% MAL=237328 MMP=0 SBK=237328 TOT=264K RAM=674784K SWP=0K
00000020 2011-10-24 15:28:06 31976 31989 DSK: [sdb] r/s=0.0 kr/s=0.0 w/s=2.5 kw/s=9.8 bsy=0 NIC: rxp/s=25.9 rxk/s=3.3 tx
p/s=21.5 txk/s=9.9 CPU: usr=0 sys=0 iow=0 idle=99
00000021 2011-10-24 15:29:06 31976 31989 SYS: PU= 0% MU= 4% MAL=237328 MMP=0 SBK=237328 TOT=264K RAM=675024K SWP=0K
00000022 2011-10-24 15:29:06 31976 31989 DSK: [sdb] r/s=0.0 kr/s=0.0 w/s=2.7 kw/s=10.7 bsy=0 NIC: rxp/s=17.9 rxk/s=2.5 t
xp/s=15.9 txk/s=2.5 CPU: usr=0 sys=0 iow=0 idle=99
00000023 2011-10-24 15:30:06 31976 31989 SYS: PU= 0% MU= 4% MAL=237328 MMP=0 SBK=237328 TOT=264K RAM=675256K SWP=0K
00000024 2011-10-24 15:30:06 31976 31989 DSK: [sdb] r/s=0.0 kr/s=0.0 w/s=2.3 kw/s=9.3 bsy=0 NIC: rxp/s=17.9 rxk/s=2.5 tx
p/s=15.9 txk/s=2.5 CPU: usr=0 sys=0 iow=0 idle=100
00000025 2011-10-24 15:31:06 31976 31989 SYS: PU= 0% MU= 4% MAL=237328 MMP=0 SBK=237328 TOT=264K RAM=675496K SWP=0K
00000026 2011-10-24 15:31:06 31976 31989 DSK: [sdb] r/s=0.0 kr/s=0.0 w/s=2.5 kw/s=10.0 bsy=0 NIC: rxp/s=18.3 rxk/s=2.5 t
xp/s=16.2 txk/s=2.5 CPU: usr=0 sys=0 iow=0 idle=99
00000027 2011-10-24 15:32:06 31976 31989 SYS: PU= 0% MU= 4% MAL=237328 MMP=0 SBK=237328 TOT=264K RAM=675432K SWP=0K
00000028 2011-10-24 15:32:06 31976 31989 DSK: [sdb] r/s=0.0 kr/s=0.0 w/s=2.6 kw/s=10.5 bsy=0 NIC: rxp/s=17.9 rxk/s=2.5 t
xp/s=15.9 txk/s=2.5 CPU: usr=0 sys=0 iow=0 idle=99
00000029 2011-10-24 15:33:06 31976 31989 SYS: PU= 0% MU= 4% MAL=237328 MMP=0 SBK=237328 TOT=264K RAM=675512K SWP=0K
0000002A 2011-10-24 15:33:06 31976 31989 DSK: [sdb] r/s=0.0 kr/s=0.0 w/s=2.3 kw/s=9.4 bsy=0 NIC: rxp/s=17.9 rxk/s=2.5 tx
p/s=15.9 txk/s=2.5 CPU: usr=0 sys=0 iow=0 idle=99
0000002B 2011-10-24 15:34:06 31976 31989 SYS: PU= 0% MU= 4% MAL=237328 MMP=0 SBK=237328 TOT=264K RAM=675444K SWP=0K
0000002C 2011-10-24 15:34:06 31976 31989 DSK: [sdb] r/s=0.0 kr/s=0.1 w/s=2.6 kw/s=10.3 bsy=0 NIC: rxp/s=18.1 rxk/s=2.5 t
xp/s=16.1 txk/s=2.5 CPU: usr=0 sys=0 iow=0 idle=99
0000002D 2011-10-24 15:35:06 31976 31989 SYS: PU= 0% MU= 4% MAL=237328 MMP=0 SBK=237328 TOT=264K RAM=675368K SWP=0K
0000002E 2011-10-24 15:35:06 31976 31989 DSK: [sdb] r/s=0.0 kr/s=0.0 w/s=2.4 kw/s=9.7 bsy=0 NIC: rxp/s=17.9 rxk/s=2.5 tx
p/s=15.9 txk/s=2.5 CPU: usr=0 sys=0 iow=0 idle=99
0000002F 2011-10-24 15:35:06 31976 31990 Watchdog : Marking Machine as Down! [10.222.195.159:6600]
00000030 2011-10-24 15:35:06 31976 31990 Watchdog : Marking Machine as Down! [10.93.95.117:6600]
00000031 2011-10-24 15:36:06 31976 31989 SYS: PU= 0% MU= 4% MAL=235120 MMP=0 SBK=235120 TOT=264K RAM=675768K SWP=0K
00000032 2011-10-24 15:36:06 31976 31989 DSK: [sdb] r/s=0.0 kr/s=0.0 w/s=2.6 kw/s=10.2 bsy=0 NIC: rxp/s=18.1 rxk/s=2.5 t
xp/s=16.0 txk/s=2.5 CPU: usr=0 sys=0 iow=0 idle=99
00000033 2011-10-24 15:37:06 31976 31989 SYS: PU= 0% MU= 4% MAL=237328 MMP=0 SBK=237328 TOT=264K RAM=675856K SWP=0K
00000034 2011-10-24 15:37:06 31976 31989 DSK: [sdb] r/s=0.0 kr/s=0.0 w/s=2.4 kw/s=9.5 bsy=0 NIC: rxp/s=18.1 rxk/s=2.5 tx
p/s=16.1 txk/s=2.5 CPU: usr=0 sys=0 iow=0 idle=99
00000035 2011-10-24 15:38:06 31976 31989 SYS: PU= 0% MU= 4% MAL=237328 MMP=0 SBK=237328 TOT=264K RAM=675788K SWP=0K
00000036 2011-10-24 15:38:06 31976 31989 DSK: [sdb] r/s=0.0 kr/s=0.0 w/s=2.6 kw/s=10.4 bsy=0 NIC: rxp/s=17.8 rxk/s=2.4 t
xp/s=15.8 txk/s=2.5 CPU: usr=0 sys=0 iow=0 idle=99
00000037 2011-10-24 15:39:06 31976 31989 SYS: PU= 0% MU= 4% MAL=237328 MMP=0 SBK=237328 TOT=264K RAM=675876K SWP=0K
00000038 2011-10-24 15:39:06 31976 31989 DSK: [sdb] r/s=0.0 kr/s=0.0 w/s=2.5 kw/s=10.0 bsy=0 NIC: rxp/s=18.2 rxk/s=2.5 t
xp/s=16.2 txk/s=2.5 CPU: usr=0 sys=0 iow=0 idle=99
00000039 2011-10-24 15:40:06 31976 31989 SYS: PU= 0% MU= 4% MAL=237328 MMP=0 SBK=237328 TOT=264K RAM=676116K SWP=0K
0000003A 2011-10-24 15:40:06 31976 31989 DSK: [sdb] r/s=0.0 kr/s=0.0 w/s=2.4 kw/s=9.5 bsy=0 NIC: rxp/s=17.9 rxk/s=2.5 tx
p/s=15.9 txk/s=2.5 CPU: usr=0 sys=0 iow=0 idle=99
0000003B 2011-10-24 15:40:06 31976 31990 10056: /var/jenkins/workspace/Release-3.2.0/src/thorlcr/master/thgraphmanager.c
pp(786) : abortThor : Watchdog has lost contact with Thor slave: 10.222.195.159:6600 (Process terminated or node down?)
0000003C 2011-10-24 15:40:06 31976 31990 abortThor called
0000003D 2011-10-24 15:40:06 31976 31990 Stopping jobManager
0000003E 2011-10-24 15:40:06 31976 31990 aborting any current active job
0000003F 2011-10-24 15:40:06 31976 31976 acceptConversation aborted - terminating
00000040 2011-10-24 15:40:06 31976 31976 ,Progress,Thor,Terminate,mythor,mythor,mythor.thor
00000041 2011-10-24 15:40:06 31976 31976 ThorMaster terminated OK
00000042 2011-10-24 15:40:07 31976 31976 priority set id=140408941676288 policy=0 pri=0 PID=31976
00000043 2011-10-24 15:40:07 31976 31976 Stopping watchdog
00000044 2011-10-24 15:40:07 31976 31976 Stopped watchdog
00000045 2011-10-24 15:40:07 31976 31976 Thor closing down 6
00000046 2011-10-24 15:40:07 31976 31976 Thor closing down 5
00000047 2011-10-24 15:40:07 31976 31976 Thor closing down 4
00000048 2011-10-24 15:40:07 31976 31976 Thor closing down 3
00000049 2011-10-24 15:40:07 31976 31976 Thor closing down 2
0000004A 2011-10-24 15:40:07 31976 31976 Thor closing down 1
nvasil
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 105
Joined: Mon Oct 17, 2011 6:48 pm

Mon Oct 24, 2011 10:46 pm Change Time Zone

Thanks, this master log shows no jobs have run, but still lost contact with the slave nodes:

0000002F 2011-10-24 15:35:06 31976 31990 Watchdog : Marking Machine as Down! [10.222.195.159:6600]
00000030 2011-10-24 15:35:06 31976 31990 Watchdog : Marking Machine as Down! [10.93.95.117:6600]

These message will be first reported after ~10 minutes, if there have been no wathdog messages from the slaves.
15:35:06 is precisely 10 minutes after Thor started.
So it looks like the master is not receiving any watchdog messages from the slaves, at least in this run.

Could it be that the watchdog messages are being blocked by a firewall issue?
Watchdog messages are UDP packets and will be on port 6606 by default.

The slave logs won't have much in them during this period if they're not running a job, but assuming the problem persists, it would be interesting to check the slave processes during the 1st 10-15 minutes of a thor start.
If the jobs are otherwise processing normally, until the 10 minute mark when the 'Watchdog : Marking Machine as Down!' will appear in the master log, followed 5 minutes later, by the master reporting 'Watchdog has lost contact with Thor slave..' and restarting.
Then it definitively suggests, the udp packets are being blocked.

Hope that helps.
jsmith
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 80
Joined: Tue Jul 19, 2011 12:58 pm

Mon Oct 24, 2011 11:09 pm Change Time Zone

I would doubt it is a firewall issue. I am running the system on EC2 following instructions from HPCC. Some queries run fine for example
if rdata:=rdata1+rdata2+rdata3 it runs fine. I can run many queries (simple ones) on the cluster. When they become more data intensive it fails
Interesting enough if I add more things to be done it doesn't run. Also did you notice that the process on the slave actually dies because of a JMALLOC error?

People from HPCC have reassured me that they run everything fine on the cloud so it cannot be a cloud issue.
nvasil
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 105
Joined: Mon Oct 17, 2011 6:48 pm

Tue Oct 25, 2011 1:40 am Change Time Zone

So I did try the most recent version HPCC Platform v03.2.2-1 and I opened the UDP port you suggested and it worked.

Thanks very much
nvasil
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 105
Joined: Mon Oct 17, 2011 6:48 pm


Return to ECL

Who is online

Users browsing this forum: No registered users and 1 guest

cron