

Deploying Indexes via Roxie Package


Tue Feb 16, 2021 11:43 am

Hi

I'm having several issues when deploying packagemaps to my roxies.

Problem 1

When deploying each hour throughout the day, I sometimes get this message on a roxie:
Code:
Exception
Reported by: Roxie
Message: Query roxiewarmup.2 is suspended because Could not open file /var/lib/HPCCSystems/hpcc-data/roxie/globex/key_multiuserid_multiusersandmultipremium_202102161000._37_of_37
When I look at the logs (below), I can see the "No more data files to copy" message. However, the data usually takes about 40 minutes to copy to the roxies, and this message appears after only about 15 minutes. The files in question exist on my Thor cluster and are accessible. Once I receive this message I have to run
Code:
sudo service hpcc-init -c myroxie restart
to get the files copying again.

Code:
0000599D PRG 2021-02-16 11:05:21.822  2846  2853 "Background copying //192.168.24.124:7100/var/lib/HPCCSystems/hpcc-data/thor/globex/key_airesourcescontentngrams_202102161000._4_of_145 to /var/lib/HPCCSystems/hpcc-data/roxie/globex/key_airesourcescontentngrams_202102161000._4_of_145"
0000599E PRG 2021-02-16 11:05:22.307  2846  2853 "Background copy to /var/lib/HPCCSystems/hpcc-data/roxie/globex/key_airesourcescontentngrams_202102161000._4_of_145 complete in 485 ms (32.7 MB/sec)"
0000599F PRG 2021-02-16 11:05:22.412  2846  2853 "Background copying //192.168.24.123:7100/var/lib/HPCCSystems/hpcc-data/thor/globex/key_airesourcescontentngrams_202102161000._3_of_145 to /var/lib/HPCCSystems/hpcc-data/roxie/globex/key_airesourcescontentngrams_202102161000._3_of_145"
000059A0 PRG 2021-02-16 11:05:22.688  2846  2853 "Background copy to /var/lib/HPCCSystems/hpcc-data/roxie/globex/key_airesourcescontentngrams_202102161000._3_of_145 complete in 276 ms (51.4 MB/sec)"
000059A1 PRG 2021-02-16 11:05:22.795  2846  2853 "Background copying //192.168.24.122:7100/var/lib/HPCCSystems/hpcc-data/thor/globex/key_airesourcescontentngrams_202102161000._2_of_145 to /var/lib/HPCCSystems/hpcc-data/roxie/globex/key_airesourcescontentngrams_202102161000._2_of_145"
000059A2 PRG 2021-02-16 11:05:23.436  2846  2853 "Background copy to /var/lib/HPCCSystems/hpcc-data/roxie/globex/key_airesourcescontentngrams_202102161000._2_of_145 complete in 642 ms (34.2 MB/sec)"
000059A3 PRG 2021-02-16 11:05:23.538  2846  2853 "Background copying //192.168.24.121:7100/var/lib/HPCCSystems/hpcc-data/thor/globex/key_airesourcescontentngrams_202102161000._1_of_145 to /var/lib/HPCCSystems/hpcc-data/roxie/globex/key_airesourcescontentngrams_202102161000._1_of_145"
000059A4 PRG 2021-02-16 11:05:24.277  2846  2853 "Background copy to /var/lib/HPCCSystems/hpcc-data/roxie/globex/key_airesourcescontentngrams_202102161000._1_of_145 complete in 739 ms (31.7 MB/sec)"
000059A5 PRG 2021-02-16 11:05:24.328  2846  2853 "No more data files to copy"
000059A6 PRG 2021-02-16 11:05:32.803  2846  2852 "SYS: LPT=15862 APT=316692 PU=  2% MU= 10% MAL=2362552320 MMP=2048638976 SBK=313913344 TOT=2311136K RAM=5312660K SWP=2528K RMU=  1% RMX=1023M"
000059A7 PRG 2021-02-16 11:05:32.804  2846  2852 "DSK: [sda] r/s=0.0 kr/s=0.0 w/s=164.3 kw/s=23817.0 bsy=54 NIC: [bond0] rxp/s=17978.0 rxk/s=25549.4 txp/s=1578.3 txk/s=110.4 rxerrs=0 rxdrps=166 txerrs=0 txdrps=0 CPU: usr=0 sys=1 iow=1 idle=97"
000059A8 PRG 2021-02-16 11:05:44.953  2846  8078 "PING: 1 replies received, average delay 781us"
000059A9 PRG 2021-02-16 11:06:32.825  2846  2852 "SYS: LPT=15862 APT=316692 PU=  0% MU= 10% MAL=2362552320 MMP=2048638976 SBK=313913344 TOT=2311136K RAM=5312004K SWP=2528K RMU=  1% RMX=1023M"
000059AA PRG 2021-02-16 11:06:32.826  2846  2852 "DSK: [sda] r/s=0.0 kr/s=0.0 w/s=1.8 kw/s=8.9 bsy=0 NIC: [bond0] rxp/s=13.2 rxk/s=4.1 txp/s=1.7 txk/s=0.7 rxerrs=0 rxdrps=162 txerrs=0 txdrps=0 CPU: usr=0 sys=0 iow=0 idle=99"
000059AB PRG 2021-02-16 11:06:44.954  2846  8078 "PING: 1 replies received, average delay 236us"
000059AC PRG 2021-02-16 11:07:32.847  2846  2852 "SYS: LPT=15862 APT=316692 PU=  0% MU= 10% MAL=2362552320 MMP=2048638976 SBK=313913344 TOT=2311136K RAM=5314436K SWP=2528K RMU=  1% RMX=1023M"
000059AD PRG 2021-02-16 11:07:32.847  2846  2852 "DSK: [sda] r/s=0.0 kr/s=0.0 w/s=0.4 kw/s=2.7 bsy=0 NIC: [bond0] rxp/s=16.0 rxk/s=4.2 txp/s=2.1 txk/s=0.7 rxerrs=0 rxdrps=161 txerrs=0 txdrps=0 CPU: usr=0 sys=0 iow=0 idle=99"
000059AE PRG 2021-02-16 11:07:44.954  2846  8078 "PING: 1 replies received, average delay 235us"
000059AF PRG 2021-02-16 11:08:32.866  2846  2852 "SYS: LPT=15862 APT=316692 PU=  0% MU= 10% MAL=2362552320 MMP=2048638976 SBK=313913344 TOT=2311136K RAM=5314436K SWP=2528K RMU=  1% RMX=1023M"
000059B0 PRG 2021-02-16 11:08:32.867  2846  2852 "DSK: [sda] r/s=0.0 kr/s=0.0 w/s=5.1 kw/s=28.0 bsy=1 NIC: [bond0] rxp/s=28.2 rxk/s=6.2 txp/s=12.8 txk/s=5.9 rxerrs=0 rxdrps=161 txerrs=0 txdrps=0 CPU: usr=0 sys=0 iow=0 idle=99"
000059B1 PRG 2021-02-16 11:08:36.234  2846  9084 "[192.168.20.25:9876{2}] FAILED: "
000059B2 PRG 2021-02-16 11:08:36.234  2846  9084 "[192.168.20.25:9876{2}] EXCEPTION: Query roxiewarmup.2 is suspended because Could not open file /var/lib/HPCCSystems/hpcc-data/roxie/globex/key_multiuserid_multiusersandmultipremium_202102161000._37_of_37"
000059B3 PRG 2021-02-16 11:08:44.954  2846  8078 "PING: 1 replies received, average delay 160us"
000059B4 PRG 2021-02-16 11:09:32.889  2846  2852 "SYS: LPT=15862 APT=316692 PU=  0% MU= 10% MAL=2362552320 MMP=2048638976 SBK=313913344 TOT=2311136K RAM=5254920K SWP=2528K RMU=  1% RMX=1023M"
000059B5 PRG 2021-02-16 11:09:32.889  2846  2852 "DSK: [sda] r/s=0.0 kr/s=0.0 w/s=0.8 kw/s=6.9 bsy=0 NIC: [bond0] rxp/s=17.4 rxk/s=4.9 txp/s=5.8 txk/s=2.3 rxerrs=0 rxdrps=160 txerrs=0 txdrps=0 CPU: usr=0 sys=0 iow=0 idle=99"
000059B6 PRG 2021-02-16 11:09:44.955  2846  8078 "PING: 1 replies received, average delay 256us"
000059B7 PRG 2021-02-16 11:10:32.910  2846  2852 "SYS: LPT=15862 APT=316692 PU=  0% MU= 10% MAL=2362552320 MMP=2048638976 SBK=313913344 TOT=2311136K RAM=5255156K SWP=2528K RMU=  1% RMX=1023M"
000059B8 PRG 2021-02-16 11:10:32.910  2846  2852 "DSK: [sda] r/s=0.0 kr/s=0.0 w/s=0.5 kw/s=3.1 bsy=0 NIC: [bond0] rxp/s=12.2 rxk/s=4.0 txp/s=1.4 txk/s=0.6 rxerrs=0 rxdrps=162 txerrs=0 txdrps=0 CPU: usr=0 sys=0 iow=0 idle=99"
000059B9 PRG 2021-02-16 11:10:40.757  2846  9084 "connectChild connecting to 192.168.20.25:9876"
000059BA PRG 2021-02-16 11:10:40.757  2846  9084 "connectChild connected to 192.168.20.25:9876"
000059BB PRG 2021-02-16 11:10:40.758  2846 23600 "[192.168.20.25:9876{4}] doControlMessage - control:state"
000059BC PRG 2021-02-16 11:10:44.955  2846  8078 "PING: 1 replies received, average delay 232us"
000059BD PRG 2021-02-16 11:11:32.931  2846  2852 "SYS: LPT=15862 APT=316692 PU=  0% MU= 10% MAL=2362552320 MMP=2048638976 SBK=313913344 TOT=2311136K RAM=5258284K SWP=2528K RMU=  1% RMX=1023M"
000059BE PRG 2021-02-16 11:11:32.931  2846  2852 "DSK: [sda] r/s=0.0 kr/s=0.0 w/s=2.3 kw/s=13.9 bsy=0 NIC: [bond0] rxp/s=13.0 rxk/s=4.0 txp/s=1.2 txk/s=0.3 rxerrs=0 rxdrps=167 txerrs=0 txdrps=0 CPU: usr=0 sys=0 iow=0 idle=99"
000059BF PRG 2021-02-16 11:11:44.956  2846  8078 "PING: 1 replies received, average delay 246us"
000059C0 PRG 2021-02-16 11:11:47.464  2846  9084 "[192.168.20.25:9876{5}] doControlMessage - control:queries"
000059C1 PRG 2021-02-16 11:12:27.734  2846  9084 "RoxieMemMgr: Heap size 4096 pages, 4095 free, largest block 4095, heapLWM 0, heapHWM 128, dataBuffersActive=0, dataBufferPages=0"
000059C2 PRG 2021-02-16 11:12:32.952  2846  2852 "SYS: LPT=15862 APT=316692 PU=  0% MU= 10% MAL=2363887616 MMP=2049974272 SBK=313913344 TOT=2312440K RAM=5258076K SWP=2528K RMU=  1% RMX=1023M"
000059C3 PRG 2021-02-16 11:12:32.953  2846  2852 "DSK: [sda] r/s=1.1 kr/s=11.1 w/s=0.4 kw/s=3.7 bsy=0 NIC: [bond0] rxp/s=15.2 rxk/s=4.3 txp/s=1.9 txk/s=0.8 rxerrs=0 rxdrps=168 txerrs=0 txdrps=0 CPU: usr=0 sys=0 iow=0 idle=99"
000059C4 PRG 2021-02-16 11:12:44.956  2846  8078 "PING: 1 replies received, average delay 265us"
000059C5 PRG 2021-02-16 11:13:32.974  2846  2852 "SYS: LPT=15862 APT=316692 PU=  0% MU= 10% MAL=2363887616 MMP=2049974272 SBK=313913344 TOT=2312440K RAM=5258880K SWP=2528K RMU=  1% RMX=1023M"
000059C6 PRG 2021-02-16 11:13:32.975  2846  2852 "DSK: [sda] r/s=0.0 kr/s=0.0 w/s=0.7 kw/s=4.7 bsy=0 NIC: [bond0] rxp/s=13.4 rxk/s=4.0 txp/s=0.9 txk/s=0.2 rxerrs=0 rxdrps=162 txerrs=0 txdrps=0 CPU: usr=0 sys=0 iow=0 idle=99"
000059C7 PRG 2021-02-16 11:13:44.957  2846  8078 "PING: 1 replies received, average delay 217us"
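
To cross-check a log like the one above, a small script can list which file parts were reported as copied and which parts later failed to open without any copy being logged. This is only a rough sketch: the regular expressions are based solely on the message wording quoted in this post, and the function name `summarize` and the variable names are made up for illustration. Other Roxie versions may word these messages differently.

```python
import re

# Patterns derived only from the log lines quoted in this post (assumption:
# the wording is stable for this Roxie version).
COPY_DONE = re.compile(r'"Background copy to (\S+) complete in')
OPEN_FAIL = re.compile(r'Could not open file (\S+?)"?$')
NO_MORE = "No more data files to copy"


def summarize(log_lines):
    """Return (copied, missing, finished).

    copied   -- file parts with a logged "Background copy ... complete"
    missing  -- file parts that failed to open and have no logged copy
    finished -- True if "No more data files to copy" was logged
    """
    copied, failed, finished = set(), set(), False
    for line in log_lines:
        line = line.rstrip()
        m = COPY_DONE.search(line)
        if m:
            copied.add(m.group(1))
            continue
        m = OPEN_FAIL.search(line)
        if m:
            failed.add(m.group(1))
            continue
        if NO_MORE in line:
            finished = True
    return copied, failed - copied, finished


# Three lines taken from the log above.
sample = [
    '000059A4 PRG 2021-02-16 11:05:24.277  2846  2853 "Background copy to /var/lib/HPCCSystems/hpcc-data/roxie/globex/key_airesourcescontentngrams_202102161000._1_of_145 complete in 739 ms (31.7 MB/sec)"',
    '000059A5 PRG 2021-02-16 11:05:24.328  2846  2853 "No more data files to copy"',
    '000059B2 PRG 2021-02-16 11:08:36.234  2846  9084 "[192.168.20.25:9876{2}] EXCEPTION: Query roxiewarmup.2 is suspended because Could not open file /var/lib/HPCCSystems/hpcc-data/roxie/globex/key_multiuserid_multiusersandmultipremium_202102161000._37_of_37"',
]

copied, missing, finished = summarize(sample)
for path in sorted(missing):
    print("never copied:", path)
```

If `finished` is true while `missing` is non-empty, the copy queue reported itself empty even though a part named in the exception was never logged as copied, which matches the symptom described in Problem 1.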



Problem 2

Our second issue occurs when deploying our Roxie package to 3 roxies: roughly one in every 3 deploys fails, and the roxies fail to accept the SOAP request to replace the current package.

We managed to get some information from our logs:
Code:
0000C6D4 PRG 2021-02-15 07:49:02.701 41665 42734 "MP: Possible clash between 192.168.24.120:7070->192.168.20.25:7339 0(0)"
0000DA3D PRG 2021-02-15 10:50:26.156 41665 42734 "MP: Possible clash between 192.168.24.120:7070->192.168.20.26:7166 0(0)"
0000D4A2 PRG 2021-02-15 10:49:09.333 41665 42734 "MP: Possible clash between 192.168.24.120:7070->192.168.20.27:7475 0(0)"
0000C5F3 PRG 2021-02-15 06:50:23.516 41665 42734 "MP: Possible clash between 192.168.24.120:7070->192.168.20.26:7156 0(0)"
0000C5F4 PRG 2021-02-15 06:50:23.516 41665 42734 "Message Passing - removing stale socket to 192.168.20.26:7156"
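
To see whether these "Possible clash" messages line up with the hourly deploys, one option is to count them per hour and per endpoint. The sketch below assumes the log format is exactly as quoted above; the function name `clash_counts` is hypothetical, and the script makes no claim about which side of the `->` is Dali and which is a roxie.

```python
import re
from collections import Counter

# Matches the timestamp and the two endpoints of a "Possible clash" line,
# based only on the lines quoted in this post.
CLASH = re.compile(
    r'(\d{4}-\d{2}-\d{2}) (\d{2}):\d{2}:\d{2}\.\d+ .*'
    r'"MP: Possible clash between (\S+)->(\S+?):(\d+) '
)


def clash_counts(log_lines):
    """Count clash messages per (date, hour, IP on the right of '->')."""
    counts = Counter()
    for line in log_lines:
        m = CLASH.search(line)
        if m:
            date, hour, _left, right_ip, _right_port = m.groups()
            counts[(date, hour, right_ip)] += 1
    return counts


# The five lines quoted above.
sample = [
    '0000C6D4 PRG 2021-02-15 07:49:02.701 41665 42734 "MP: Possible clash between 192.168.24.120:7070->192.168.20.25:7339 0(0)"',
    '0000DA3D PRG 2021-02-15 10:50:26.156 41665 42734 "MP: Possible clash between 192.168.24.120:7070->192.168.20.26:7166 0(0)"',
    '0000D4A2 PRG 2021-02-15 10:49:09.333 41665 42734 "MP: Possible clash between 192.168.24.120:7070->192.168.20.27:7475 0(0)"',
    '0000C5F3 PRG 2021-02-15 06:50:23.516 41665 42734 "MP: Possible clash between 192.168.24.120:7070->192.168.20.26:7156 0(0)"',
    '0000C5F4 PRG 2021-02-15 06:50:23.516 41665 42734 "Message Passing - removing stale socket to 192.168.20.26:7156"',
]

counts = clash_counts(sample)
for (date, hour, ip), n in sorted(counts.items()):
    print(f"{date} {hour}:00  {ip}  clashes={n}")
```

In the lines quoted here, every clash falls at roughly 49 to 50 minutes past the hour, so grouping a full day's log this way may show whether the pattern follows the deploy schedule.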


If I clone the failed job to force the package in, I start getting the issues described in Problem 1.

Can anyone please shed any light or push us in the right direction?

We are using version 7.8.46-1; however, we are upgrading to 7.12.24.

Thanks

David
daviddasher
 

Mon Apr 19, 2021 4:32 pm

Hi David,

Sorry for the delay in replying! Did anyone reach out to you yet with a resolution?
If you haven't already done so, this looks like something that should be reported to our Issue Tracker.

https://track.hpccsystems.com/secure/Dashboard.jspa

Thank you!

Bob
bforeman
Community Advisory Board Member

Tue Apr 20, 2021 4:21 pm

Hi Bob

No worries at all.

It turns out we had an issue with a firewall, which would terminate the connection between Dali and roxie after an hour. Initially we created a new set of roxies in the same subnet, which eliminated the issue, and then we tracked it back to the firewall rule on the original roxies.

I do still need to report this via the tracker, so I'll chase our firewall team for all the details.

Thanks for checking and I hope you are well.

Thanks

David
daviddasher
 

