Sat Aug 18, 2018 11:44 pm


Cluster not updating with new environment.xml


Thu Jun 05, 2014 8:45 pm

Hello,

I am trying to set up a 2-node HPCC cluster. I have followed the steps in the "Configuring a Multinode System" section of the HPCC Installation guide. When I initially pushed out the environment.xml to all of the nodes, I had the wrong IP address for one of the machines.

To mitigate this issue, I pushed out a new environment.xml (with the correct IP address) to the machines and restarted them. When starting the cluster, thor will not start.

Upon further inspection, it seems the thor cluster is still looking for the machine with the wrong IP address. When I look at the thor cluster in ECL Watch, it clearly lists the wrong IP address in its list of machines, but does not have the new, correct IP address.

I believe the old IP address is somehow cached in the system. What can I do to make HPCC read in the new IP address?

Thanks for your help.
fmorstatter
 
Posts: 10
Joined: Thu Jun 05, 2014 8:28 pm

Fri Jun 06, 2014 11:50 am

You might want to check out this older forum thread:

http://hpccsystems.com/bb/viewtopic.php?f=14&t=932

If that doesn't work, you can try a heavy-handed approach. The system copies excerpts of the environment.xml file to another location for runtime purposes (this allows environment.xml to be updated without affecting a running cluster). Those copies are supposed to be rebuilt during startup, but it's possible that they are not in this case. So, you can try shutting down the cluster, deleting /var/lib/HPCCSystems/mythor/slaves on each of your nodes, then starting the cluster back up. That file is one of the excerpts, and it will be rebuilt if missing.
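Before deleting the file, it may be worth confirming that it really does still hold the old address. Here is a minimal sketch of that check. It assumes the slaves file is plain text with one address per line, which this thread does not confirm, and the function name and the 10.0.0.99 address are made up for illustration.

```python
def find_stale_entries(slaves_text, expected_ips):
    """Return addresses found in the slaves file text that are not expected.

    Assumption: the file lists one address per line. Blank lines are ignored.
    """
    expected = set(expected_ips)
    stale = []
    for line in slaves_text.splitlines():
        ip = line.strip()
        if ip and ip not in expected:
            stale.append(ip)
    return stale


# Illustrative input only: 10.0.0.99 stands in for the wrong address.
sample = "128.2.219.77\n10.0.0.99\n"
print(find_stale_entries(sample, ["128.2.219.77"]))
```

On a real node you would read the text from /var/lib/HPCCSystems/mythor/slaves (with the cluster stopped) and pass in the addresses you expect from your environment.xml. An empty result would suggest the excerpt is already up to date and the problem lies elsewhere.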

Hope one of these helps.

Cheers,

Dan
DSC
Community Advisory Board Member
 
Posts: 552
Joined: Tue Oct 18, 2011 4:45 pm

Fri Jun 06, 2014 1:19 pm

Just to add to what Dan said, here is some input from our HPCC team:

1. Validate the environment XML. Compare the checksum of the file you modified with the checksum of the running XML that the system reads:

    md5sum /etc/HPCCSystems/source/<modified xml>
    md5sum /etc/HPCCSystems/environment.xml

The two checksums should match.

2. Make sure you restart all the components so that they read in the new XML (not just Thor). ECL Watch gets its information about the components and the environment from Dali. This is probably the root cause.
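If you need to run the same comparison on several nodes, the checksum step can also be scripted. This is only a sketch using Python's standard hashlib module; the function names are my own, and the path of the edited file is whatever you used on your system.

```python
import hashlib


def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have the same MD5 digest."""
    return md5_of(path_a) == md5_of(path_b)
```

For example, same_content(edited_path, "/etc/HPCCSystems/environment.xml") should return True on every node once the new file has been pushed out, where edited_path is the file you modified.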


HTH,

Bob
bforeman
Community Advisory Board Member
 
Posts: 975
Joined: Wed Jun 29, 2011 7:13 pm

Fri Jun 06, 2014 4:09 pm

Thank you for the pointers. My machine can now see the other node. I have one more hitch that is preventing me from running thor.

When I go to start the services on the slave machine, it says "No components on this node as defined by /etc/HPCCSystems/environment.xml". When I inspect the file manually, it appears to indicate that the node is a mythor slave. I've attached the configuration file. It is called environment.xml on the server; I had to add the .txt extension to get past the forum filters.
Attachments
environment.xml.txt
fmorstatter
 
Posts: 10
Joined: Thu Jun 05, 2014 8:28 pm

Mon Jun 09, 2014 10:33 am

>When I go to start the services on the slave machine it says "No components on this node as defined by /etc/HPCCSystems/environment.xml".


That sounds correct.

The thorslaves are directly managed, started and stopped by the thormaster, not by the service.
So unless there are other components on the slave node, a 'No components on this node..' is expected.

It might be slightly clearer perhaps, if that message said 'No components to start on this node ..' or similar.
jsmith
Community Advisory Board Member
 
Posts: 70
Joined: Tue Jul 19, 2011 12:58 pm

Mon Jun 09, 2014 1:27 pm

Thanks for the help with this issue. I thought this was causing the other issues I was seeing in ECL Watch, but since this is expected behavior I'll explain what else is going on.

When I go to ECL Watch, and click "target clusters", "ThorCluster - thor" switches between a green light saying everything is OK and a warning sign saying "ThorCluster - thorCluster not attached". When I click the error message, it says "0 2014-06-09 13:25:31 GMT: Cannot connect to SDS cluster mythor".

Also, any workunit I submit to the cluster gets blocked and stays in the blocked state until it times out.

Any idea why this might be happening?
fmorstatter
 
Posts: 10
Joined: Thu Jun 05, 2014 8:28 pm

Mon Jun 09, 2014 3:24 pm

Hi, I was wondering what version of the platform you're currently running as well.
clo
 
Posts: 51
Joined: Thu May 12, 2011 11:57 am

Mon Jun 09, 2014 3:32 pm

"ThorCluster - thorCluster not attached"

That is probably an indication that thormaster is not running.
Normally thormaster (and most other components) will restart automatically. So while it wouldn't be surprising to see '... not attached' for a short period if a component such as Thor recycled, it would be surprising for that state to be sustained.
For example, if you deliberately kill the thormaster process, you can briefly reproduce the "ThorCluster - thorCluster not attached" state, but thormaster will be rerun within a few seconds.

So if it's consistently '.. not attached' for some time, it sounds like something is preventing thormaster from starting again, or the process is alive but defunct in some way.

I suspect the current thormaster log, taken while the system is in this state, will give some clues about what's going on.

Can you attach it here?
Thanks.
jsmith
Community Advisory Board Member
 
Posts: 70
Joined: Tue Jul 19, 2011 12:58 pm

Mon Jun 09, 2014 6:54 pm

The last few lines from my log file are below. The full file is over the 256 KiB limit, so I cannot upload it. You are right, it looks like it is not seeing the slave correctly. Thanks for any insight you can offer on this.


------------------------------------------------------------------------

0000000E 2014-06-09 14:51:38.153 8000 8000 "ThorMaster version 4.1, Started on 128.2.218.180:20000"
0000000D 2014-06-09 14:51:38.153 8000 8012 "Started watchdog"
0000000F 2014-06-09 14:51:38.153 8000 8000 "Thor name = mythor, queue = thor.thor, nodeGroup = mythor"
00000010 2014-06-09 14:51:38.153 8000 8000 "Creating sentinel file thor.sentinel for rerun from script"
00000011 2014-06-09 14:51:38.153 8000 8000 "Waiting for 1 slaves to register"
00000012 2014-06-09 14:51:38.153 8000 8000 "Verifying connection to slave 1"
00000013 2014-06-09 14:51:38.187 8000 8000 "verified connection with 128.2.219.77:20100"
00000014 2014-06-09 14:51:38.187 8000 8000 "Slaves connected, initializing.."
00000015 2014-06-09 14:51:38.187 8000 8000 "Initialization sent to slave group"
00000016 2014-06-09 14:51:38.188 8000 8000 "Registration confirmation from 128.2.219.77:20100"
00000017 2014-06-09 14:51:38.188 8000 8000 "Slave 1 (128.2.219.77:20100) registered"
00000018 2014-06-09 14:51:38.188 8000 8000 "Slaves initialized"
00000019 2014-06-09 14:51:38.188 8000 8000 "verifying mp connection to rest of cluster"
0000001A 2014-06-09 14:51:38.188 8000 8000 "verified mp connection to rest of cluster"
0000001B 2014-06-09 14:51:38.188 8000 8000 ",Progress,Thor,Startup,mythor,mythor,thor.thor,//128.2.218.180/var/log/HPCCSystems/mythor/thormaster.2014_06_09.log"
0000001C 2014-06-09 14:51:38.188 8000 8000 "Listening for graph"
0000001D 2014-06-09 14:51:38.191 8000 8013 "WARNING: /var/lib/jenkins/workspace/CE-Candidate-with-plugins-4.2.4-3/CE/ubuntu-12.04-amd64/HPCC-Platform/system/mp/mpcomm.cpp(2225) : CInterCommunicator: ignoring closed endpoint: 128.2.219.77:20100"
0000001E 2014-06-09 14:51:38.191 8000 8008 "WARNING: /var/lib/jenkins/workspace/CE-Candidate-with-plugins-4.2.4-3/CE/ubuntu-12.04-amd64/HPCC-Platform/system/mp/mpcomm.cpp(2225) : CInterCommunicator: ignoring closed endpoint: 128.2.219.77:20100"
0000001F 2014-06-09 14:51:38.191 8000 8012 "ERROR: 10056: /var/lib/jenkins/workspace/CE-Candidate-with-plugins-4.2.4-3/CE/ubuntu-12.04-amd64/HPCC-Platform/thorlcr/master/thgraphmanager.cpp(787) : abortThor : Watchdog has lost connectivity with Thor slave: 128.2.219.77:20100 (Process terminated or node down?)"
00000020 2014-06-09 14:51:38.191 8000 8012 "abortThor called"
00000021 2014-06-09 14:51:38.191 8000 8012 "Stopping jobManager"
00000022 2014-06-09 14:51:38.191 8000 8012 "aborting any current active job"
00000023 2014-06-09 14:51:38.191 8000 8012 "Watchdog : Unknown Machine! [0.0.0.0]"
00000024 2014-06-09 14:51:38.193 8000 8000 ",Progress,Thor,Terminate,mythor,mythor,thor.thor"
00000025 2014-06-09 14:51:38.193 8000 8000 "ThorMaster terminated OK"
00000026 2014-06-09 14:51:39.194 8000 8000 "priority set id=140199223068416 policy=0 pri=0 PID=8000"
00000027 2014-06-09 14:51:39.194 8000 8000 "Stopping watchdog"
00000028 2014-06-09 14:51:39.194 8000 8000 "Stopped watchdog"
00000029 2014-06-09 14:51:39.205 8000 8000 "Thor closing down 6"
0000002A 2014-06-09 14:51:39.205 8000 8000 "Thor closing down 5"
0000002B 2014-06-09 14:51:39.205 8000 8000 "Thor closing down 4"
0000002C 2014-06-09 14:51:39.205 8000 8000 "Thor closing down 3"
0000002D 2014-06-09 14:51:39.205 8000 8000 "Thor closing down 2"
0000002E 2014-06-09 14:51:39.216 8000 8000 "Thor closing down 1"
00000002 2014-06-09 14:51:40.598 8205 8205 "Opened log file //128.2.218.180/var/log/HPCCSystems/mythor/thormaster.2014_06_09.log"
00000003 2014-06-09 14:51:40.598 8205 8205 "Build community_4.2.4-3"
00000004 2014-06-09 14:51:40.598 8205 8205 "calling initClientProcess Port 20000"
00000005 2014-06-09 14:51:40.599 8205 8205 "Found file 'thorgroup', using to form thor group"
00000006 2014-06-09 14:51:40.599 8205 8205 "Checking cluster replicate nodes"
00000007 2014-06-09 14:51:40.603 8205 8205 "Cluster replicate nodes check completed in 4ms"
00000008 2014-06-09 14:51:40.604 8205 8205 "Global memory size = 9008 MB"
00000009 2014-06-09 14:51:40.604 8205 8205 "RoxieMemMgr: Setting memory limit to 9445572608 bytes (9008 pages)"
0000000A 2014-06-09 14:51:40.604 8205 8205 "RoxieMemMgr: 9024 Pages successfully allocated for the pool - memsize=9462349824 base=0x7f834bf00000 alignment=1048576 bitmapSize=282"
0000000B 2014-06-09 14:51:40.606 8205 8205 "Disk space: /var/lib/HPCCSystems/hpcc-data/thor = 1403102 MB, /var/lib/HPCCSystems/hpcc-mirror/thor = 0 MB, /var/lib/HPCCSystems/mythor/temp = 1403102 MB"
0000000C 2014-06-09 14:51:40.610 8205 8205 "Starting watchdog"
0000000E 2014-06-09 14:51:40.610 8205 8205 "ThorMaster version 4.1, Started on 128.2.218.180:20000"
0000000D 2014-06-09 14:51:40.610 8205 8217 "Started watchdog"
0000000F 2014-06-09 14:51:40.610 8205 8205 "Thor name = mythor, queue = thor.thor, nodeGroup = mythor"
00000010 2014-06-09 14:51:40.610 8205 8205 "Creating sentinel file thor.sentinel for rerun from script"
00000011 2014-06-09 14:51:40.611 8205 8205 "Waiting for 1 slaves to register"
00000012 2014-06-09 14:51:40.611 8205 8205 "Verifying connection to slave 1"
00000013 2014-06-09 14:51:40.627 8205 8205 "verified connection with 128.2.219.77:20100"
00000014 2014-06-09 14:51:40.627 8205 8205 "Slaves connected, initializing.."
00000015 2014-06-09 14:51:40.628 8205 8205 "Initialization sent to slave group"
00000016 2014-06-09 14:51:40.628 8205 8205 "Registration confirmation from 128.2.219.77:20100"
00000017 2014-06-09 14:51:40.628 8205 8205 "Slave 1 (128.2.219.77:20100) registered"
00000018 2014-06-09 14:51:40.628 8205 8205 "Slaves initialized"
00000019 2014-06-09 14:51:40.628 8205 8205 "verifying mp connection to rest of cluster"
0000001A 2014-06-09 14:51:40.628 8205 8205 "verified mp connection to rest of cluster"
0000001B 2014-06-09 14:51:40.628 8205 8205 ",Progress,Thor,Startup,mythor,mythor,thor.thor,//128.2.218.180/var/log/HPCCSystems/mythor/thormaster.2014_06_09.log"
0000001C 2014-06-09 14:51:40.629 8205 8205 "Listening for graph"
0000001D 2014-06-09 14:51:40.631 8205 8213 "WARNING: /var/lib/jenkins/workspace/CE-Candidate-with-plugins-4.2.4-3/CE/ubuntu-12.04-amd64/HPCC-Platform/system/mp/mpcomm.cpp(2225) : CInterCommunicator: ignoring closed endpoint: 128.2.219.77:20100"
0000001E 2014-06-09 14:51:40.631 8205 8217 "ERROR: 10056: /var/lib/jenkins/workspace/CE-Candidate-with-plugins-4.2.4-3/CE/ubuntu-12.04-amd64/HPCC-Platform/thorlcr/master/thgraphmanager.cpp(787) : abortThor : Watchdog has lost connectivity with Thor slave: 128.2.219.77:20100 (Process terminated or node down?)"
0000001F 2014-06-09 14:51:40.631 8205 8217 "abortThor called"
00000020 2014-06-09 14:51:40.631 8205 8218 "WARNING: /var/lib/jenkins/workspace/CE-Candidate-with-plugins-4.2.4-3/CE/ubuntu-12.04-amd64/HPCC-Platform/system/mp/mpcomm.cpp(2225) : CInterCommunicator: ignoring closed endpoint: 128.2.219.77:20100"
00000021 2014-06-09 14:51:40.631 8205 8217 "Stopping jobManager"
00000022 2014-06-09 14:51:40.632 8205 8217 "aborting any current active job"
00000023 2014-06-09 14:51:40.632 8205 8217 "Watchdog : Unknown Machine! [0.0.0.0]"
00000024 2014-06-09 14:51:40.632 8205 8205 ",Progress,Thor,Terminate,mythor,mythor,thor.thor"
00000025 2014-06-09 14:51:40.632 8205 8205 "ThorMaster terminated OK"
00000026 2014-06-09 14:51:41.634 8205 8205 "priority set id=140211580438272 policy=0 pri=0 PID=8205"
00000027 2014-06-09 14:51:41.634 8205 8205 "Stopping watchdog"
00000028 2014-06-09 14:51:41.634 8205 8205 "Stopped watchdog"
00000029 2014-06-09 14:51:41.644 8205 8205 "Thor closing down 6"
0000002A 2014-06-09 14:51:41.644 8205 8205 "Thor closing down 5"
0000002B 2014-06-09 14:51:41.644 8205 8205 "Thor closing down 4"
0000002C 2014-06-09 14:51:41.644 8205 8205 "Thor closing down 3"
0000002D 2014-06-09 14:51:41.645 8205 8205 "Thor closing down 2"
0000002E 2014-06-09 14:51:41.655 8205 8205 "Thor closing down 1"
00000002 2014-06-09 14:51:43.046 8412 8412 "Opened log file //128.2.218.180/var/log/HPCCSystems/mythor/thormaster.2014_06_09.log"
00000003 2014-06-09 14:51:43.046 8412 8412 "Build community_4.2.4-3"
00000004 2014-06-09 14:51:43.046 8412 8412 "calling initClientProcess Port 20000"
00000005 2014-06-09 14:51:43.048 8412 8412 "Found file 'thorgroup', using to form thor group"
00000006 2014-06-09 14:51:43.048 8412 8412 "Checking cluster replicate nodes"
00000007 2014-06-09 14:51:43.049 8412 8412 "Cluster replicate nodes check completed in 1ms"
00000008 2014-06-09 14:51:43.050 8412 8412 "Global memory size = 9008 MB"
00000009 2014-06-09 14:51:43.050 8412 8412 "RoxieMemMgr: Setting memory limit to 9445572608 bytes (9008 pages)"
0000000A 2014-06-09 14:51:43.050 8412 8412 "RoxieMemMgr: 9024 Pages successfully allocated for the pool - memsize=9462349824 base=0x7fe563f00000 alignment=1048576 bitmapSize=282"
0000000B 2014-06-09 14:51:43.050 8412 8412 "Disk space: /var/lib/HPCCSystems/hpcc-data/thor = 1403101 MB, /var/lib/HPCCSystems/hpcc-mirror/thor = 0 MB, /var/lib/HPCCSystems/mythor/temp = 1403101 MB"
0000000C 2014-06-09 14:51:43.052 8412 8412 "Starting watchdog"
0000000E 2014-06-09 14:51:43.052 8412 8412 "ThorMaster version 4.1, Started on 128.2.218.180:20000"
0000000D 2014-06-09 14:51:43.052 8412 8424 "Started watchdog"
0000000F 2014-06-09 14:51:43.052 8412 8412 "Thor name = mythor, queue = thor.thor, nodeGroup = mythor"
00000010 2014-06-09 14:51:43.053 8412 8412 "Creating sentinel file thor.sentinel for rerun from script"
00000011 2014-06-09 14:51:43.053 8412 8412 "Waiting for 1 slaves to register"
00000012 2014-06-09 14:51:43.053 8412 8412 "Verifying connection to slave 1"
00000013 2014-06-09 14:51:43.086 8412 8412 "verified connection with 128.2.219.77:20100"
00000014 2014-06-09 14:51:43.086 8412 8412 "Slaves connected, initializing.."
00000015 2014-06-09 14:51:43.086 8412 8412 "Initialization sent to slave group"
00000016 2014-06-09 14:51:43.087 8412 8412 "Registration confirmation from 128.2.219.77:20100"
00000017 2014-06-09 14:51:43.087 8412 8412 "Slave 1 (128.2.219.77:20100) registered"
00000018 2014-06-09 14:51:43.087 8412 8412 "Slaves initialized"
00000019 2014-06-09 14:51:43.087 8412 8412 "verifying mp connection to rest of cluster"
0000001A 2014-06-09 14:51:43.087 8412 8412 "verified mp connection to rest of cluster"
0000001B 2014-06-09 14:51:43.087 8412 8412 ",Progress,Thor,Startup,mythor,mythor,thor.thor,//128.2.218.180/var/log/HPCCSystems/mythor/thormaster.2014_06_09.log"
0000001C 2014-06-09 14:51:43.087 8412 8412 "Listening for graph"
0000001D 2014-06-09 14:51:43.090 8412 8420 "WARNING: /var/lib/jenkins/workspace/CE-Candidate-with-plugins-4.2.4-3/CE/ubuntu-12.04-amd64/HPCC-Platform/system/mp/mpcomm.cpp(2225) : CInterCommunicator: ignoring closed endpoint: 128.2.219.77:20100"
0000001E 2014-06-09 14:51:43.090 8412 8425 "WARNING: /var/lib/jenkins/workspace/CE-Candidate-with-plugins-4.2.4-3/CE/ubuntu-12.04-amd64/HPCC-Platform/system/mp/mpcomm.cpp(2225) : CInterCommunicator: ignoring closed endpoint: 128.2.219.77:20100"
0000001F 2014-06-09 14:51:43.090 8412 8424 "ERROR: 10056: /var/lib/jenkins/workspace/CE-Candidate-with-plugins-4.2.4-3/CE/ubuntu-12.04-amd64/HPCC-Platform/thorlcr/master/thgraphmanager.cpp(787) : abortThor : Watchdog has lost connectivity with Thor slave: 128.2.219.77:20100 (Process terminated or node down?)"
00000020 2014-06-09 14:51:43.090 8412 8424 "abortThor called"
00000021 2014-06-09 14:51:43.090 8412 8424 "Stopping jobManager"
00000022 2014-06-09 14:51:43.090 8412 8424 "aborting any current active job"
00000023 2014-06-09 14:51:43.090 8412 8424 "Watchdog : Unknown Machine! [0.0.0.0]"
00000024 2014-06-09 14:51:43.091 8412 8412 ",Progress,Thor,Terminate,mythor,mythor,thor.thor"
00000025 2014-06-09 14:51:43.091 8412 8412 "ThorMaster terminated OK"
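With a log of this shape, where thormaster is rerun every couple of seconds, it can help to group the ERROR lines by process so the repeating failure stands out. The sketch below assumes each line has the layout seen in the excerpt above (sequence number, date, time, PID, a second numeric ID that is probably the thread ID, then a quoted message). The regular expression and function name are mine, not part of HPCC.

```python
import re

# Layout inferred from the excerpt: seq, date, time, PID, thread ID, "message".
LINE_RE = re.compile(r'^([0-9A-Fa-f]{8}) (\S+) (\S+) (\d+) (\d+) "(.*)"$')


def errors_by_process(log_text):
    """Map each PID seen in the log to the ERROR messages it wrote, in order.

    PIDs that logged no ERROR lines still appear, with an empty list.
    Lines that do not match the assumed layout are skipped.
    """
    result = {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        pid, message = match.group(4), match.group(6)
        result.setdefault(pid, [])
        if message.startswith("ERROR:"):
            result[pid].append(message)
    return result
```

Run over the excerpt above, this should list three thormaster processes (PIDs 8000, 8205 and 8412), each with the same "Watchdog has lost connectivity with Thor slave" error, which matches the restart loop described earlier in the thread.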
fmorstatter
 
Posts: 10
Joined: Thu Jun 05, 2014 8:28 pm

Mon Jun 09, 2014 11:52 pm

Just another piece of information regarding this problem:

It seems that it switches back and forth between "ThorCluster - thor" and "Cluster not attached" for several hours. After that, it settles permanently in the "ThorCluster - thor" state. The problem is that, even though everything then looks to be in good shape, when I submit a job it simply says "RUNNING" and then goes to the "FAILED" state after about 30 minutes. While it is in the "RUNNING" state, nothing is happening on any of the servers (none are using any CPU).
fmorstatter
 
Posts: 10
Joined: Thu Jun 05, 2014 8:28 pm
