

MyThor is not running in cluster

Topics related to recommendations or questions on the design for HPCC Systems clusters

Tue Feb 03, 2015 6:47 am

Hi,
I have 3 VM reservations installed with HPCC 5.0.2.1. I'm building a script that automates HPCC cluster formation when these three machines boot up.

The exchange of SSH keys and environment.xml files is successful. But when I start the hpcc-init service with the /opt/HPCCSystems/sbin/hpcc-run.sh script for the first time, every service except mythor starts. When I then restart the hpcc-init service with the same script, all the services run.

So at least one restart of all the HPCC services is required before the mythor service starts successfully. Why doesn't the mythor service run on the first start? Can this be resolved? Restarting the HPCC services on all the machines takes some time, which delays service availability for the end user.

Below is the status of the services on each machine after the first start.
X.X.X.154 hpcc-init status :
mydafilesrv ( pid 2954 ) is running...
mydfuserver ( pid 3044 ) is running...
myeclscheduler ( pid 3143 ) is running...

X.X.X.153 hpcc-init status :
mydafilesrv ( pid 2391 ) is running...
mydali ( pid 2481 ) is running...
myeclccserver ( pid 2585 ) is running...

X.X.X.63 hpcc-init status :
mydafilesrv ( pid 3260 ) is running...
myeclagent ( pid 3354 ) is running...
myesp ( pid 3450 ) is running...
mysasha ( pid 3548 ) is running...
mythor is stopped
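
For an automation script, one way to detect this situation is to parse the status output rather than read it by eye. The following is a minimal sketch, assuming the output keeps exactly the two line formats shown above ("... ( pid N ) is running..." and "... is stopped"); the function name parse_status is hypothetical and not part of HPCC.

```python
import re

# Assumed line formats, taken from the status output in this post:
#   "mydali ( pid 2481 ) is running..."
#   "mythor is stopped"
RUNNING = re.compile(r"^(\S+)\s*\(\s*pid\s+(\d+)\s*\)\s*is running")
STOPPED = re.compile(r"^(\S+) is stopped")


def parse_status(text):
    """Return (running, stopped).

    running maps component name -> pid; stopped is a list of component names.
    Lines that match neither format (host headers, blanks) are ignored.
    """
    running, stopped = {}, []
    for line in text.splitlines():
        line = line.strip()
        match = RUNNING.match(line)
        if match:
            running[match.group(1)] = int(match.group(2))
            continue
        match = STOPPED.match(line)
        if match:
            stopped.append(match.group(1))
    return running, stopped


# Status of the third machine after the first start, copied from above.
sample = """X.X.X.63 hpcc-init status :
mydafilesrv ( pid 3260 ) is running...
myeclagent ( pid 3354 ) is running...
myesp ( pid 3450 ) is running...
mysasha ( pid 3548 ) is running...
mythor is stopped"""

running, stopped = parse_status(sample)
print("stopped components:", stopped)
```

A script could use a non-empty stopped list as the signal to investigate or retry, instead of always restarting every service.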


After the first start, when I check the status of the services, the hpcc-run.sh script prints the following message.
Error found during hpcc-init_status_3795 execution.
Reference following log for more information:
/var/log/HPCCSystems/cluster/cc_hpcc-init_status_3795_20150203_012107.log

These are the last few lines of the log.
2015-02-03 01:21:12,385 - hpcc.cluster.ScriptTask.2 - ERROR - X.X.X.63: Host is alive.
X.X.X.63: Running sudo /etc/init.d/hpcc-init status

2015-02-03 01:21:12,385 - hpcc.cluster.ScriptTask.2 - INFO - result: FAILED
2015-02-03 01:21:14,128 - hpcc.cluster - INFO - script execution done.
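
The log lines above appear to follow a "timestamp - logger - LEVEL - message" layout, with some messages continuing on the next line. As a rough sketch under that assumption (the layout is inferred only from these few lines, and parse_cluster_log and failed_tasks are hypothetical helper names), a script could collect the failed tasks like this:

```python
import re

# Assumed entry layout, inferred from the log excerpt in this post:
#   "2015-02-03 01:21:12,385 - hpcc.cluster.ScriptTask.2 - INFO - result: FAILED"
ENTRY = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - (\S+) - ([A-Z]+) - (.*)$"
)


def parse_cluster_log(text):
    """Split log text into entries; lines without a timestamp extend the previous entry."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            time, logger, level, message = match.groups()
            entries.append(
                {"time": time, "logger": logger, "level": level, "message": message}
            )
        elif entries and line.strip():
            # Continuation line, e.g. the "Running sudo ..." line in the excerpt.
            entries[-1]["message"] += "\n" + line.strip()
    return entries


def failed_tasks(entries):
    """Return the logger names of entries that report 'result: FAILED'."""
    return [e["logger"] for e in entries if e["message"].startswith("result: FAILED")]


# The last few lines of the log, copied from above.
sample_log = """2015-02-03 01:21:12,385 - hpcc.cluster.ScriptTask.2 - ERROR - X.X.X.63: Host is alive.
X.X.X.63: Running sudo /etc/init.d/hpcc-init status

2015-02-03 01:21:12,385 - hpcc.cluster.ScriptTask.2 - INFO - result: FAILED
2015-02-03 01:21:14,128 - hpcc.cluster - INFO - script execution done.
"""

entries = parse_cluster_log(sample_log)
print(failed_tasks(entries))
```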
lakshmannaresh
 
Posts: 15
Joined: Tue Feb 03, 2015 5:20 am

Tue Feb 03, 2015 1:54 pm

The HPCC team took a look at your post, but we need some more information.

How are you configuring your THOR cluster with regard to the number of slave nodes?

Also, if you have the thormaster log, we would like to take a look at that as well.

Regards,

Bob
bforeman
Community Advisory Board Member
 
Posts: 975
Joined: Wed Jun 29, 2011 7:13 pm

Tue Feb 03, 2015 7:12 pm

Hi Bob,
I have attached the thormaster log.
Below are the configuration parameters passed to the envgen script to generate environment.xml.
number of thor nodes: 1
number of thor slaves per node: 1
Attachments
thormaster.2015_02_03.log
Thormaster log
(17.38 KiB) Downloaded 227 times
lakshmannaresh
 

Tue Feb 03, 2015 9:10 pm

Can you please attach a copy of your environment.xml and the thorslave log? I'll try to get to the bottom of this for you.

Michael
mgardner
 
Posts: 13
Joined: Tue Jan 20, 2015 9:30 pm

Tue Feb 03, 2015 11:30 pm

Hi Michael,
I did not back up the environment.xml file for the set of IPs that I posted earlier. I have recreated the same scenario with a new set of machines, because the VMs I work with are temporary: whenever I request VMs, I get a new set of machines. The thor cluster configuration remains the same. I have attached the thormaster log, the thorslave log, and the environment.xml file.

Below are the services running on each machine.
X.X.X.240 hpcc-init status :
mydafilesrv ( pid 3175 ) is running...
mydfuserver ( pid 5924 ) is running...
myeclscheduler ( pid 6023 ) is running...

X.X.X.70 hpcc-init status :
mydafilesrv ( pid 3266 ) is running...
myeclagent ( pid 9561 ) is running...
myesp ( pid 9657 ) is running...
mysasha ( pid 9758 ) is running...
mythor ( pid 10721 ) is running...

X.X.X.167 hpcc-init status :
mydafilesrv ( pid 3099 ) is running...
mydali ( pid 6871 ) is running...
myeclccserver ( pid 6975 ) is running...

Thanks, Michael.
Attachments
thorslave.1.2015_02_03.log
thorslave log
(1.16 KiB) Downloaded 231 times
thormaster.2015_02_03.log
thormaster log
(12.12 KiB) Downloaded 229 times
environment.txt
environment.xml file
(37.88 KiB) Downloaded 229 times
lakshmannaresh
 

Wed Feb 11, 2015 9:38 pm

Thank you for posting the files. The team is still reviewing and will circle back soon.
admin
Site Admin
 
Posts: 203
Joined: Thu Jan 27, 2011 10:58 am

Thu Feb 12, 2015 3:03 pm

Hi,

I may be looking at out-of-date log files, but I do not understand the IP addresses. The thormaster log shows:

0000000C 2015-02-03 18:05:33.968 5481 5481 "ThorMaster version 4.1, Started on X.X.X.69:20000"

which suggests that its IP address is X.X.X.69.

And it is trying to connect with a thorslave on X.X.X.167:

00000012 2015-02-03 18:05:33.973 5481 5481 "verified connection with X.X.X.167:20100"

But the thorslave log shows:

00000002 2015-02-03 18:05:33.828 3850 3850 "registering X.X.X.68:20100 - master X.X.X.70:20000"

which suggests that the slave is X.X.X.68 and the master is X.X.X.70. Can we verify all hosts and IPs again?

thanks,
mark
mkellyhpcc
 
Posts: 15
Joined: Mon Mar 10, 2014 2:51 pm

Mon Feb 16, 2015 4:10 pm

Hi Mark,
Each machine has two NICs: one is public-facing and the other is internal. Below is the pair of IPs for each node.
Master - X.X.X.69/X.X.X.70
Slave - X.X.X.68/X.X.X.167
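
With these pairs, the addresses quoted from the logs earlier in the thread can be checked against each other. This is only an illustrative sketch: the table simply restates the pairs listed above (with the masked octets kept as posted), and node_of, same_node and host_part are hypothetical helper names.

```python
# Address pairs as stated in this post; the leading octets are masked in the thread.
NODE_ADDRESSES = {
    "master": {"X.X.X.69", "X.X.X.70"},
    "slave": {"X.X.X.68", "X.X.X.167"},
}


def host_part(endpoint):
    """Drop a trailing ':port' from an endpoint such as 'X.X.X.69:20000'."""
    return endpoint.rsplit(":", 1)[0] if ":" in endpoint else endpoint


def node_of(address):
    """Return the node name that owns the address, or None if it is unknown."""
    for node, addresses in NODE_ADDRESSES.items():
        if address in addresses:
            return node
    return None


def same_node(first, second):
    """True when both endpoints belong to the same known node."""
    node = node_of(host_part(first))
    return node is not None and node == node_of(host_part(second))


# Endpoints quoted from the thormaster and thorslave logs earlier in the thread.
print(same_node("X.X.X.69:20000", "X.X.X.70:20000"))   # master start vs. master seen by slave
print(same_node("X.X.X.167:20100", "X.X.X.68:20100"))  # slave seen by master vs. slave's own address
```

If both checks report the same node, the differing addresses in the two logs refer to the same machines seen through different interfaces.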

Thanks,
Lakshman Naresh
lakshmannaresh
 

Tue Feb 17, 2015 2:48 pm

Hi,

Can you send the output of

ifconfig

on all 3 machines? This info will help determine which interface to use on each of them.

thanks,
mark
mkellyhpcc
 

Tue Feb 17, 2015 2:58 pm

Also, could you please run this command on .70 and .167 (the thormaster and thorslave) and post the output? I'm assuming that X.X.X.167 is the IP of your dali node, according to the XML you gave us earlier.

Code:
sudo /opt/HPCCSystems/bin/daliadmin X.X.X.167 dfsgroup mythor
mgardner
 
