THOR Log & Information THOR component keeps stopping
Hi There,
I am having some problems with one of my clusters. It is the only one we have running on:
Ubuntu 18.04
HPCC Community 7.4.8-1
Every couple of days the THOR service stops and I have to run
sudo service hpcc-init -c mythor stop
sudo service hpcc-init -c mythor start
Sometimes I need to run these commands many times before THOR starts and stays started.
I am trying to find out why this might be happening.
In ECL Watch I am only getting errors like:
Source Severity Code Message FileName LineNo Column id
eclagent Error 0 Abort: 0: Workunit abort request received 0 0 0
eclagent Warning 0 Abort takes precedence over error: 0: Query W20200107-135137 cancelled (1) (in item 10) 0 0 1
eclagent Info 0 PERSIST('~XXX::special::XXXidentdedup3') is up to date 0 0 2
I am looking for more detailed information to see why.
I have had a look in these directories & log files but can’t see anything that helps.
/var/log/HPCCSystems/mythor
/var/log/HPCCSystems/hpcc-init.log
/var/log/HPCCSystems/cluster
I have also tried to see what is being written to the syslog:
sudo cat /var/log/syslog |tail
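Since tail only prints the last ten lines by default, a filtered search may surface more. Below is a minimal sketch, assuming the stops leave some trace in syslog. The function name scan_syslog is hypothetical, and the search patterns (Thor-related text and the kernel's out-of-memory messages) are guesses at what might be relevant, not known HPCC log messages.

```shell
# Hypothetical helper: filter a syslog-style file for lines that mention Thor
# or the kernel out-of-memory killer. The patterns are assumptions about what
# might be relevant here, not messages HPCC is known to emit.
scan_syslog() {
    logfile="${1:-/var/log/syslog}"
    # -i: case-insensitive, -E: extended regex; keep the 50 most recent hits
    grep -iE 'thor|out of memory|oom-killer|killed process' "$logfile" | tail -n 50
}
```

Run it as `scan_syslog` for the default file, or pass a rotated log such as `scan_syslog /var/log/syslog.1`. Reading /var/log/syslog may need sudo, depending on group membership.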
Can you help point me in the right direction to get more detailed information?
Thanks in advance.
- amillar
- Posts: 34
- Joined: Fri Oct 16, 2015 7:32 am
amillar,
This is something you should report in JIRA. That will get it directly to the attention of the developers.
HTH,
Richard
- rtaylor
- Community Advisory Board Member
- Posts: 1619
- Joined: Wed Oct 26, 2011 7:40 pm
Would you please check for cores in /var/lib/HPCCSystems/<name of your thor>?
Also, would you post the contents of /var/log/HPCCSystems/<name of your thor>/init_thorXXXX
and the thormaster.log from when the thor is going down?
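A rough sketch for gathering those items in one pass. The function name collect_thor_diagnostics is hypothetical, and the file-name globs (core*, init_*, thormaster*) are assumptions based on the names mentioned above, so adjust them to what is actually on disk.

```shell
# Sketch: list core files and the newest init/thormaster logs for one Thor.
# The default directories follow the paths quoted in this thread; the glob
# patterns are assumptions about how the files are named.
collect_thor_diagnostics() {
    thor_name="${1:-mythor}"
    lib_dir="${2:-/var/lib/HPCCSystems}"
    log_dir="${3:-/var/log/HPCCSystems}"

    echo "== core files in $lib_dir/$thor_name =="
    # ls -l shows size and date, so an empty or stale core is easy to spot
    find "$lib_dir/$thor_name" -maxdepth 1 -type f -name 'core*' -exec ls -l {} +

    echo "== newest init and thormaster logs in $log_dir/$thor_name =="
    # ls -t sorts newest first; patterns that match nothing are silenced
    ls -t "$log_dir/$thor_name"/init_* "$log_dir/$thor_name"/thormaster* 2>/dev/null | head -n 5
}
```

For example, `collect_thor_diagnostics mythor` uses the default directories.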
thanks
-F
- fernando
- Posts: 6
- Joined: Thu Jun 19, 2014 1:29 pm
Hi Fernando,
Thanks for getting back to me.
We have been having problems over the last 24 hours, so while I was waiting I upgraded the platform from 7.4.8-1 to 7.6.16-1 to give it a try. I was still experiencing the same problem: THOR starts and then stops.
I have had a look in /var/lib/HPCCSystems/mythor and there is a file named core. It is dated 15th Aug 19 and is 0 bytes. Is that to be expected?
I have also looked in /var/log/HPCCSystems/mythor. Initially the issue seemed to be that the slaves failed to initialise:
8379 2020_01_03_16_09_59: Starting mythor
8379 2020_01_03_16_09_59: removing any previous sentinel file
8379 2020_01_03_16_09_59: Ensuring a clean working environment ...
8379 2020_01_03_16_09_59: Killing slaves
8379 2020_01_03_16_09_59: --------------------------
8379 2020_01_03_16_09_59: starting thorslaves ...
8379 2020_01_03_16_10_02: thormaster cmd : /var/lib/HPCCSystems/mythor/thormaster_mythor MASTER=192.168.20.35:20000
8379 2020_01_03_16_10_02: thormaster_lcr process started pid = 9577
8379 2020_01_03_16_10_05: Thormaster (9577) Slaves failed to initialize
8379 2020_01_03_16_10_05: Shutting down
8379 2020_01_03_16_10_05: Stopping mythor
8379 2020_01_03_16_10_05: mythor Stopped
8379 2020_01_03_16_10_05: Killing slaves
8379 2020_01_03_16_10_07: Frunssh successful
8379 2020_01_03_16_10_07: removing init.pid file and slaves file
However, after stopping the processes running under the HPCC user and closing the open ports on the other nodes, I did get the platform to start.
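For anyone wanting to see how often this has happened, here is a minimal sketch that scans init logs for the failure line shown in the excerpt above. The function name report_failed_starts is hypothetical, and it assumes every line follows the "<pid> <timestamp>: <message>" layout seen in that excerpt.

```shell
# Sketch: report each failed start recorded in one or more Thor init logs.
# Assumes the "<pid> <timestamp>: <message>" layout from the excerpt above.
report_failed_starts() {
    awk '/Slaves failed to initialize/ {
        ts = $2
        sub(/:$/, "", ts)               # drop the trailing colon from the timestamp
        print "failed start at " ts
        n++
    }
    END { printf "%d failed start(s) found\n", n }' "$@"
}
```

Pass it one or more init log files from /var/log/HPCCSystems/mythor. It only counts the "Slaves failed to initialize" message, so other causes of a stop will not show up.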
So far everything seems to be stable.
Thanks for your help.
Antony
- amillar
- Posts: 34
- Joined: Fri Oct 16, 2015 7:32 am