

THOR Log & Information THOR component keeps stopping


Wed Jan 08, 2020 2:47 pm

Hi There,

I am having some problems with one of my clusters. It is the only one we have running on:

Ubuntu 18.04
HPCC Community 7.4.8-1

Every couple of days the THOR service stops and I have to run

sudo service hpcc-init -c mythor stop
sudo service hpcc-init -c mythor start

Sometimes I need to run this command many times for THOR to start and stay started.

I am trying to find out why this might be happening.

In ECL Watch I am only getting errors like:

Source   | Severity | Code | Message | FileName | LineNo | Column | id
eclagent | Error    | 0    | Abort: 0: Workunit abort request received | | 0 | 0 | 0
eclagent | Warning  | 0    | Abort takes precedence over error: 0: Query W20200107-135137 cancelled (1) (in item 10) | | 0 | 0 | 1
eclagent | Info     | 0    | PERSIST('~XXX::special::XXXidentdedup3') is up to date | | 0 | 0 | 2

I am looking for more detailed information to see why.

I have had a look in these directories & log files but can’t see anything that helps.

/var/log/HPCCSystems/mythor

/var/log/HPCCSystems/hpcc-init.log

/var/log/HPCCSystems/cluster

I have also tried to see what is being written to the syslog:

sudo cat /var/log/syslog |tail

Can you help point me in the right direction to get more detailed information?

Thanks in advance.
amillar
 
Posts: 22
Joined: Fri Oct 16, 2015 7:32 am

Wed Jan 08, 2020 4:45 pm

amillar,

This is something you should report in JIRA. That will get it directly to the attention of the developers.

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1508
Joined: Wed Oct 26, 2011 7:40 pm

Wed Jan 08, 2020 6:52 pm

Would you please check for core files in /var/lib/HPCCSystems/<name of your thor>?

Also, would you post the contents of /var/log/HPCCSystems/<name of your thor>/init_thorXXXX and the thormaster.log from when the thor is going down?
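
If it saves some typing, here is a rough shell sketch for gathering those in one go. It is only an illustration: the function names are made up for this post, and the `init_*` and `thormaster*.log` globs are assumptions about how the files are named on your install, so adjust them to whatever you actually see in the directories.

```shell
#!/bin/sh
# Sketch only: helper names and filename globs are assumptions, not HPCC tooling.

# Print every non-empty file whose name starts with "core" in directory $1.
# Zero-byte cores are skipped because they carry nothing to analyse.
list_nonempty_cores() {
    find "$1" -maxdepth 1 -type f -name 'core*' -size +0c -print
}

# Print the last $3 lines of each file in directory $1 matching glob $2.
# $2 is left unquoted on purpose so the shell expands the pattern.
tail_logs() {
    for f in "$1"/$2; do
        [ -f "$f" ] || continue      # skip the unexpanded pattern when nothing matches
        echo "== $f"
        tail -n "$3" "$f"
    done
}
```

For a thor named mythor that would be something like `list_nonempty_cores /var/lib/HPCCSystems/mythor`, then `tail_logs /var/log/HPCCSystems/mythor 'init_*' 50` and `tail_logs /var/log/HPCCSystems/mythor 'thormaster*.log' 200`.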




thanks

-F
fernando
 
Posts: 6
Joined: Thu Jun 19, 2014 1:29 pm

Fri Jan 10, 2020 10:47 am

Hi Fernando,

Thanks for getting back to me.

We have been having problems over the last 24 hours, so while I was waiting I upgraded the platform from 7.4.8-1 to 7.6.16-1 to give it a try. I was still experiencing the same problem: THOR starts and then stops.

I have had a look in /var/lib/HPCCSystems/mythor and there is a file named core. It is dated 15th Aug 19 and is 0 bytes. Is that to be expected?

I have also looked in /var/log/HPCCSystems/mythor. Initially the issue seemed to be that the slaves failed to initialise:

8379 2020_01_03_16_09_59: Starting mythor
8379 2020_01_03_16_09_59: removing any previous sentinel file
8379 2020_01_03_16_09_59: Ensuring a clean working environment ...
8379 2020_01_03_16_09_59: Killing slaves
8379 2020_01_03_16_09_59: --------------------------
8379 2020_01_03_16_09_59: starting thorslaves ...
8379 2020_01_03_16_10_02: thormaster cmd : /var/lib/HPCCSystems/mythor/thormaster_mythor MASTER=192.168.20.35:20000
8379 2020_01_03_16_10_02: thormaster_lcr process started pid = 9577
8379 2020_01_03_16_10_05: Thormaster (9577) Slaves failed to initialize
8379 2020_01_03_16_10_05: Shutting down
8379 2020_01_03_16_10_05: Stopping mythor
8379 2020_01_03_16_10_05: mythor Stopped
8379 2020_01_03_16_10_05: Killing slaves
8379 2020_01_03_16_10_07: Frunssh successful
8379 2020_01_03_16_10_07: removing init.pid file and slaves file

However, after stopping the PIDs running under the HPCC user and closing open ports on the other nodes, I did get the platform to start.
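
For anyone who lands here with the same symptom, a rough sketch of that kind of check follows. It is not the exact set of commands I ran. It assumes the platform user is called hpcc, the function names are hypothetical, and port 20000 is simply the master port that appears in the log above. It only lists things and does not kill anything.

```shell
#!/bin/sh
# Sketch only: user name, port and helper names are assumptions.

# List PID and command line of every process owned by user $1.
list_user_procs() {
    pgrep -u "$1" -a
}

# Read `ss -ltn` style output on stdin and print the rows whose
# local address (4th column) ends in port $1.
filter_port() {
    awk -v port="$1" '{ n = split($4, a, ":"); if (a[n] == port) print }'
}
```

On each node, `list_user_procs hpcc` shows what is still running under that user after a stop, and `ss -ltn | filter_port 20000` shows whether something is still listening on that port.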

So far everything seems to be stable.

Thanks for your help.

Antony
amillar
 
Posts: 22
Joined: Fri Oct 16, 2015 7:32 am

