Wed Oct 20, 2021 7:07 pm


Thor Slave won't start

Post questions specific to installation or configuration for the HPCC Systems platform

Fri Jul 23, 2021 4:56 pm

Hello!

I've been trying to install a new version of HPCCSystems Platform on Ubuntu 20.x, but I'm facing an issue where the Thor Slave just won't start.
I tried HPCCSystems Platform 8.2, 8.0 and 7.12 (latest for each) on Ubuntu 20.04 and 20.10 but I get the same behavior.
Every time I simply do:
Code: Select all
dpkg -i hpccsystems-platform....
apt install -f
systemctl start hpccsystems-platform.service

When I run the preflight certification, it shows that the Thor Slave is not ready.
Running a simple ps auxwww | grep hpcc, I get the following:
Code: Select all
hpcc       50730  0.0  0.0 130820  6532 ?        Ssl  16:20   0:00 /opt/HPCCSystems/bin/dafilesrv -L /var/log/HPCCSystems -I mydafilesrv -D
hpcc       50745  0.0  0.0 577052  9004 ?        Ssl  16:20   0:00 /opt/HPCCSystems/bin/eclccserver --daemon myeclccserver
hpcc       50748  0.0  0.0 355816  8732 ?        Ssl  16:20   0:00 /opt/HPCCSystems/bin/agentexec --daemon myeclagent
hpcc       50754  0.0  0.2 2108300 38356 ?       Ssl  16:20   0:00 /opt/HPCCSystems/bin/daserver --daemon mydali
hpcc       50755  0.0  0.1 540264 19944 ?        Ssl  16:20   0:00 /opt/HPCCSystems/bin/dfuserver --daemon mydfuserver
hpcc       50757  0.0  0.2 2279816 39692 ?       Ssl  16:20   0:00 /opt/HPCCSystems/bin/roxie --topology=RoxieTopology.xml --logfile --restarts=2 --stdlog=0 --daemon myroxie
hpcc       50760  0.0  0.0 536128  8444 ?        Ssl  16:20   0:00 /opt/HPCCSystems/bin/eclscheduler --daemon myeclscheduler
hpcc       50769  0.0  0.0  86972  3352 ?        Ssl  16:20   0:00 /opt/HPCCSystems/bin/toposerver --daemon mytoposerver
hpcc       50775  0.0  0.2 993604 46228 ?        Ssl  16:20   0:00 /opt/HPCCSystems/bin/esp --daemon myesp
hpcc       51046  0.0  0.1 4292344 22896 ?       Ssl  16:20   0:00 /opt/HPCCSystems/bin/thormaster_lcr --daemon mythor MASTER=172.32.5.210:20000


The content of /var/log/HPCCSystems/mythor/thorslaves-launch.debug looks like this:
Code: Select all
+ [[ -z mythor ]]
+ [[ -z start ]]
++ pwd
+ cwd=/var/lib/HPCCSystems/mythor
+ [[ /var/lib/HPCCSystems/mythor != \/\v\a\r\/\l\i\b\/\H\P\C\C\S\y\s\t\e\m\s\/\m\y\t\h\o\r ]]
+ source mythor.cfg
++ PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:/opt/HPCCSystems/bin:/opt/HPCCSystems/sbin:/var/lib/HPCCSystems/mythor
++ THORNAME=mythor
++ THORMASTER=172.32.5.210
++ THORMASTERPORT=20000
++ THORSLAVEPORT=20100
++ localthorportinc=20
++ slavespernode=1
++ channelsperslave=1
++ DALISERVER=172.32.5.210:7070
++ localthor=true
++ breakoutlimit=3600
++ refreshrate=3
++ autoSwapNode=false
++ SSHidentityfile=/home/hpcc/.ssh/id_rsa
++ SSHusername=hpcc
++ SSHpassword=
++ SSHtimeout=0
++ SSHretries=3
++ SSHsudomount=
+ slaveIps=($(/opt/HPCCSystems/bin/daliadmin server=$DALISERVER clusternodes ${THORNAME} slaves timeout=2 1>/dev/null 2>&1; uniq slaves))
++ /opt/HPCCSystems/bin/daliadmin server=172.32.5.210:7070 clusternodes mythor slaves timeout=2
++ uniq slaves
+ [[ -z 172.32.5.210 ]]
+ [[ -z 172.32.5.210 ]]
+ numOfNodes=1
+ (( i=0 ))
+ (( i<1 ))
+ (( c=0 ))
+ (( c<1 ))
+ __slavePort=20100
+ __slaveNum=1
+ ssh -o LogLevel=QUIET -o StrictHostKeyChecking=no -o BatchMode=yes -i /home/hpcc/.ssh/id_rsa hpcc@172.32.5.210 '/bin/bash -c '\''/opt/HPCCSystems/sbin/thorslaves-exec.sh start thorslave_mythor_1 20100 1 mythor 172.32.5.210 20000'\'''
(...)
+ exit 0

(I had to remove some lines from it to be able to submit this post.)

I can manually run the ssh command from above, or even run thorslaves-exec.sh directly (with all the right values), but nothing shows up (no errors, no output). I also ran the command that thorslaves-exec.sh runs, systemctl start thorslave@thorslave_mythor_1.service, and here is its status:
Code: Select all
● thorslave@thorslave_mythor_1.service - thorslave_mythor_1
     Loaded: loaded (/etc/systemd/system/thorslave@.service; static)
     Active: failed (Result: exit-code) since Fri 2021-07-23 16:30:48 UTC; 10min ago
    Process: 53104 ExecStart=/opt/HPCCSystems/bin/thorslave_lcr --daemon thorslave_mythor_1 master=${THORMASTER}:${THORMASTERPORT} slave=.:${SLAVEPORT} slavenum=${SLAVENUM} logDir=/var/log/HPCCSystems/${THORNAME} (code=exited, status=1/FAILURE)
   Main PID: 53104 (code=exited, status=1/FAILURE)

Jul 23 16:30:48 ip-172-32-5-210 systemd[1]: Started thorslave_mythor_1.
Jul 23 16:30:48 ip-172-32-5-210 systemd[1]: thorslave@thorslave_mythor_1.service: Main process exited, code=exited, status=1/FAILURE
Jul 23 16:30:48 ip-172-32-5-210 systemd[1]: thorslave@thorslave_mythor_1.service: Failed with result 'exit-code'.


Any idea why the slave would not start? Any idea how I could get more logs here to understand what's going on?
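
On the question of getting more logs, one possible starting point is sketched below. It assumes the unit name from the status output above and the logDir value from the ExecStart line; collect_slave_diagnostics is a hypothetical helper name, not something shipped with HPCC Systems. It only gathers what systemd and the log directory already hold, so it may well show nothing new if the slave exits before writing anything.

```shell
#!/usr/bin/env bash
# Sketch only: gather what systemd and the log directory know about one
# Thor slave unit. collect_slave_diagnostics is a hypothetical helper name.
collect_slave_diagnostics() {
    local unit="$1" log_dir="$2"

    echo "== unit definition: $unit =="
    if command -v systemctl >/dev/null 2>&1; then
        # Prints the unit file and any drop-ins, which is where variables
        # such as ${THORMASTER} would be set or sourced.
        systemctl cat "$unit" 2>&1 || true
    else
        echo "systemctl not available"
    fi

    echo "== journal: $unit =="
    if command -v journalctl >/dev/null 2>&1; then
        # Last 50 journal entries for this unit, without a pager.
        journalctl -u "$unit" --no-pager -n 50 2>&1 || true
    else
        echo "journalctl not available"
    fi

    echo "== newest files in $log_dir =="
    # Most recently modified files first; a fresh slave log would be on top.
    ls -lt "$log_dir" 2>&1 | head -n 10
}
```

With the values from this post, the call would be: collect_slave_diagnostics thorslave@thorslave_mythor_1.service /var/log/HPCCSystems/mythor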

Thanks!
lpezet
 
Posts: 75
Joined: Wed Sep 10, 2014 3:14 am

Fri Jul 23, 2021 7:14 pm

I've now gone all the way down to HPCCSystems 7.8 on Ubuntu 20.04 and I'm still getting the same behavior WHEN USING systemctl (as mentioned in the doc: https://cdn.hpccsystems.com/releases/CE ... .2.2-1.pdf).
Then I went back to HPCCSystems 8.2/Ubuntu 20.04, but this time using the old-school /etc/init.d/hpcc-init start, and it worked!
Here are the processes I get for hpcc user:
Code: Select all
hpcc       25704  0.0  0.0   9672  4372 pts/0    S    19:06   0:00 /bin/bash /opt/HPCCSystems/bin/init_dafilesrv
hpcc       25743  0.0  0.1 138536 16424 pts/0    Sl   19:06   0:00 dafilesrv -L /var/log/HPCCSystems -I mydafilesrv
hpcc       25887  0.0  0.0   9672  4388 pts/0    S    19:06   0:00 /bin/bash /opt/HPCCSystems/bin/init_dali
hpcc       25924  0.0  0.2 764380 47800 pts/0    Sl   19:06   0:00 daserver
hpcc       26085  0.0  0.0   9672  4392 pts/0    S    19:06   0:00 /bin/bash /opt/HPCCSystems/bin/init_dfuserver
hpcc       26122  0.0  0.1 604864 24360 pts/0    Sl   19:06   0:00 dfuserver
hpcc       26283  0.0  0.0   9672  4464 pts/0    S    19:06   0:00 /bin/bash /opt/HPCCSystems/bin/init_eclagent
hpcc       26323  0.0  0.0 421156 14400 pts/0    Sl   19:06   0:00 agentexec
hpcc       26472  0.0  0.0   9672  4432 pts/0    S    19:06   0:00 /bin/bash /opt/HPCCSystems/bin/init_eclccserver
hpcc       26509  0.0  0.0 576856 14688 pts/0    Sl   19:06   0:00 eclccserver
hpcc       26674  0.0  0.0   9672  4444 pts/0    S    19:06   0:00 /bin/bash /opt/HPCCSystems/bin/init_eclscheduler
hpcc       26711  0.0  0.0 601496 14356 pts/0    Sl   19:06   0:00 eclscheduler
hpcc       26866  0.0  0.0   9672  4228 pts/0    S    19:06   0:00 /bin/bash /opt/HPCCSystems/bin/init_esp
hpcc       26903  0.0  0.3 756612 57512 pts/0    Sl   19:06   0:00 esp snmpid=26866
hpcc       27392  0.0  0.0   9672  4292 pts/0    S    19:06   0:00 /bin/bash /opt/HPCCSystems/bin/init_roxie
hpcc       27434  0.0  0.2 1771348 44128 pts/0   Sl   19:06   0:00 roxie --topology=RoxieTopology.xml --logfile --restarts=0 --stdlog=0
hpcc       27608  0.0  0.0   9672  4340 pts/0    S    19:06   0:00 /bin/bash /opt/HPCCSystems/bin/init_sasha
hpcc       27645  0.0  0.0 617888 14940 pts/0    Sl   19:06   0:00 saserver
hpcc       27807  0.0  0.0   9672  4396 pts/0    S    19:06   0:00 /bin/bash /opt/HPCCSystems/bin/init_thor
hpcc       27952  0.0  0.1 8577428 27952 pts/0   Sl   19:06   0:00 ./thorslave_mythor --master=172.32.5.233:20000 --slave=.:20100 --slavenum=1 --slaveprocessnum=0 --logDir=/var/log/HPCCSystems/mythor
hpcc       27957  0.0  0.1 4701252 28336 pts/0   Sl   19:06   0:00 /var/lib/HPCCSystems/mythor/thormaster_mythor --master=172.32.5.233:20000
hpcc       28135  0.0  0.0   9672  4364 pts/0    S    19:06   0:00 /bin/bash /opt/HPCCSystems/bin/init_toposerver
hpcc       28172  0.0  0.0  87080  8876 pts/0    Sl   19:06   0:00 toposerver


Preflight/certification is all good too.
Why, oh why?
lpezet
 
Posts: 75
Joined: Wed Sep 10, 2014 3:14 am

Fri Jul 23, 2021 10:42 pm

lpezet, thanks for bringing this up. I've opened a Jira ticket and I'll be investigating the issue: https://track.hpccsystems.com/browse/HPCC-26258

Reading the info you provided, it looks like there isn't actually an issue with the ssh call going through. Is the error in the thorslaves-exec.sh script?
mgardner
 
Posts: 16
Joined: Tue Jan 20, 2015 9:30 pm

Mon Jul 26, 2021 6:28 am

Hello!

I would say thorslaves-exec.sh runs fine, and the problem is something with /opt/HPCCSystems/bin/thorslave_lcr. When I run /opt/HPCCSystems/bin/thorslave_lcr manually, I can't get much from it (besides its usage message if I don't pass the right parameters): the exit code is always 0 and there is no standard or error output.
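
For anyone wanting to reproduce the manual run, here is a minimal sketch of a wrapper that records the exit status, stdout and stderr of a single command in one place. run_and_report is a hypothetical helper name, not part of the platform. One caveat, stated as an assumption: if --daemon makes the process detach, the status reported here would be that of the launching process only, which might explain seeing 0 here while systemd reports status=1.

```shell
#!/usr/bin/env bash
# Sketch only: run one command and print its exit status, stdout and
# stderr separately. run_and_report is a hypothetical helper name.
run_and_report() {
    local out_file err_file status=0
    out_file=$(mktemp)
    err_file=$(mktemp)

    # Keep the two streams apart so an empty stderr is visibly empty.
    "$@" >"$out_file" 2>"$err_file" || status=$?

    echo "exit status: $status"
    echo "--- stdout ---"
    cat "$out_file"
    echo "--- stderr ---"
    cat "$err_file"

    rm -f "$out_file" "$err_file"
}
```

Filling the ExecStart variables with the values from the launch trace, the call would be: run_and_report /opt/HPCCSystems/bin/thorslave_lcr --daemon thorslave_mythor_1 master=172.32.5.210:20000 slave=.:20100 slavenum=1 logDir=/var/log/HPCCSystems/mythor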
lpezet
 
Posts: 75
Joined: Wed Sep 10, 2014 3:14 am

