Wed Jul 08, 2020 10:57 am
HPCC System Performance Monitoring & Nagios


Fri Feb 07, 2020 4:05 pm

Hi There,

I am looking to get more performance out of my HPCC cluster and am trying to find out where the performance issues are coming from: CPU, memory or disk.

Our THOR Cluster set-up is :

10 x HP BL460c G7 Blade machines - 2 x 6 Core CPU (Xeon X5650 2.67GHz), 48GB MEM, 480GB SSD (mirrored)

HPCC V6.4.2-1 running on Ubuntu 14.04.02
1 Machine is the THOR master, ECL server etc.
1 Machine is configured as a Spare.

Our environment uses the default memory setting (75%), and each physical node is running 6 slaves (slavesPerNode="6").
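
For reference, here is a rough sketch of the per-slave memory budget those settings imply. It assumes the 75% default applies to the node's 48GB and is divided evenly between the 6 slaves; the even split is an assumption for illustration, and the function name is made up.

```shell
#!/bin/sh
# Rough per-slave memory budget in MB.
# ASSUMPTION: the memory percentage is split evenly across the slaves
# on one node. per_slave_mb is a hypothetical helper, not an HPCC tool.
per_slave_mb() {
    total_mb=$1   # physical memory on the node, in MB
    pct=$2        # percentage of memory given to Thor
    slaves=$3     # value of slavesPerNode
    echo $(( total_mb * pct / 100 / slaves ))
}

# 48GB node (49152 MB), 75% default, 6 slaves per node
per_slave_mb 49152 75 6   # prints 6144
```

Under those assumptions each slave would have about 6GB to work with, so a single slave could spill while the node as a whole still shows free memory.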

I am looking for tools that will help me see where the potential slowdowns are. Our Data Science team has worked through the ECL code over the last 6 months or so to make it as efficient as possible.

I have been using ECL Watch to look at the graphs and I notice a lot of “spills”, which to me says we are running out of memory. However, when I run tools such as htop and iostat I don’t see the memory being exhausted, and the CPU load is normally pretty low, with the odd momentary peak at around 50% when workunits are running.

I have tried to follow the Nagios install instructions in HPCC Monitoring and Reporting 6.4, but I cannot seem to execute all of the commands.

For example:

Generate a host groups configuration for Nagios.

/opt/HPCCSystem/bin/hpcc-nagios-tools -env \
/etc/HPCCSystems/environment.xml -g -out /etc/nagios3/config.d/hpcc_hostgroups.cfg

Generate a services configuration file.

/opt/HPCCSystem/bin/hpcc-nagios-tools -env \
/etc/HPCCSystems/environment.xml -s -out /etc/nagios3/config.d/hpcc_services.cfg

Generate an escalation notifications file.

./hpcc-nagios-tools -ec -env /etc/HPCCSystems/environment.xml \
-enable_host_notify -enable_service_notify -set_url localhost/nagios3 \
-disable_check_all_disks -out /etc/nagios3/conf.d/hpcc_notifications.cfg

I have since installed Nagios Core on another server and installed NRPE so I can monitor CPU, disk, memory, swap and CPU load on the remote hosts. I have been struggling to get all of this working over the last few days, and the free memory and CPU stats are still not working.
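
To make the free memory check concrete, here is a minimal sketch of the kind of script NRPE could run on each node. It is not part of HPCC or the Nagios plugin set; the function name and default thresholds are assumptions. It follows the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and falls back to MemFree + Buffers + Cached on older kernels that do not report MemAvailable.

```shell
#!/bin/sh
# Hypothetical NRPE-style free-memory check (illustration only).
# Usage: check_free_mem [meminfo_path] [warn_pct] [crit_pct]
check_free_mem() {
    meminfo=${1:-/proc/meminfo}
    warn_pct=${2:-20}   # warn when available memory is at or below this %
    crit_pct=${3:-10}   # critical when at or below this %

    [ -r "$meminfo" ] || { echo "MEM UNKNOWN - cannot read $meminfo"; return 3; }

    total=$(awk '/^MemTotal:/ {print $2}' "$meminfo")
    avail=$(awk '/^MemAvailable:/ {print $2}' "$meminfo")
    if [ -z "$avail" ]; then
        # Older kernels: approximate available memory instead
        avail=$(awk '/^(MemFree|Buffers|Cached):/ {s += $2} END {print s + 0}' "$meminfo")
    fi
    [ -n "$total" ] && [ "$total" -gt 0 ] || { echo "MEM UNKNOWN - no MemTotal"; return 3; }

    pct=$(( avail * 100 / total ))
    msg="${pct}% available | avail_pct=${pct}%"
    if [ "$pct" -le "$crit_pct" ]; then echo "MEM CRITICAL - $msg"; return 2; fi
    if [ "$pct" -le "$warn_pct" ]; then echo "MEM WARNING - $msg"; return 1; fi
    echo "MEM OK - $msg"
    return 0
}
```

Saved as a script that ends with `check_free_mem "$@"; exit $?`, this could be registered in nrpe.cfg with a `command[check_free_mem]=...` line pointing at wherever the script is installed.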

I am really keen to see the Dali, DFU, ECL Agent, CC, Scheduler stats within Nagios.

I am trying to create a dashboard of my system so I can see the trends over time, spot contention points and act on them.
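
As a stopgap for trend data, something like the sketch below could be run from cron on each node and appended to a CSV file. The function name and the choice of columns (UTC timestamp, 1-minute load average, MemFree in kB) are only assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical sampler: prints one CSV line of
#   timestamp,1-minute load average,MemFree in kB
# Usage: sample_line [loadavg_path] [meminfo_path]
sample_line() {
    loadavg=${1:-/proc/loadavg}
    meminfo=${2:-/proc/meminfo}
    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    load1=$(cut -d' ' -f1 "$loadavg")                    # first field is the 1-minute average
    free_kb=$(awk '/^MemFree:/ {print $2}' "$meminfo")
    echo "$ts,$load1,$free_kb"
}
```

For example, `sample_line >> /var/tmp/node_trend.csv` once a minute would build up a history that can be graphed later (the output path is just a placeholder).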

I was wondering if someone can help with my Nagios set-up or whether I should be looking into another solution?

I am happy to share my environment config, as there may be areas that need tweaking, setting or enabling. I am also happy to supply any tool output if that helps.

Any advice or feedback would be greatly appreciated.


Mon Feb 10, 2020 9:22 pm

Hi Anthony, the Ganglia feature is limited to system health and Roxie-specific metrics. We're currently working on a non-Ganglia mechanism for reporting component health metrics, but I can't provide any timeline on that right now.

There are ways to fetch metrics from our component logs via Filebeat -> Elastic Stack. Let me know if you're interested in going down that road. Thanks.
