Troubleshooting performance on Thor – A Case Study
Recently, a good friend of mine asked if I would take a look at their system. They had doubled the amount of memory in their single-node, multithreaded HPCC Systems Thor installation, believing that the additional memory would help speed up a particularly time-consuming process, but it made no significant difference.
In general, additional memory can help in a number of ways:
- It can serve as file system cache, speeding up repeated access to files.
- It can reduce the use of block devices for swap space when the system is under memory pressure.
- When made available to Thor through the environment configuration, it can minimize spills to disk during activities that demand a lot of memory.
In this particular case, the extra memory effectively doubled the system's RAM, and the environment configuration was adjusted as recommended in the HPCC Systems installation and tuning guide. Even so, the performance of the workunits in question barely improved. Puzzled, my friend dropped to the Linux command line and saw that roughly 50% of the RAM was still free, which concerned him: the additional memory seemingly had no impact and was going to waste.
A side note on Linux memory allocation
When a process needs memory on the heap, it calls malloc() or one of its relatives, which in turn requests address space from the Linux kernel. The kernel earmarks the requested memory but usually doesn’t provision it until the requesting process actually writes to it, and even then it provisions only the pages that were touched, not the entire requested range.
Merely requesting memory through malloc() doesn’t force the kernel to provision it on the spot. This behavior can be changed through system settings (the vm.overcommit_memory sysctl, for example) to prevent oversubscription, if necessary.
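The lazy provisioning described above can be observed from user space. The following is a minimal sketch, assuming a Linux system with /proc mounted. It uses an anonymous private mmap as a stand-in for a large malloc() request (large malloc() requests are typically served the same way), and the helper names rss_bytes and lazy_allocation_demo are made up for this illustration.

```python
import mmap
import os

PAGE = os.sysconf("SC_PAGE_SIZE")  # page size in bytes, typically 4096


def rss_bytes() -> int:
    """Resident set size of this process, from /proc/self/statm (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])
    return resident_pages * PAGE


def lazy_allocation_demo(size: int = 64 * 1024 * 1024):
    """Return RSS growth in bytes (after mapping, after touching every page)."""
    before = rss_bytes()
    # Reserve address space only; the kernel has not backed it with RAM yet.
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    after_map = rss_bytes()
    # Writing one byte per page forces the kernel to provision each page.
    for offset in range(0, size, PAGE):
        buf[offset] = 1
    after_touch = rss_bytes()
    buf.close()
    return after_map - before, after_touch - before


if __name__ == "__main__":
    mapped, touched = lazy_allocation_demo()
    print(f"RSS growth after mapping:  {mapped / 2**20:.1f} MiB")
    print(f"RSS growth after touching: {touched / 2**20:.1f} MiB")
```

On a typical Linux system the first number stays close to zero while the second approaches the size of the mapping, which is why a process can "have" far more memory than the system reports as used.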
Going back to my friend's question
Additional memory only gets put to use when the system is actually under memory pressure. Since virtual memory is composed of RAM plus paging space on a block device, memory pressure usually translates into high IO, which can be observed with iostat (from the sysstat package). In this particular case, the IO subsystem was barely breathing, with no substantial activity (low service times, short service queues, etc.), which indicated that the extra memory wasn’t needed as file system cache, as a virtual memory supplement or to mitigate spills.
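A "free RAM" figure on its own says little, so it helps to look at it next to page cache and swap usage. Here is a small sketch that parses /proc/meminfo-style text. The function names and the SAMPLE figures are invented for illustration and are not measurements from the system in this case study.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field_name: value_in_kB}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])
    return values


def memory_summary(info: dict) -> dict:
    """Condense the fields that matter when judging memory pressure."""
    total = info["MemTotal"]
    return {
        "free_pct": 100.0 * info["MemFree"] / total,
        "page_cache_pct": 100.0 * info["Cached"] / total,
        "available_pct": 100.0 * info["MemAvailable"] / total,
        "swap_used_kb": info["SwapTotal"] - info["SwapFree"],
    }


# Hypothetical figures: half the RAM is free and swap is untouched.
SAMPLE = """\
MemTotal:       131072000 kB
MemFree:         65536000 kB
MemAvailable:   104857600 kB
Cached:          39321600 kB
SwapTotal:        8388608 kB
SwapFree:         8388608 kB
"""

if __name__ == "__main__":
    print(memory_summary(parse_meminfo(SAMPLE)))
    # On a Linux host, the live values can be read the same way:
    # with open("/proc/meminfo") as f:
    #     print(memory_summary(parse_meminfo(f.read())))
```

A large free percentage combined with zero swap in use, as in the sample, suggests the workload is simply not memory bound.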
So, what then was taking all of that time in that workunit?
After drilling down to a long-running subgraph in ECL Watch and inspecting the ECL code, I noticed an embedded Python function that was part of a PROJECT/TRANSFORM.
It was synchronously inserting every record of a relatively large dataset (though very small relative to the total amount of RAM in the system) into a remote, slow ELK stack. This function accounted for roughly 95% of the total workunit runtime. Even though it bundled 100 records per request, the system was negotiating millions of TLS sessions over the course of this activity.
The solution in this case
Push these inserts into an asynchronous message queue and/or, as a stopgap, increase the size of the bundles to reduce the number of TLS session negotiations.
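To make the idea concrete, here is a minimal sketch of decoupling the producer from the slow sink while also using larger bundles. It uses an in-process queue and a worker thread as a stand-in for a real message queue, which is an assumption made for this illustration. BatchingSender and the send_batch callable are hypothetical names, and error handling and retries are deliberately left out.

```python
import queue
import threading
from typing import Callable, List

_STOP = object()  # sentinel that tells the worker to flush and exit


class BatchingSender:
    """Accept records without waiting on the sink; ship them in large bundles."""

    def __init__(self, send_batch: Callable[[List[dict]], None],
                 batch_size: int = 1000, max_pending: int = 10000):
        self._send_batch = send_batch    # hypothetical: one request per bundle
        self._batch_size = batch_size
        self._queue = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, record: dict) -> None:
        # Returns immediately unless max_pending records are already waiting.
        self._queue.put(record)

    def close(self) -> None:
        # Ask the worker to send whatever is left, then wait for it to finish.
        self._queue.put(_STOP)
        self._worker.join()

    def _run(self) -> None:
        batch: List[dict] = []
        while True:
            item = self._queue.get()
            if item is _STOP:
                break
            batch.append(item)
            if len(batch) >= self._batch_size:
                self._send_batch(batch)
                batch = []
        if batch:
            self._send_batch(batch)  # final partial bundle


if __name__ == "__main__":
    sent = []  # stands in for the remote endpoint
    sender = BatchingSender(sent.append, batch_size=1000)
    for i in range(2500):
        sender.submit({"id": i})
    sender.close()
    print([len(bundle) for bundle in sent])  # [1000, 1000, 500]
```

In a real deployment, send_batch would ideally reuse one long-lived connection so that the TLS handshake is paid once rather than per request, and a durable broker would replace the in-process queue so that records survive a crash.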
In more general terms, if you are looking for ways to improve the performance of your HPCC Systems Thor environment, one of the first things to assess is which workunits take the most time.
Using the ECL Watch timings interface, you can drill down into individual subgraphs and activities. Once you identify the activities with the longest execution time, a look at the ECL code can quickly reveal the sections that could be problematic. After all, if you can reasonably speed up something by 10% with a given degree of effort, and that thing represents 90% of the total execution time, you speed up the workunit by 9%. But if you focus on optimizing something that represents only 10% of the total execution time, the same 10% improvement yields a meager 1% overall speedup.
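The arithmetic above is an instance of Amdahl's law and can be written down in a few lines. This is only a sketch with made-up function names. The 95% share comes from the case study, while the tenfold local speedup is a purely hypothetical figure chosen for illustration.

```python
def time_saved_fraction(share: float, local_reduction: float) -> float:
    """Fraction of total runtime saved when a part taking `share` of the
    runtime has its own runtime cut by `local_reduction` (both 0..1)."""
    return share * local_reduction


def overall_speedup(share: float, local_speedup: float) -> float:
    """Amdahl's law: overall speedup when a part taking `share` of the
    runtime becomes `local_speedup` times faster."""
    return 1.0 / ((1.0 - share) + share / local_speedup)


if __name__ == "__main__":
    # The two cases from the text: a 10% improvement on 90% vs. on 10%.
    print(f"{time_saved_fraction(0.90, 0.10):.0%} of total runtime saved")
    print(f"{time_saved_fraction(0.10, 0.10):.0%} of total runtime saved")
    # Hypothetical: the function taking 95% of the runtime gets 10x faster.
    print(f"{overall_speedup(0.95, 10):.1f}x overall speedup")
```

The second function also shows the ceiling: no matter how fast the 95% portion becomes, the overall speedup cannot exceed 20x, because the remaining 5% is untouched.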
Additionally, once you have identified the subgraph responsible for most of the execution time, the corresponding data flow graph can give you more information. For example, it may reveal a significant skew with a long tail that forces a specific node to deal with a disproportionate amount of work (after all, your workunit will only be as fast as your slowest node). If this is the case, a different distribution could work wonders without the need to spend a single dime on hardware.
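One simple way to put a number on skew is to compare each node's share of the records with the average. The sketch below assumes you already have per-node record counts for an activity. The function name and the sample counts are hypothetical, and the formula is a common convention rather than a statement about how ECL Watch computes its own skew figures.

```python
from statistics import mean
from typing import List


def skew_report(rows_per_node: List[int]) -> dict:
    """Express the busiest and idlest node as a percentage deviation
    from the average number of rows per node."""
    avg = mean(rows_per_node)
    busiest = max(rows_per_node)
    return {
        "max_skew_pct": 100.0 * (busiest - avg) / avg,
        "min_skew_pct": 100.0 * (min(rows_per_node) - avg) / avg,
        "busiest_node": rows_per_node.index(busiest),  # 0-based position
    }


if __name__ == "__main__":
    # Hypothetical four-node run where the last node got most of the data.
    print(skew_report([100, 100, 100, 500]))
```

With these sample counts the busiest node holds 150% more rows than the average, so the whole activity waits on it while the other three nodes sit idle.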
Upgrading your hardware is not always the most appropriate optimization route; pursue it only after the diagnostics indicate that it is the most sensible option in a particular case.