HPCC Systems Thor Monitor – Using Workunit Services and Power BI to Monitor Thor Activity

Image - IntroEnterprise Control Language (ECL) Workunit Services standard library functions can be used to capture details about workunits running on Thor including processing time, errors, current state, and more. Capturing these details allows for monitoring, trending, error analysis, degradation, and other data points that can help improve the efficiency of Thor environments.  In this blog, we will look at how to use this information to monitor the system with visualizations in Power BI.

Power BI is a visualization tool that allows the user to connect, model, and explore data with live dashboards and reports. This tool enables the user to publish, collaborate, and share data.

Jessica Skaggs, a consulting Software Engineer at LexisNexis Risk Solutions, introduced her work, “HPCC Systems Thor Monitor – Using Workunit Services and Power BI to Monitor Thor Activity” at HPCC Systems Tech Talk 32. The full recording of Jessica’s Tech Talk is available on YouTube. 

In this blog post we:

  • Look at a demo of the HPCC Systems Thor Monitor in Power BI
  • Discuss Workunit Services standard library functions
  • Examine flow of data from Thor to Power BI

HPCC Systems Thor Monitor Demo

This Thor monitor demonstration illustrates the following:

  • Potential Problem Workunits 
  • Time by Date
  • Graph Timings
  • Errors

Image 1 is a view of the Thor monitor in Power BI. There are multiple pages to the report that can be seen along the bottom of the screen.

This view shows the collective workunits and the information collected about them.

The drop downs on the left allow for filtering of Environment, Cluster, State, Owner, and ECL Version. There are other details that can be filtered, such as Date Range and Job Name (shown at the top of the image).

Data is collected from multiple Thor environments. The Thor environment selected for this demonstration is “Alpha Prod.”

Image 1 – Thor Monitor in Power BI
Image 1 – Thor Monitor in Power BI

Image 2 shows how the “Owner” filter can be used to filter the workunits produced by a specific owner. In this image, the “Owner” filter was used to show workunits for “astephens.” 

Image 2 – “Owner” Filter
Image 2 – “Owner” Filter

 

Potential Problem WUs

Image 3 shows the “Total Time” that the workunits took to run in seconds. 

The “Std Dev Total Time” (standard deviation total time) and the “Average Total Time” are calculated for all workunits with the same “Job Name.”

The “Z Score” is the number of standard deviations away from the average for a workunit’s total runtime. The higher the number, the more time it took for the job to run, compared to average runtime. 

For example, the first workunit took 246 seconds to run (“Total Time). The average time for the 110 workunits (“Num Like WUs”) with the same “Job Name” is 26.73 seconds. The “Total Time” for this workunit is 9.83 standard deviations (“Z Score”) away from the “Average Total Time,” so this workunit took longer than normal.

This gives the user the ability to identify potential issues. If there are numerous high Z Scores, it may point to an issue on the system as a whole. If a specific job has a number of high Z Scores, the issue may be with that particular job. 

The workunits can be filtered for a job with a specific name to see how it has been performing over time.

The table can also be sorted by date, instead of Z-Score, to see if the runtime has been going up, or if the problem workunits are simply anomalies. 

Image 3 – “Total Time”
Image 3 – “Total Time”

Time by Date

In image 4, the data has been filtered to show the total workunit runtimes by date, for date range 3/13/2020 to 3/19/2020. 

The top visual shows the total runtimes in seconds for each day in the date range, with March 17th showing as the busiest. This is for all workunits in this environment.

The bottom visual breaks down the total time to show the “compile time,” “Thor time,” “process time,” and “dfu time.” 

Each visual can be viewed at a year, month, day, or individual workunit level.  

Image 4 – Workunit Runtimes by Date
Image 4 – Workunit Runtimes by Date

Image 5 shows workunit data over an 8 week period. 

Image 5 – Workunit Data over an 8 Week Period
Image 5 – Workunit Data over an 8 Week Period

Image 6 shows month to month workunit runtimes. 

In this view, if the user hovers over the words “Total Time (seconds) by date, a button will appear. This button is used to designate total runtime desired for a specific view. 

Image 6 – Month to Month Workunit Runtimes
Image 6 – Month to Month Workunit Runtimes

Image 7 shows workunits filtered to a specific “Job Name.”

This view has been filtered to job name “CLUE Auto Daily Build.”

Image 7 – Workunits Filtered to a Specific “Job Name”
Image 7 – Workunits Filtered to a Specific “Job Name”

In Image 8, data is filtered to show runtimes by workunit. 

Image 8 – Runtimes by Workunit
Image 8 – Runtimes by Workunit

Graph Timings

The page in image 9 allows the user to compare “Subgraph” timings for a job, over time, to investigate issues. 

The job name filter, “CLUE Auto Daily Build” was retained when changing pages, so this view shows the workunits for that specific job name.

The bar shows the total time for a specific workunit, and each color in the bar shows a “Subgraph” duration. This allows for side-by-side comparison of how subgraph timings have changed.

The off-color bar near the middle was restarted at a different point, so the graphs do not line up.  

Image 9 – Subgraph Timings
Image 9 – Subgraph Timings

In image 10, hovering over a specific area on the graphs allows the user to see information about that subgraph.

Image 10 – Subgraph Information
Image 10 – Subgraph Information

Image 11 shows a filter for graph 14.

This allows for a clearer view of the subgraphs. 

Image 11 – Graph 14 Filter View
Image 11 – Graph 14 Filter View

Error Page

Image 12 is a view of the error page showing all jobs.

The right side of the page shows workunits with error (“WUs with Error”), by count and percentage.

Out of the total number of workunits run for the specified date range (3/13/2020 – 3/19/2020), 147 have failed, which is an error rate of 1.49%.

The pie chart shows errors by error codes, which allows the user to see which errors occur most often.

The chart near the top shows the job names with at least one error, by total number of workunits (“Total WUs”), workunits with errors (“WU’s with Error”), and the percentage of workunits with errors (“% with Error”). 

Image 12 – Error Page
Image 12 – Error Page

In image 13, when a specific Job Name is selected in the top chart, the bottom chart shows the workunit (“WUID”) and the corresponding error message (“Error Message”), along with the “Owner,” “ECL Version,” and “Error Code.”

Image 13 – Error Page View when a specific “Job Name” is Selected
Image 13 – Error Page View when a specific “Job Name” is Selected

Now, let’s discuss the standard workunit services standard library functions used to collect the data that supports the monitor.

Workunit List

std.system.workunit.WorkunitList

The workunit list function:

  • Returns a list of workunit IDs “WUIDs” with details including state, cluster, owner, and job name
  • Allows the user to set parameters to collect the list of WUIDs based on attributes

STD.System.Workunit.WorkunitList (  lowwuid  [ ,  highwuid  ] [ ,  username ] [ ,  cluster ] [ ,  jobname  ] [ ,  state  ]
[ , priority ] [ , fileread ] [ , filewritten ] [ , roxiecluster ] [ , eclcontains ] [ , online ] [ , archived ] [ , appvalues] }

For example, a user can pass a username parameter and only get workunits for that specific user. The user can also specify whether or not to return workunits that have been archived.  

In this example, the low workunit ID “lowwuid” and high workunit ID “highwuid” parameters are used to capture a list of non-archived WUIDs that have run on Thor within a date/time range. The user passed the start date of the low workunit ID and the end date of high workunit ID. The result is a list of workunits that have run in that range. 

Workunits := std.system.workunit.WorkunitList (‘W’ + startDate + ‘-000000’, ‘W’ + endDate + ‘-999999’)’

Image 14 – WorkunitList Function Output
Image 14 – WorkunitList Function Output

When there are a list of specified workunit ID’s, other functions can be used to get details about those workunits.

Workunit Timings

std.system.workunit.WorkunitTimings

Image 15 – WorkunitTimings Function Output
  Image 15 – WorkunitTimings Function
                     Output

The workunit timings function returns time spent in each step for a given WUID.

The graph on the right shows how this function reports the graph timings. 

The information in the graph was parsed to collect the duration for each graph and subgraph, to support the graph timings page, as discussed earlier. 

  • Time spent in graph 1, subgraph 1 is highlighted in (yellow)
  • If the WUID ran on hthor, name may look different  (like w1: graph1: sg1), so the user used two regular expressions to pick up either value. 
  • Compile time is highlighted in (purple)
  • Processing time is highlighted in (pink)
  • DFU time is the sum of entries with Spray or despray in the name is highlighted in (teal)                                

 

Workunit TimeStamps

std.system.workunit.WorkunitTimeStamps

The workunit timestamps function returns timestamps for steps in the given workunit ID.

  • Start time is the earliest time with “id” showing as “Created” and an empty “application” value (in yellow)
  • End time is the latest time with “id” showing as  “Finished,” and an empty “application” value (in purple)
  • here may be more than one “finished” time, so be sure to sort the data to capture the latest time.
Image 16 – WorkunitTimestamps Function Output
Image 16 – WorkunitTimestamps Function Output

Workunit Messages

std.system.workunit.WorkunitMessages

A workunit message function:

  • Returns any messages associated with a given WUID
  • Includes informational messages, warnings, and errors. Errors have a severity of 2

A WUID may have more than one message. It can have an informational message, as well as an error message. There can also be multiple error messages associated with the workunit ID. 

Image 17 – WorkunitMessages Function Output  
Image 17 – WorkunitMessages Function Output  

Example: Code

In this example, the user uses workunit services functions to capture error messages for all workunits that ran between February 26, 2020 and February 27, 2020. 

  • First, a list of workunits is captured in the date range using the WorkunitList function. 
  • The list is then projected and the other workunit service functions are called within the “transform” to create child datasets, with the timing, timestamp, and message details for each workunit.
  • If you’re not running on “hThor,” wrap the workunit services call in “noThor,” because workunit service functions cannot run from a Thor slave.
  • Once the list of workunits has been created, the messages from the child dataset are normalized, messages that are not errors are skipped over, and newlines and tabs from the messages are stripped. 
/*************************************************************************************************
Get Workunit list with child datasets for WorkunitTimings,
WorkunitTimeStamps, and WorkunitMessages 
*************************************************************************************************/
export getWorkunitList(unsigned4 startDate, unsigned4 endDate) := function

  workunits := std.system.workunit.WorkunitList(‘W’ + ‘-000000’, ‘W’ + endDate + ‘-999999’);

  wuListLayout getWUList (recordof (workunits)  L)  := transform
      self.wuTimings    :=  std.system.workunit.WorkunitTimings(trim (L.wuid, right, left));
      self.wuTimestamp  :=  std.system.workunit.WorkunitTimestamps(trim (L.wuid, right, left));
      self.wuMessages   :=  std.system.workunit.WorkunitMessages(trim (L.wuid, right, left));
      self              :=  L;
  end;

  //Workunit Services can not run from a thor slave, so wrap in nothor
  return nothor (project (workunits, getWUList (left)));
  end;

/**********************************************************************************************************
Get error messages for the provided workunit list.
**********************************************************************************************************/
    export getErrMsgs (dataset (wuListLayout)  wuList)  :=  function

    errMsgLayout  normMsgs (wuListLayout  L,  WsMessage R)  :=  transform
      self.wuid     :=  if (R.severity = 2,  L.wuid,  skip) ;  //  Only keep error messages
      self.message  :=  regexreplace (‘ (\\n\\t) ‘,  R.message,  ‘  ‘ );  //  Strip newlines and tabs
      self          :=  R;
    end;

    return normalize (wuList,  left.wuMessages,  normMsgs (left, right));
 end; 

    // Output the error messages for WUIDs run on Thor 2/26 – 2/27
    wuList  := getWorkunitList (20200226, 20200227);
    output (getErrMsgs (wuList));

The output is a dataset that shows error messages for all workunits that ran between 2/26 and 2/27.

Image 18 – Example Code Output
Image 18 – Example Code Output

So far, we looked at a demo of the HPCC Systems Thor Monitor in Power BI and learned about Workunit Services standard library functions. Now, let’s look at how data flows from Thor to Power BI.

Getting Data to Power BI

There are a number of steps involved in getting data to Power BI. They are the following: 

Step 1 – Data is collected from multiple Thor environments, so ECL code using workunit services functions are run on each of the Thor environments to collect that data. This results in an output of three (3) datasets. One dataset has a list of workunit ids and the details from workunitlists and workunittimings. One dataset contains graph timings for all of the workunits, with one record for each subgraph. And another dataset shows the error messages with one record for each error message.

Step 2 – There is ECL code that is running on one of the Thor environments that collects the data from each of the other environments, combines it all together, does some minor transformations, and joins it with the historical data. 

Step 3 – ECL desprays the data to a server where a script picks it up and loads it to SQL server.

Step 4 – Once it’s in the SQL server tables, Power BI connects to SQL server and picks up the data. Power BI recognizes the relationships between the tables and allows for creation of visualizations. 

Image 19 – Getting Data to Power BI
Image 19 – Getting Data to Power BI

Summary

In this blog we looked at how the ECL Workunit Services standard library functions are used to capture details about workunits running on Thor. These details included processing time, errors, current state, and other data points. We also examined the flow of this data from Thor to Power BI, and looked at how this information can be used for monitoring, trending, error analysis, degradation, and other data points that can help improve Thor environment efficiency.

Future work includes adding additional monitoring features.

About Jessica Skaggs

Jessica is a Consulting Software Engineer at LexisNexis Risk Solutions. She has been with LexisNexis Risk Solutions since 2001. Jessica spent 16 years as a Image 20developer and technical lead with the Legal & Professional division, before moving to her current role, when she began learning about and working with HPCC Systems, ECL, and Power BI. Jessica has a Bachelor of Science degree in Systems Analysis from Miami University (OH).

Acknowledgements

A special thank you to Jessica Skaggs for her amazing presentation, “HPCC Systems Thor Monitor – Using Workunit Services and Power BI to Monitor Thor Activity” at HPCC Systems Tech Talk 32, and her contributions and guidance for this blog post.