
Interaction between big and small data

Post questions or comments on how best to manage your big data problem

Fri Jul 15, 2011 3:40 pm

Sorry for the long post but I want to give an accurate description of the problems I'm facing. Any help is appreciated!

I work for a healthcare decision support business. We're looking for a better solution than our current Oracle/.Net approach. We service dozens of clients, each using one or more of: large batch processes (sets of 10-15 text files totaling 20-30GB); small batch processes (files totaling 100MB-5GB); or near real-time TCP/IP streams.

The batch data is cleaned, aggregated, and delivered to reporting servers in monthly cycles. Our current technology drives the batch intervals; clients actually want to break the batches into smaller chunks (like daily instead of monthly), as their data tends to be stale by the end of the reporting month. The real-time data is cleaned, analyzed, and compared to previously received batches using processes that run at 15-minute, 1-hour, or 1-day intervals. The streams are tiny: less than 50MB per day of raw data.

Note: the streams represent clinical data for active inpatient admissions, which is only valuable for a few days. Inpatient data is not appended to any of the batched data. Hospitals have an internal process to alter/enhance the data after patients are discharged; this enhanced discharge data is sent to us in the large batches.

I can definitely see the HPCC system solving most of the batch problems but I'm a little fuzzy on two things:
1) Handling the streams - My approach would be to separate the batch data from the streams data using 2 THORs. THOR1 would handle large batches and publish to Roxie. THOR2 would handle the micro batches but here’s where my plan falls apart. The micro batches need read-access to data on Roxie, would this happen in THOR2 or would I need to publish to Roxie first? I’m assuming it’s possible for two THORs to publish to the same Roxie. Would I be better off managing that data in an OLTP database since it's not strictly "big data"? The problem with that approach is I would have to duplicate all of the parsing/cleanup/analysis rules in ECL and SQL/.Net.

2) Reducing batch latency with more frequent smaller batches - batch files could contain new records, updates for existing records, "reversals" indicating deletes of existing data, or duplicated junk data that has already been processed. Since the final output is a blend of detailed and aggregated information, we have to re-aggregate practically everything when any data changes. This is expensive in our current environment; the total "ingest, recompute, and publish" time represents our theoretical minimum process latency. HPCC seems to handle this better, but I'm not seeing any sort of append/delete mechanism. Does that suggest that each data deliverable (on Roxie) inherently recomputes everything, and if it does, do I even care?

Thanks for the help!
Posts: 86
Joined: Wed Jul 13, 2011 7:40 pm

Fri Jul 15, 2011 4:30 pm

Hi Jason,

Welcome to my life :) This is very much the sort of system that we have (actually we have a number of them for different parts of the business). As I'm sure you appreciate, it is difficult to design an 'optimal' system without more knowledge of the speeds and feeds - but let me outline a few facts for you that I think might help you get there.

1) ALL clusters in a single environment (Dali instance) can read from and write to each other's disks (in fact, you can read between environments too). Where you choose to put your 'master' copy of the data is a matter of system design, but wherever it is, everyone will be able to get to it.

2) ECL has a notion of superfiles (and superkeys). These are logical file names that can be used by your ECL code; however, they can refer to multiple actual files. Therefore it is possible to 'append' data to a file, or even to 'shift' a chunk of data between files, without any actual data moving (only metadata changes).
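To make that concrete, here is a minimal sketch of a metadata-only 'append' using the standard library's superfile transaction functions. The record layout and all the logical file names (~demo::claims::*) are made up for illustration; substitute your own.

```ecl
IMPORT Std;

// Hypothetical layout for illustration only
ClaimRec := RECORD
  STRING20 claim_id;
  STRING8  svc_date;
END;

// 'Append' a newly sprayed micro-batch to the master superfile --
// only metadata changes; no records are physically copied.
SEQUENTIAL(
  Std.File.StartSuperFileTransaction(),
  Std.File.AddSuperFile('~demo::claims::master', '~demo::claims::batch_20110715'),
  Std.File.FinishSuperFileTransaction()
);

// Downstream ECL reads the superfile exactly as if it were one file
claims := DATASET('~demo::claims::master', ClaimRec, THOR);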

3) In terms of the 'little batches' - as long as you don't need genuine transactions in the computational sense (i.e. a record lock across multiple files), there is no real downside to using the HPCC. We have many, many processes working on mini-batches of 10-100 records at a shot. If you do need transactions, we use what we call a delta-base: an ultra-thin SQL front end which handles transactions in flight, and then we rip the data out of it every (say) 15 minutes.

Addressing your two principal questions - I believe FACT 1 essentially puts your plan back in play.

In terms of your second point - we have a couple of major (in terms of size and criticality) processes that do this - we use a cascading rollup trick. Obviously you can tweak the numbers to suit your circumstances, but the idea is this:
Our 'model' is that we have monthly, weekly, daily, hourly, and 10-minute files. Our master file is really a collection of these wired together with a superfile. Our 'running' process has the job of spitting out 10-minute files (every 10 minutes). We then have an hourly job which rolls the last six up into an hourly file. Then daily we roll 24 of those up into a daily file. We roll seven of those up into a weekly file, and so on.
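A hedged sketch of one rung of that cascade - the hourly rollup - might look like the following. Again, the layout and every logical file name here are hypothetical; the point is that the consolidation is an OUTPUT followed by an atomic superfile swap.

```ecl
IMPORT Std;

ClaimRec := RECORD
  STRING20 claim_id;
  STRING8  svc_date;
END;

// Read everything currently sitting in the 10-minute superfile
tenMin := DATASET('~demo::claims::10min', ClaimRec, THOR);

// Consolidate the six 10-minute sub-files into one physical hourly file
rollHour := OUTPUT(tenMin, , '~demo::claims::hour_2011071514', OVERWRITE);

// Then swap it in and empty the 10-minute superfile in one transaction
SEQUENTIAL(
  rollHour,
  Std.File.StartSuperFileTransaction(),
  Std.File.ClearSuperFile('~demo::claims::10min'),
  Std.File.AddSuperFile('~demo::claims::hourly', '~demo::claims::hour_2011071514'),
  Std.File.FinishSuperFileTransaction()
);
```

The same pattern repeats for the daily, weekly, and monthly rungs, just with different source and target superfiles.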

In the 10-minute file a delete is just a record that says: delete this. Our query processes WILL apply those deletes on the fly if they need to. Then at each rollup, any deletes that can be applied are; others that are not 'paired' yet remain as deletes.
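The pairing step at rollup time can be sketched with a couple of LEFT ONLY joins. This assumes a hypothetical isDelete flag and claim_id match key - your real match logic would be whatever identifies a reversal's target record.

```ecl
ClaimRec := RECORD
  STRING20 claim_id;
  BOOLEAN  isDelete;  // TRUE = 'reversal' record
END;

raw  := DATASET('~demo::claims::hourly', ClaimRec, THOR);
dels := raw(isDelete);
adds := raw(NOT isDelete);

// Drop any record whose delete 'pair' has arrived...
surviving := JOIN(adds, dels, LEFT.claim_id = RIGHT.claim_id, LEFT ONLY);

// ...but keep unpaired deletes so a later rollup can still apply them
unpaired  := JOIN(dels, adds, LEFT.claim_id = RIGHT.claim_id, LEFT ONLY);

rolled := surviving + unpaired;
```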

Naturally we organize our workflow so that our daily, weekly, monthly rollups occur during those periods when our machinery would otherwise be less busy. (Note - we also have technology such as MERGE which can make the rollups MUCH faster than one might expect)
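On the MERGE point: if each sub-file is kept sorted on the rollup key, the consolidation becomes a streaming n-way merge rather than a full re-sort. A minimal sketch, with hypothetical file names and a claim_id sort key:

```ecl
ClaimRec := RECORD
  STRING20 claim_id;
  STRING8  svc_date;
END;

h1 := DATASET('~demo::claims::hour1', ClaimRec, THOR);
h2 := DATASET('~demo::claims::hour2', ClaimRec, THOR);

// Inputs are already sorted on claim_id, so MERGE streams them
// together without the cost of a global SORT
merged := MERGE(h1, h2, SORTED(claim_id));
OUTPUT(merged, , '~demo::claims::day_20110715', OVERWRITE);
```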

I hope the above makes some sense; if not - or if you have some further questions - please feel free to ask.

Community Advisory Board Member
Posts: 109
Joined: Fri Apr 29, 2011 1:35 pm
