

handling updated records

Post questions or comments on how best to manage your big data problem

Fri Sep 02, 2011 5:24 pm

I'm working on a data ingestion process and I keep getting tangled up on record "updates". As far as I know there is no mechanism for updating a logical file (in a SQL sense), correct?

The basic problem is that incoming files can have a mixture of never-before-seen new records, updates to existing records, duplicates of existing records (aka junk), reversal/delete identifiers, and garbage. We need to find the data changes and apply them to the production data repository that is constantly accumulating.

The plan is to structure the program something like this (rough sketch of step 2 below):
1. Spray new files as a working logical file
2. Compare the working file to the production file, identify New/Updated/Junk data in the working set
3. Apply the changed data to production
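
To make step 2 concrete, here is roughly what I have in mind in ECL. The layout, the file names, and the payloadHash change-detection field are all made up, just to show how I would split the working file into new / updated / junk:

    // Hypothetical layout: 'id' is the business key, 'payloadHash' is a hash
    // of the non-key fields used to spot changed records
    Layout := RECORD
        UNSIGNED8 id;
        STRING50  name;
        UNSIGNED8 payloadHash;
    END;

    prod    := DATASET('~thor::prod::master', Layout, THOR);
    working := DATASET('~thor::ingest::working', Layout, THOR);

    // New: keys that do not exist in production at all
    newRecs := JOIN(working, prod, LEFT.id = RIGHT.id,
                    TRANSFORM(LEFT), LEFT ONLY);

    // Updated: key exists in production but the payload has changed
    updRecs := JOIN(working, prod,
                    LEFT.id = RIGHT.id AND LEFT.payloadHash != RIGHT.payloadHash,
                    TRANSFORM(LEFT));

    // Junk: exact duplicates of what production already holds
    junkRecs := JOIN(working, prod,
                     LEFT.id = RIGHT.id AND LEFT.payloadHash = RIGHT.payloadHash,
                     TRANSFORM(LEFT));

junkRecs just gets thrown away; anything that doesn't classify cleanly falls out to the garbage/exception handling.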

For step 3, it seems like we have basically two strategies, but I keep bouncing between "gee, I don't want to overwork THOR, so I'd better find an elegant way to handle updates" and "it's THOR, tell it what you want and get out of the way, don't overthink it". The strategies are (with a rough sketch after each):

Brute Force
1. Create a filtered production recordset, anti-Join Production with the "working updates" recordset
2. Merge the updates and new records with the filtered production data
3. Output a new logical file, perform superfile maintenance as needed
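
The brute-force apply would be something like this, reusing the made-up prod / newRecs / updRecs names from the sketch above; the output name and the superfile swap are placeholders:

    // 1. Anti-join: keep only the production records that are NOT superseded
    survivors := JOIN(prod, updRecs, LEFT.id = RIGHT.id,
                      TRANSFORM(LEFT), LEFT ONLY);

    // 2. Merge the survivors with the updates and the brand-new records
    nextGen := survivors + updRecs + newRecs;

    // 3. Write a new generation of the production file; swapping it into the
    //    superfile and retiring the old generation happens as a separate step
    OUTPUT(nextGen, , '~thor::prod::master::new', OVERWRITE);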

Elegant
1. Use a record versioning field
2. When updates arrive we find the current record version and increment it
3. Hide prior versioned records from downstream processes using a group/filter attribute
4. Periodically adjust the dataset to remove hidden records, if desired
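
For the versioning route I'm picturing something like this; again the layout, field names, and file names are just made-up illustrations:

    // Hypothetical versioned layout
    VLayout := RECORD
        UNSIGNED8 id;
        STRING50  name;
        UNSIGNED4 version;
    END;

    vprod    := DATASET('~thor::prod::versioned', VLayout, THOR);
    incoming := DATASET('~thor::ingest::updates', VLayout, THOR);

    // 2. Find the current version per key and stamp updates with version + 1
    maxVer  := TABLE(vprod, {id, UNSIGNED4 maxv := MAX(GROUP, version)}, id);
    stamped := JOIN(incoming, maxVer, LEFT.id = RIGHT.id,
                    TRANSFORM(VLayout,
                              SELF.version := RIGHT.maxv + 1,  // unmatched keys get 0 + 1
                              SELF := LEFT),
                    LEFT OUTER);

    // 3. The filter attribute downstream code would use to hide old versions
    LatestOnly(DATASET(VLayout) ds) := DEDUP(SORT(ds, id, -version), id);

    // 4. Periodic cleanup is just re-writing LatestOnly(everything) as a new file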

"Brute force" feels wrong because a relatively small amount of incoming data is prompting the rebuild of a relatively large production file.

"Elegant" feels wrong because it's practically a transaction log, and HPCC isn't exactly built for transactions.

I keep thinking that I'm forgetting a big-picture concept somewhere... it has been a couple of years since my ECL classes. Any advice will be much appreciated!

Thanks,
Jason
aintnomyth
 
Posts: 86
Joined: Wed Jul 13, 2011 7:40 pm

Fri Sep 02, 2011 6:45 pm

Interestingly I was having a chat this morning with one of our senior guys about the correct approach to developers producing 'elegant' solutions - unfortunately it is forbidden under the constitution :twisted:

You ALWAYS want to go with brutally simple - until you can't.

In the case of Ingest there is a very simple fall-back from 'too brutally simple' - and that is the delta - or even cascading delta.

The concept is this: if the file being ingested is tiny then you can 'join' (anti-join etc etc) it against the bigger file very easily (eg lookup join, or PARTITION on a join etc). Thus you can very quickly annotate your 'little' file with all of the flags / notes / sequences you need. You then simply dump it (the little file) down on the disk and 'append' it to the larger file using a superfile (or superkey in the roxie delivery case). You can handle a delete by having a 'delete' record that kills its own instance during the read phase.
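
Roughly, the write side of that looks like this in ECL - the layout, the 'deleted' tombstone flag, the ingestDate ordering field and every file name here are just illustrative:

    IMPORT Std;

    // Hypothetical delta layout: 'deleted' marks the record that kills its own
    // instance, 'ingestDate' lets the read phase pick the newest instance
    DLayout := RECORD
        UNSIGNED8 id;
        STRING50  name;
        BOOLEAN   deleted;
        UNSIGNED4 ingestDate;   // YYYYMMDD the fragment arrived
    END;

    annotated := DATASET('~thor::ingest::annotated_delta', DLayout, THOR);
    deltaName := '~thor::prod::delta::20110902';
    superName := '~thor::prod::sf';

    // Dump the little file to disk and 'append' it via superfile maintenance
    SEQUENTIAL(
        OUTPUT(annotated, , deltaName, OVERWRITE),
        Std.File.StartSuperFileTransaction(),
        Std.File.AddSuperFile(superName, deltaName),
        Std.File.FinishSuperFileTransaction()
    );

    // Read phase: newest instance of each key wins, tombstones drop the key
    allRecs := DATASET(superName, DLayout, THOR);
    current := DEDUP(SORT(allRecs, id, -ingestDate), id)(NOT deleted);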

Then every 'now and again' (obviously dependent upon speeds, feeds and thor cycles) you gather up the original big file and all the data in the little bits, and you do the sort/dedup/re-write - and you are back down to one big 'perfect' file and you can start collecting your fragments again.
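
The periodic consolidation is then just a re-write plus superfile maintenance, something like this (same illustrative names as above):

    IMPORT Std;

    // Everything - the base file plus all the fragments - is in the superfile
    everything := DATASET('~thor::prod::sf', DLayout, THOR);

    // Sort/dedup back down to one 'perfect' file: newest wins, tombstones drop out
    perfect := DEDUP(SORT(everything, id, -ingestDate), id)(NOT deleted);

    newBase := '~thor::prod::base::20111001';
    SEQUENTIAL(
        OUTPUT(perfect, , newBase, OVERWRITE),
        Std.File.StartSuperFileTransaction(),
        Std.File.ClearSuperFile('~thor::prod::sf'),
        Std.File.AddSuperFile('~thor::prod::sf', newBase),
        Std.File.FinishSuperFileTransaction()
    );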

In the extreme case of a HUGE base file (Petabytes, or perhaps very high TB) with low latency update requirements (say - 5 minutes), you can go to the 'cascading' version. Same idea - except you might only want to touch the huge file once a month but you don't want thousands of scratty little files lying around. So you (say) roll your 5-minute files into a 1-hour file, then roll your 1-hour files into a daily file, dailies into a weekly and then finally weeklies into the main file once a month. (Obviously I am picking arbitrary numbers - you can have a 27.32 hourly file if you really want)
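
One level of the cascade is the same consolidation step, just parameterised by tier - again purely illustrative names, reusing the DLayout above:

    IMPORT Std;

    // Roll every fragment currently in 'fromSF' into a single sub-file of 'toSF'
    RollLevel(STRING fromSF, STRING newSub, STRING toSF) := FUNCTION
        recs := DATASET(fromSF, DLayout, THOR);
        RETURN SEQUENTIAL(
            OUTPUT(recs, , newSub, OVERWRITE),
            Std.File.StartSuperFileTransaction(),
            Std.File.ClearSuperFile(fromSF),
            Std.File.AddSuperFile(toSF, newSub),
            Std.File.FinishSuperFileTransaction()
        );
    END;

    // e.g. roll the 5-minute fragments into an hourly sub-file
    RollLevel('~thor::prod::sf::5min', '~thor::prod::hour::2011090218', '~thor::prod::sf::hour');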
dabayliss
Community Advisory Board Member
 
Posts: 109
Joined: Fri Apr 29, 2011 1:35 pm

Fri Sep 02, 2011 6:47 pm

Incidentally - 'Ingest' is one of the capabilities handled automatically by our SALT code generator ....

David
dabayliss
Community Advisory Board Member
 
Posts: 109
Joined: Fri Apr 29, 2011 1:35 pm

Tue Sep 06, 2011 12:49 pm

Thanks, David, that was very helpful.
aintnomyth
 
Posts: 86
Joined: Wed Jul 13, 2011 7:40 pm

