Redundant data in raw files
Hello,
I have about 10 files that are related by a 3-field composite key (roughly 45 bytes total). In the RDBMS world I would be inclined to convert the natural key into a numeric surrogate key to reduce the footprint and, hopefully, speed up sorts and joins. The 3 key fields make up 20-30% of the total data size.
From a performance perspective, does it make sense to do any sort of speculative pre-processing in HPCC?
Thanks!
- aintnomyth
- Posts: 86
- Joined: Wed Jul 13, 2011 7:40 pm
Well - I don't know if I would describe it as 'speculative pre-processing' but essentially yes. Whilst HPCC is probably the fastest thing out there - we are still bound by the laws of physics. In general you should get your data model correct and TIGHT as early in your processing as possible.
By TIGHT I mean:
a) Fixed-length fields if possible (and as small as possible)
b) Into 'correct' types if possible (numbers as UNSIGNED/INTEGER etc)
c) Linking fields as UNSIGNED
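To make point (c) concrete, here is a minimal sketch of swapping a 3-field natural key for a dense integer surrogate before any joins happen. It is illustrative Python rather than ECL, and the file contents, field values, and the helper names `build_surrogates` and `apply_surrogates` are all made up for the example.

```python
from typing import Dict, Iterable, List, Tuple

NaturalKey = Tuple[str, str, str]


def build_surrogates(keys: Iterable[NaturalKey]) -> Dict[NaturalKey, int]:
    """Assign each distinct natural key a dense integer ID, starting at 1."""
    mapping: Dict[NaturalKey, int] = {}
    # Sorting makes the assignment deterministic across runs.
    for key in sorted(set(keys)):
        mapping[key] = len(mapping) + 1
    return mapping


def apply_surrogates(rows: List[tuple], mapping: Dict[NaturalKey, int]) -> List[tuple]:
    """Replace the three leading key fields of each row with the surrogate ID."""
    return [(mapping[(a, b, c)], payload) for a, b, c, payload in rows]


# Two hypothetical files that share the same composite key.
file_a = [("ACME", "2011-07", "EAST", 10.0), ("ACME", "2011-07", "WEST", 12.5)]
file_b = [("ACME", "2011-07", "WEST", "note"), ("BOLT", "2011-08", "EAST", "other")]

# Build one mapping over the keys of every file, then rewrite each file.
mapping = build_surrogates([row[:3] for row in file_a + file_b])
tight_a = apply_surrogates(file_a, mapping)
tight_b = apply_surrogates(file_b, mapping)
```

The mapping has to be built once across all the related files, otherwise the same natural key could get different IDs in different files and the link would be lost.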
Now - there is a slightly 'greyer' trade-off with regard to some of the more exotic but compressed types such as QSTRING and UNSIGNED3 etc. It costs more cycles to get data in and out of those types but they are smaller (which means they come off disk faster, go across network links faster and consume less memory). My general rule of thumb is that fields I use 'all the time' I will allow a fatter type that is natural to the system (UNSIGNED4/UNSIGNED8 etc) - fields that are just carried around for occasional use I will squeeze down.
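A rough way to see the size side of that trade-off, again sketched in Python rather than ECL. The 20/15/10 split of the 45-byte key is an assumption (the original post only gives the total), and `pack_u3`/`unpack_u3` are hypothetical helpers that mimic a 3-byte unsigned field.

```python
import struct

# Hypothetical layout of the 45-byte natural key: three fixed-width strings.
natural_key = struct.Struct("<20s15s10s")
unsigned8 = struct.Struct("<Q")  # 8-byte unsigned, a native machine width
unsigned4 = struct.Struct("<I")  # 4-byte unsigned, a native machine width


def pack_u3(value: int) -> bytes:
    """Squeeze an integer into 3 bytes; raises OverflowError above 2**24 - 1."""
    return value.to_bytes(3, "little")


def unpack_u3(raw: bytes) -> int:
    """Widen a 3-byte field back to a normal integer before using it."""
    return int.from_bytes(raw, "little")


sizes = {
    "natural key": natural_key.size,
    "UNSIGNED8": unsigned8.size,
    "UNSIGNED4": unsigned4.size,
    "UNSIGNED3": len(pack_u3(0)),
}
bytes_saved_per_row = sizes["natural key"] - sizes["UNSIGNED8"]
```

The extra pack/unpack step on the 3-byte field stands in for the "more cycles" cost mentioned above; the 4- and 8-byte fields map straight onto widths the CPU handles natively.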
HTH
David
- dabayliss
- Community Advisory Board Member
- Posts: 109
- Joined: Fri Apr 29, 2011 1:35 pm
That definitely helps, thanks for the quick reply.
- aintnomyth
- Posts: 86
- Joined: Wed Jul 13, 2011 7:40 pm