
Data Management (ingestion and more)

Post questions or comments on how best to manage your big data problem

Fri Jan 01, 2016 8:41 pm


I was wondering whether there is any solution/product/plan for easier data ingestion in a "pull" fashion (static data or even a web service)?
Like downloading a file, maybe unzipping it, loading it and probably running some transformations on it to get something we can work from (so, a download-on-steroids plus ELT). Or querying the MARTA Bus Real-time RESTful Web Service. And all of that, of course, within the coziness of my ECL IDE.
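For the web-service case, the kind of thing I have in mind is a sketch like this (the endpoint URL is only a placeholder, and SEPARATOR('') simply captures each line of the raw response as one field):
Code: Select all
rRawLine := RECORD
  STRING line;    // one line of the raw HTTP response body
END;

// Placeholder endpoint; a real call would hit the MARTA GetAllBus resource
svc_url := 'http://example.org/RestBusRealTimeService/GetAllBus';

// Run curl on the node executing the workunit and capture its stdout
busFeed := PIPE('curl -s "' + svc_url + '"', rRawLine, CSV(SEPARATOR('')));

OUTPUT(busFeed, NAMED('RawBusFeed'));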

I'm really not talking about grabbing data from an RDBMS, and I don't see why I'd need to use Flume for this kind of ingestion (not a big fan of squashing flies with a hammer).
But I end up doing most (if not all) of my work through the ECL IDE, and I find it frustrating to have to switch to anything else just for data ingestion.

Loading third party data is something I have to deal with a lot, but maybe I'm just the exception?

I want to run ingestion against a single node, and if possible always the same one. I can provision extra space on it for data ingestion. Running ECL code against hThor would do the trick.

For example:
Code: Select all
Curl := MODULE
  EXPORT info_layout := RECORD
    STRING content_type;
    STRING http_code;
  END;

  EXPORT download( STRING url, STRING localUri, ..... ) := PIPE('curl -w \'%{content_type}\t....\' -s -o ' + localUri + ' "' + url + '"', info_layout, CSV(SEPARATOR('\t')) );
END;


and wrap other Linux programs as well, ending up with a small library that helps ingest data.
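A call to that wrapper would then be a one-liner (a sketch, assuming the elided parameters have defaults; the URL and local path are placeholders):
Code: Select all
rInfo := Curl.download('http://example.org/ghcnd_all.tar.gz',
                       '/ingestion/NOAA/ghcnd_all.tar.gz');
OUTPUT(rInfo, NAMED('DownloadInfo'));   // content_type and http_code reported back by curl -w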

Code: Select all
IMPORT Std;

ingest_download() := FUNCTION
  oLocalPath := '/ingestion/NOAA/GHCN';
  oLocalFile := oLocalPath + '';
  RETURN SEQUENTIAL(
    OUTPUT( BinUtils.mkdir( oLocalPath, true ), NAMED('CreatePath')),
    OUTPUT( Curl.download('http://...../', oLocalFile, false), NAMED('Download')),
    OUTPUT( BinUtils.checksum( oLocalFile ), NAMED('Checksum')),
    OUTPUT( Zip.unzip(oLocalFile, oLocalPath, true), NAMED('Unzipping'))
  );
END;

ingest_elt() := FUNCTION
  oDS := DATASET(Std.File.ExternalLogicalFilename(LandingZone_IP, File_In), raw_nppes_layout,
                 CSV(HEADING(1), SEPARATOR([',']), QUOTE(['"']), TERMINATOR(['\n','\r\n','\n\r'])));
  oDist := DISTRIBUTE(oDS, HASH(npi));
  RETURN oDist;
END;
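And one hThor workunit could chain the two steps (a sketch; the logical file name is made up):
Code: Select all
SEQUENTIAL(
  ingest_download(),                                          // pull, checksum and unzip on the landing node
  OUTPUT(ingest_elt(), , '~ingest::nppes::raw', OVERWRITE)    // load, distribute and persist as a logical file
);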

Is that just plain stupid?

I've read a bit about Orbit (although I can't find much about it, and no release as far as I can see) and about some initiative to make HPCC modules easier to use/integrate, or something of the sort.
The vision here is a bit similar. I'd love to create some sort of (ECL) module/package that would manage access, ingestion, cataloging, updates, etc. for certain (public or internal) data (e.g. weather historical data or weather forecast data), so that someone would just need to install that module and run the necessary functions/queries from it to ingest the data. "The rpm for data ingestion."

One way would maybe be through apt: creating a Debian package with ECL code in it, and calling it (in a PIPE-against-hThor kind of way) to install the code and run whatever I need against it to ingest whatever data it manages.
In an apt way, it'd be like
Code: Select all
apt install noaa-weather-forecast
apt install noaa-weather-historical
noaa-weather-forecast ingest 3days
noaa-weather-historical ingest temp 2014

and I could wrap each in some PIPE code to run any of those commands through hThor (if viable).
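As a sketch of what that PIPE wrapping might look like (run_cmd below is a made-up helper that just captures stdout line by line, and installing packages would of course need the right permissions on the node):
Code: Select all
rLine := RECORD
  STRING line;
END;

// Hypothetical helper: run a shell command and return its stdout as a recordset
run_cmd(STRING cmd) := PIPE(cmd, rLine, CSV(SEPARATOR('')));

SEQUENTIAL(
  OUTPUT(run_cmd('apt install -y noaa-weather-forecast'), NAMED('Install')),
  OUTPUT(run_cmd('noaa-weather-forecast ingest 3days'),   NAMED('Ingest'))
);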

Thoughts? Comments? Criticism?


Tue Jan 05, 2016 7:26 pm

Hi Luke, thanks for your post! It appears you are looking for a way to download, decompress and parse individual files held on a remote site accessible via HTTP, with a well-known path. Would using wildcards in the path satisfy this requirement?
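For example, I believe the spray functions in the Standard Library accept a wildcard in the source path, so something along these lines might do it (the IP, path and target cluster below are placeholders):
Code: Select all
IMPORT Std;

Std.File.SprayVariable('10.0.0.100',                   // landing zone IP
                       '/mnt/landingzone/ghcn/*.csv',  // wildcard source path
                       , , , ,                         // record size / CSV options left at defaults
                       'mythor',                       // destination group (cluster)
                       '~ingest::ghcn::raw',           // destination logical file
                       , , ,                           // timeout, ESP server, max connections
                       TRUE);                          // allow overwrite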
Site Admin

Tue Jan 05, 2016 8:46 pm

Thanks for the reply!

Hmmm...I'm not sure I understand the wildcard idea. Maybe you're one step ahead of me. Care to elaborate?

Otherwise that's correct: I'm looking for some guidelines/feedback/thoughts on how to go about downloading and pre-processing some files before loading them into my cluster (with or without further processing), all through ECL actions (i.e. trying to avoid both the ECL Watch upload/spray pages and logging into the server to run Linux commands manually).

I guess I have two problems/uncertainties.

Data processing is (or can be) massively parallel, and HPCC Systems clusters handle that just great. But I don't see (and am not experiencing) massive parallelism in data ingestion in all cases, especially those where I'm pulling data from a third party.
I can load a file from a path at a given IP (Std.File.ExternalLogicalFilename), but I can't (true?) specify the IP address on which to run "curl" (via PIPE) to download that file from a URL and store it at that same path. Am I missing something? Going about it wrong?
If that's right, one thought I had was to make sure hThor is set up on a single node only. hThor being a single-node process, I'd still have to make sure it runs off the same node every time so I can load (Std.File.ExternalLogicalFilename) the data from there. Thoughts?
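For what it's worth, a quick sanity check I could run on hThor to see which physical node a PIPE actually executes on would be something like this:
Code: Select all
rHost := RECORD
  STRING host;
END;

// Echo the hostname of whichever node runs the PIPE command
whereAmI := PIPE('hostname', rHost, CSV(SEPARATOR('')));
OUTPUT(whereAmI, NAMED('PipeHost'));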

Now let's say the issue mentioned above is not an issue (if it ever was) and I can provide someone with some ECL code and modules to load public school directory data as well as financial data, and it works on their cluster (either it works on any cluster setup, or that someone configured the cluster as stated above).
Code: Select all
IMPORT Education.NCES;
NCES.download_and_load_financial(2013, ...);
// Then, in a different execution:
// dir := NCES.dsDirectory(2013);
// fin := NCES.dsFinancial(...);
// A := JOIN(dir, fin, LEFT.unitid = RIGHT.unitid, .....);
// TABLE(A, { ..... }, ..... );
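Internally I imagine that module would just wrap the same primitives shown above, roughly like this (all names and URLs are hypothetical):
Code: Select all
EXPORT NCES := MODULE
  EXPORT download_and_load_financial(UNSIGNED4 year) := FUNCTION
    url  := 'http://example.org/nces/financial_' + (STRING)year + '.zip';  // placeholder URL
    dest := '/ingestion/NCES/financial_' + (STRING)year + '.zip';
    RETURN SEQUENTIAL(
      OUTPUT(Curl.download(url, dest), NAMED('Download')),
      OUTPUT(Zip.unzip(dest, '/ingestion/NCES', true), NAMED('Unzip'))
      // ...then load and publish as '~nces::financial::' + (STRING)year
    );
  END;
END;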

Is there any other way (now or in the works?) to share that whole Education package besides versioning the ECL code and letting people clone that git repo (or subversion and such)?

Thanks a lot!
