Data synchronization and querying

Post questions or comments on how best to manage your big data problem

Thu Aug 02, 2012 9:07 am

Hello,

As per my understanding, the ECL queries submitted to a Roxie cluster can be:
1. Executed on a remote Thor cluster which has all the big data (TB/PB/ZB)
2. Executed on the same Roxie cluster itself, first by referring to the remote data while it is being copied onto Roxie, and then locally

I have a few questions here:

1. Assuming case 1 is happening, the query processing takes several seconds, probably minutes, given the large data. While these queries are in progress, some new data is sprayed onto this Thor cluster. Will the running query consider this new data set, or will it continue on the 'older' data set and give results accordingly?
2. Assuming case 2 is happening:
i. Again, the question is the same as in 1.
ii. Suppose the query processing completes at t1 and some new data has already been added to the Thor cluster before t1. How and when does this new data come to Roxie (synchronization)? Then, at t2, if the same or a similar query comes in, will it run on the 'latest' data set? In simple words, how is the data between Thor and Roxie 'synchronized'?

Thanks and regards !
kaliyugantagonist
 

Thu Aug 02, 2012 1:54 pm

> As per my understanding, the ECL queries submitted to a Roxie cluster can be:
> 1. Executed on a remote Thor cluster which has all the big data (TB/PB/ZB)

Sorry, but that is not correct.

Queries sent to a Roxie are executed on that Roxie -- they may either:
  • access data locally on the Roxie (the "normal" way things are done in a production environment)
  • or remotely access the data on a Thor cluster (usually done from a 1-node Roxie used just for query development/testing)
You could, of course, use SOAPCALL to have your Roxie query launch a Thor job, but that would be working against the system design and not with it.

> 2. Executed on the same Roxie cluster itself, first by referring to the remote data while it is being copied onto Roxie, and then locally

Yes. That scenario is possible. You can have Roxie configured to access data remotely while the data is in the process of being copied from Thor to Roxie.
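
To make that scenario concrete, here is a minimal toy sketch in Python of the read path it implies: use the local copy if the file has already arrived, otherwise read remotely. It is not HPCC code; `ToyRoxie`, `finish_copy`, and `read` are hypothetical names invented for illustration, and the real behaviour is controlled by Roxie configuration rather than anything like this class.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRoxie:
    """Toy stand-in for a Roxie cluster (hypothetical, not an HPCC API)."""

    # Logical file name -> records as held on the (toy) Thor cluster.
    remote_files: dict
    # Files whose copy to Roxie has completed so far.
    local_files: dict = field(default_factory=dict)

    def finish_copy(self, name):
        # Simulates the background copy from Thor completing for one file.
        self.local_files[name] = list(self.remote_files[name])

    def read(self, name):
        # Prefer the local copy; fall back to remote access while copying.
        if name in self.local_files:
            return "local", self.local_files[name]
        return "remote", self.remote_files[name]


roxie = ToyRoxie(remote_files={"people": [1, 2, 3]})
source_before, _ = roxie.read("people")   # copy not finished yet
roxie.finish_copy("people")
source_after, rows = roxie.read("people")  # now served from the local copy
print(source_before, source_after, rows)
```

The query sees the same records either way; only where they are read from changes.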
> 2. Assuming case 2 is happening:
> i. Again, the question is the same as in 1.
> ii. Suppose the query processing completes at t1 and some new data has already been added to the Thor cluster before t1. How and when does this new data come to Roxie (synchronization)? Then, at t2, if the same or a similar query comes in, will it run on the 'latest' data set? In simple words, how is the data between Thor and Roxie 'synchronized'?

This question presumes that HPCC operates like an RDBMS and can do OLTP -- this is not the case. HPCC is a batch-processing type of environment. Data files read in a job are never written to, therefore there is no "update" functionality. There are techniques that can be used to make an HPCC environment closely emulate an OLTP system, but accomplishing that requires a fairly complex design and implementation.

Thor and Roxie serve very different purposes:
  • Thor does one job at a time and is used to prepare massive amounts of data for delivery to customers.
  • Roxie delivers final result data to each query as it comes in, using the data that has been pre-built, pre-linked, pre-whatevered by Thor so that Roxie can deliver the individual goods as quickly as possible, handling literally thousands of separate query results per second.
  • The only "normal" direct interaction between Thor and Roxie comes when a query is published to Roxie and Roxie copies the necessary data over from Thor.
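
The points above (files that are read are never written to, and Roxie copies data when a query is published) can be pictured with a small toy model in Python. This is only a sketch under stated assumptions: `ToyThor`, `SnapshotRoxie`, `spray`, `publish`, and `run` are hypothetical names, and the "publish again to pick up new data" step is an assumption of the model, not a description of how a real HPCC deployment refreshes data.

```python
class ToyThor:
    """Toy Thor: files are write-once, so new data arrives as a new file."""

    def __init__(self):
        self._files = {}

    def spray(self, name, records):
        # Existing files are never modified; reusing a name is an error here.
        if name in self._files:
            raise ValueError(f"file {name!r} already exists and is read-only")
        self._files[name] = tuple(records)

    def get(self, name):
        return self._files[name]


class SnapshotRoxie:
    """Toy Roxie: each published query serves the data copied at publish time."""

    def __init__(self):
        self._published = {}

    def publish(self, query_name, thor, file_name):
        # Copy the data over from Thor once, when the query is published.
        self._published[query_name] = thor.get(file_name)

    def run(self, query_name, predicate):
        return [r for r in self._published[query_name] if predicate(r)]


thor = ToyThor()
roxie = SnapshotRoxie()

thor.spray("sales_v1", [10, 20])
roxie.publish("big_sales", thor, "sales_v1")

# New data lands on Thor as a separate file; the published query is unaffected.
thor.spray("sales_v2", [10, 20, 30])
before = roxie.run("big_sales", lambda amount: amount > 15)

# Assumption of this model: publishing against the new file picks it up.
roxie.publish("big_sales", thor, "sales_v2")
after = roxie.run("big_sales", lambda amount: amount > 15)
print(before, after)
```

In this model a query that is running, or was published earlier, keeps answering from the data it was given; nothing changes underneath it.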
HTH,

Richard
rtaylor
Community Advisory Board Member
 

