

Implement an approximate n-tile algorithm

This forum is for topics related to the Google Summer of Code (GSoC) projects and the HPCC Systems Intern program.

Mon Mar 23, 2015 2:58 pm

Hello,

I am also interested in writing a proposal for implementing an approximate n-tile algorithm. Could you please give more explanation about this kind of algorithm? Is it an NP-hard problem? I found a MAX-MIN tiling algorithm for two-dimensional arrays. Is that what I need?

Thank you in advance.
artkom
 
Posts: 4
Joined: Mon Mar 23, 2015 2:54 pm

Mon Mar 23, 2015 3:37 pm

Hi,

I have alerted the mentor of this project to your request for information. In the meantime, there are some things you can do to get started with HPCC Systems and familiarise yourself with the ECL language:

• Download the software (server, client tools, Graph Control, ECL IDE) from here: http://hpccsystems.com/download/free-community-edition

• If you want to delve into the platform code, download the sources from here: https://github.com/hpcc-systems/HPCC-Platform. The sources for the ML library can be found here: https://github.com/hpcc-systems/ecl-ml

• Once you have an HPCC Systems environment up and running, run some simple examples. Try the HPCC Systems Data Tutorial here: http://hpccsystems.com/download/docs/da ... rial-guide or Six Degrees of Kevin Bacon here: http://hpccsystems.com/download/docs/six-degrees

• Take a basic online ECL training course here: http://hpccsystems.com/community/traini ... s/training

I hope this helps for now.

Lorraine Chapman
HPCC Systems GSoC Administrator
lchapman
 
Posts: 56
Joined: Wed May 11, 2011 11:49 am

Mon Mar 23, 2015 4:32 pm

The algorithm is definitely not an NP-complete type of problem.

There is an exact n-tile implementation in the ML.FieldAggregates attribute. The problem with this implementation is that it requires a global sort of the data. This in turn involves a complete re-distribution of the data.

An approximate solution will not require that global sort. Instead, there will be some local sorting and local data reduction; and then some distribution of the reduced dataset.
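To make the cost concrete, here is a minimal single-machine sketch in Python (illustrative only; the actual implementation is the ECL in ML.FieldAggregates) of what an exact n-tile computes. The sorted() call is the step that, on a cluster, becomes the global sort and hence the complete re-distribution:

```python
def exact_ntiles(values, n):
    """Exact n-tile boundaries: the values at ranks k*N/n, k = 1..n-1.
    One common rank convention; others differ slightly at the cut points."""
    ordered = sorted(values)  # on a cluster this is the global sort
    size = len(ordered)
    return [ordered[(k * size) // n] for k in range(1, n)]

data = [7, 1, 9, 3, 5, 8, 2, 6, 4, 10]
print(exact_ntiles(data, 2))   # 2-tile boundary, i.e. a median cut -> [6]
print(exact_ntiles(data, 4))   # quartile boundaries -> [3, 6, 8]
```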
john holt
Community Advisory Board Member
 
Posts: 22
Joined: Mon Jun 25, 2012 12:43 pm

Tue Mar 24, 2015 6:08 pm

What does "fixed accuracy (m = n*#nodes will produce counts that are accurate within the number of nodes)" mean?
n = the number of n-tiles,
#nodes = the number of nodes? If so, it will always be a constant!

Why do we have to do a "local data reduction"? Let's imagine we sprayed our dataset across N nodes. We still have the same number of records, right?
artkom
 

Wed Mar 25, 2015 5:41 am

Let me start with the last question first. Suppose I have sprayed a 10 million record dataset to the 100 nodes of a 100 node cluster. Assume that the objective is to find the 2-tile (which is the median).

Each node sees only the data on that node. Each node can determine the median independently, but the medians may all be different. If I want to know the median for the total dataset, there needs to be some information sharing.

Obviously, the records themselves could be shared; this is effectively the current implementation approach. But if I instead perform a local sort and some data reduction, such as reducing the original set to 100 records that each describe 1% of the local data with high and low values, I can share that reduced data with all of the nodes, so every node sees the 100*100 = 10 thousand summary records.

Each node independently uses these 10 thousand summary records to approximate the global median. The degree of accuracy will be a function of the number of nodes and the number of summary records.
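The scheme above can be simulated on one machine. This is a hypothetical Python sketch (the names, the (low, high, count) record shape, and the midpoint rule are my assumptions, not the eventual ECL design): each "node" sorts locally, reduces its data to summary records, the summaries are shared, and the median is estimated from them alone:

```python
import random

def node_summary(local_data, m):
    """Reduce one node's data to about m summary records: (low, high, count)."""
    ordered = sorted(local_data)                     # local sort only
    chunk = max(1, len(ordered) // m)
    return [(s[0], s[-1], len(s))
            for s in (ordered[i:i + chunk] for i in range(0, len(ordered), chunk))]

def approx_median(all_summaries):
    """Estimate the global median from the merged per-node summaries:
    order slices by low value, accumulate counts up to rank N/2, and take
    the midpoint of the slice that reaches it."""
    merged = sorted(rec for node in all_summaries for rec in node)
    total = sum(count for _, _, count in merged)
    seen = 0
    for low, high, count in merged:
        seen += count
        if seen >= total / 2:
            return (low + high) / 2

random.seed(1)
data = [random.random() for _ in range(100_000)]
nodes = [data[i::100] for i in range(100)]           # "spray" to 100 nodes
summaries = [node_summary(nd, 100) for nd in nodes]  # 100 records per node
estimate = approx_median(summaries)
exact = sorted(data)[len(data) // 2]
print(abs(estimate - exact))                         # small: bounded by slice widths
```

Only the 10 thousand summary tuples cross node boundaries; the 100 thousand raw records never do.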

Does this make sense?
john holt
Community Advisory Board Member

Wed Mar 25, 2015 7:30 pm

Yes, it does. Thank you.

One more question, about the tolerance in the NTiles_approx() function. We have to compare our approximate solution with the exact one (my assumption is that it should be ML.FieldAggregates(..)), right?

If not, what should we compare our approximate solution against?
artkom
 

Thu Mar 26, 2015 2:03 pm

Artem,
Yes, comparing against the current exact implementation is a good idea.
john holt
Community Advisory Board Member

Thu Mar 26, 2015 3:41 pm

One thing is left unclear to me: in order to find an approximate solution with a given tolerance, we have to compare it with the exact one, which involves a global redistribution anyhow. Therefore, to find a close solution we would have to run the exact one for every new dataset. That means the performance of our approximate solution will be worse than the exact one, right? If so, what is the point of doing this?
artkom
 

Thu Mar 26, 2015 5:54 pm

Artem,
You can determine the approximate n-tile to a specified maximum error without knowing the exact error. The claim is merely that the error will be no more than X; it could be a lot less than X, or could even be zero.
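One way to see such an a-priori bound (a back-of-the-envelope check using assumed round numbers, not figures from the project spec): if each of P nodes reduces its local data to m equal-count buckets, a reported cut point can misplace at most one bucket's worth of records per other node, so the rank error is bounded before any exact answer is ever computed:

```python
P = 100            # nodes (assumed)
N_local = 1_000    # records per node (assumed)
m = 100            # summary buckets per node
c = N_local // m   # records per bucket

# Each of the other P-1 nodes can have at most one bucket straddling the
# chosen cut point, so the true rank is known to within (P-1)*c records.
max_rank_error = (P - 1) * c
total = P * N_local
print(max_rank_error, total)   # 990 of 100000, i.e. under 1%
```

The bound holds for any input data; the actual error on a particular dataset can be far smaller.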
john holt
Community Advisory Board Member

Mon Apr 13, 2015 11:48 pm

John,

Just an FYI - the exact algorithm can be sped up significantly (at least in the discretish case) by using TABLE,MERGE as a precursor.

David
dabayliss
Community Advisory Board Member
 
Posts: 109
Joined: Fri Apr 29, 2011 1:35 pm

