## Implement an approximate n-tile algorithm

Hello,

I am also interested in writing proposal for implementing an approximate n-tile algorithm. Could you, please, give more explanations about such kind of algorithm? Is it NP-hard problem? I found MAX-MIN tiling algorithm for two dimensional array. Is that what I need?

Thank you in advance.

I am also interested in writing proposal for implementing an approximate n-tile algorithm. Could you, please, give more explanations about such kind of algorithm? Is it NP-hard problem? I found MAX-MIN tiling algorithm for two dimensional array. Is that what I need?

Thank you in advance.

- artkom
**Posts:**4**Joined:**Mon Mar 23, 2015 2:54 pm

Hi,

I have alerted the mentor of this project to your request for information. Until then, there are some things you can do to get started using HPCC Systems and familiarise yourself with the ECL language as follows:

• Download the software (server, client tools, graph control ECL IDE) from here: http://hpccsystems.com/download/free-community-edition

• If you want to delve into the platform code, download the sources from here: https://github.com/hpcc-systems/HPCC-Platform. The sources for the ML library can be found here: https://github.com/hpcc-systems/ecl-ml

• Once you have an HPCC Systems environment up and running, run some simple examples. Try the HPCC Systems Data Tutorial here: http://hpccsystems.com/download/docs/da ... rial-guide or Six Degrees of Kevin Bacon here: http://hpccsystems.com/download/docs/six-degrees

• Take a basic online ECL training course here: http://hpccsystems.com/community/traini ... s/training

I hope this helps for now.

Lorraine Chapman

HPCC Systems GSoC Administrator

I have alerted the mentor of this project to your request for information. Until then, there are some things you can do to get started using HPCC Systems and familiarise yourself with the ECL language as follows:

• Download the software (server, client tools, graph control ECL IDE) from here: http://hpccsystems.com/download/free-community-edition

• If you want to delve into the platform code, download the sources from here: https://github.com/hpcc-systems/HPCC-Platform. The sources for the ML library can be found here: https://github.com/hpcc-systems/ecl-ml

• Once you have an HPCC Systems environment up and running, run some simple examples. Try the HPCC Systems Data Tutorial here: http://hpccsystems.com/download/docs/da ... rial-guide or Six Degrees of Kevin Bacon here: http://hpccsystems.com/download/docs/six-degrees

• Take a basic online ECL training course here: http://hpccsystems.com/community/traini ... s/training

I hope this helps for now.

Lorraine Chapman

HPCC Systems GSoC Administrator

- lchapman
**Posts:**52**Joined:**Wed May 11, 2011 11:49 am

The algotrithm is definitely not an NP-Compete type of problem.

There is an exact n-tile implementation in the ML.FieldAggregates attribute. The problem with this implementation is that it requires a global sort of the data. This in turn involves a complete re-distribution of the data.

An approximate solution will not require that global sort. Instead, there will be some local sorting and local data reduction; and then some distribution of the reduced dataset.

There is an exact n-tile implementation in the ML.FieldAggregates attribute. The problem with this implementation is that it requires a global sort of the data. This in turn involves a complete re-distribution of the data.

An approximate solution will not require that global sort. Instead, there will be some local sorting and local data reduction; and then some distribution of the reduced dataset.

- john holt
- Community Advisory Board Member
**Posts:**22**Joined:**Mon Jun 25, 2012 12:43 pm

What does "fixed accuracy (m = n*#nodes will produce counts that are accurate within the number of nodes)" mean?

n - Ntiles,

#nodes = number of nodes? If it is, it will be always a constant!

Why we have to do a "local data reduction"? Let's imagine, we sprayed our dataset on N nodes. We still have the same number of records, right?

n - Ntiles,

#nodes = number of nodes? If it is, it will be always a constant!

Why we have to do a "local data reduction"? Let's imagine, we sprayed our dataset on N nodes. We still have the same number of records, right?

- artkom
**Posts:**4**Joined:**Mon Mar 23, 2015 2:54 pm

Let me start with the last question first. I have sprayed 10 million record dataset to 100 nodes in a 100 node cluster. Assume that the objective is to find the 2-tile (which is the median).

Each node sees only the data on that node. Each node can determine the median independently, but the medians may all be different. If I want to know the median for the total dataset, there needs to be some information sharing.

Obviously, the records could be shared. This is effectively the current implementation approach. If I perform a local sort and some data reduction, like reducing the original set to 100 records that each describe 1% of the local data with high and low values; I can share that data to all of the nodes so every node sees the 100*100 (10 thousand) records.

Each node independently uses these 10 thousand records to approximate the global median. THe degree of accuracy will be a function of the number of nodes and number of summary records.

Does this make sense?

Each node sees only the data on that node. Each node can determine the median independently, but the medians may all be different. If I want to know the median for the total dataset, there needs to be some information sharing.

Obviously, the records could be shared. This is effectively the current implementation approach. If I perform a local sort and some data reduction, like reducing the original set to 100 records that each describe 1% of the local data with high and low values; I can share that data to all of the nodes so every node sees the 100*100 (10 thousand) records.

Each node independently uses these 10 thousand records to approximate the global median. THe degree of accuracy will be a function of the number of nodes and number of summary records.

Does this make sense?

- john holt
- Community Advisory Board Member
**Posts:**22**Joined:**Mon Jun 25, 2012 12:43 pm

Yes, it does. Thank you.

One more question about a tolerance in NTiles_approx() function. We have to compare our approximate solution with exact one (my assumption that it should be ML.FieldAggregates(..)), right?

If it's not, what we have to choose in order to compare our approximate solution with?

One more question about a tolerance in NTiles_approx() function. We have to compare our approximate solution with exact one (my assumption that it should be ML.FieldAggregates(..)), right?

If it's not, what we have to choose in order to compare our approximate solution with?

- artkom
**Posts:**4**Joined:**Mon Mar 23, 2015 2:54 pm

Artem,

Yes, a compare against the current exact implementation is a good idea.

Yes, a compare against the current exact implementation is a good idea.

- john holt
- Community Advisory Board Member
**Posts:**22**Joined:**Mon Jun 25, 2012 12:43 pm

One thing left unclear to me: in order to find approximate solution with a given tolerance we have to compare it with exact one, which involves global redistribution anyhow. Therefore, in order to find close solution we have to run exact one for every new dataset. That means that performance of our approximate solution will be less than exact one, right? If so, where is a point to do this?

- artkom
**Posts:**4**Joined:**Mon Mar 23, 2015 2:54 pm

Artem,

You can determine the approximate n-tile to a specified maximum error without knowing the exact error. The claim is merely that the error will be no more than X. It could be a lot less than X or could even be zero.

You can determine the approximate n-tile to a specified maximum error without knowing the exact error. The claim is merely that the error will be no more than X. It could be a lot less than X or could even be zero.

- john holt
- Community Advisory Board Member
**Posts:**22**Joined:**Mon Jun 25, 2012 12:43 pm

John,

Just an FYI - the exact algorithm can be sped up significantly (at least in the discretish case) - using TABLE,MERGE as a precursor

David

Just an FYI - the exact algorithm can be sped up significantly (at least in the discretish case) - using TABLE,MERGE as a precursor

David

- dabayliss
- Community Advisory Board Member
**Posts:**109**Joined:**Fri Apr 29, 2011 1:35 pm

10 posts
• Page

**1**of**1**### Who is online

Users browsing this forum: No registered users and 1 guest