

Handling Node Failure

Topics related to recommendations or questions on the design for HPCC Systems clusters

Wed Mar 30, 2016 5:38 am

Hi,

It is said that node failures in an HPCC cluster are handled by replicating data on other nodes. I assumed that even if a node goes down, the cluster will still be up.

But what I understood recently is that if a node goes down, the Thor cluster as a whole also goes down.

So my questions are,
1. Why can't the Thor cluster stay up even if a node fails?
2. What should we do if the failed node is in an unrecoverable state?

Thanks,
Ramesh
rameshpachamuthu
 
Posts: 9
Joined: Tue Dec 29, 2015 1:02 pm

Thu Mar 31, 2016 4:11 pm

Hi Ramesh,

1. Why can't the Thor cluster stay up even if a node fails?

The way I understand it, if a node drops out in the middle of a job, the job will try to complete using the replica node. After that, you would then need to replace the failed node.

2. What should we do if the failed node is in an unrecoverable state?

Because Thor is the development cluster, the best practice is to take the cluster down, replace or repair the node, and then restart the cluster. I have been told that a "hot swap" (replacing a node while the cluster is still running) can be done, but it is simply safer to stop the cluster and do the replacement then.
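For reference, a minimal sketch of that stop/replace/restart cycle, assuming a default platform install where the hpcc-init service scripts and hpcc-run.sh live in their usual locations (adjust the paths for your setup):

  # Stop all components on every node; run from the Dali/admin node
  sudo /opt/HPCCSystems/sbin/hpcc-run.sh -a hpcc-init stop

  # ... repair or replace the failed node and push the same environment.xml to it ...

  # Start everything back up
  sudo /opt/HPCCSystems/sbin/hpcc-run.sh -a hpcc-init start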

Regards,

Bob
bforeman
Community Advisory Board Member
 
Posts: 975
Joined: Wed Jun 29, 2011 7:13 pm

Tue Apr 05, 2016 9:39 am

Hi Bob,

Thanks for your response.

Actually, we had an 8-node cluster set up for learning purposes.

Recently one of the nodes went down and was not in a state to be repaired. Since it is a cluster for learning, we did not want to replace it with a new node. In the end we were left with the option of reducing the cluster to 7 nodes, so we modified environment.xml for a 7-node setup, which led to the loss of the data present in the cluster. We were fine with the data loss because it is a learning cluster.

We want to know the right approach we could have followed to avoid the data loss when repairing or replacing the failed node is ruled out.

Kindly share your thoughts.

Regards,
Ramesh
rameshpachamuthu
 
Posts: 9
Joined: Tue Dec 29, 2015 1:02 pm

Tue Apr 05, 2016 3:45 pm

Hi Ramesh,

If you want to salvage the data, it's important that you have replication enabled in the cluster configuration.

In that case, you would have the complete data set, because the missing files would be in the hpcc-mirror directory on the n+1 node.

Outside of HPCC, you could certainly "stitch" the data set back together by getting the missing file parts from the mirror.
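As a rough sketch of what that looks like on disk (the paths and the example logical file name example::mydata are assumptions based on a default install), the primary parts sit under hpcc-data on each slave and the replica of each part sits under hpcc-mirror on the next node, so the part that lived on the failed node can be pulled from its neighbour:

  # On the node after the failed one (node n+1): the replica of the lost part
  ls /var/lib/HPCCSystems/hpcc-mirror/thor/example/mydata._3_of_8

  # Copy it somewhere safe, or into the primary location on the replacement node
  scp /var/lib/HPCCSystems/hpcc-mirror/thor/example/mydata._3_of_8 \
      newnode:/var/lib/HPCCSystems/hpcc-data/thor/example/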

And before redefining the environment (in this example, from 8 nodes to 7), I would try a despray back to the landing zone to salvage the data.
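The despray can be done from ECL Watch, or from the command line with dfuplus; a minimal sketch, where the ESP address, credentials, logical file name and landing-zone path are placeholders to substitute with your own values:

  dfuplus action=despray server=http://192.168.1.10:8010 \
          username=hpccuser password=mypassword \
          srcname=example::mydata \
          dstip=192.168.1.20 dstfile=/var/lib/HPCCSystems/mydropzone/mydata.csv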

But once you redefine the cluster, the system thinks the datasets are composed of 7 file parts instead of 8 and "breaks".

So the important part is to salvage your data prior to changing the configuration.

Hope this helps!

Bob
bforeman
Community Advisory Board Member
 
Posts: 975
Joined: Wed Jun 29, 2011 7:13 pm

Thu Dec 01, 2016 10:08 am

Hi Bob,

Is it mandatory that we salvage the sprayed data manually by despraying it?
I just wanted to understand how the data gets lost: we are only changing the configuration and nothing changes in the folders, so how would the sprayed files be deleted?

hpcc-mirror:
How frequently does this mirroring happen - only during the write process? Also, why is the data stored as two copies, part 1 and part 2?

Regards Nawaz
nawazkhan
 
Posts: 9
Joined: Fri Nov 25, 2016 11:20 am

Thu Dec 01, 2016 3:49 pm

1. Why can't the Thor cluster stay up when a node fails?

All the nodes need to be up; if a node fails, the job will fail. Once the master or one of the slaves loses connection to the failed node, an **MP link closed** error will trigger a job abort.

The intent of the hpcc-mirror directory is to prevent data loss due to catastrophic RAID / disk failure.

Once the node is replaced and back online, the system will look for the file in the primary location (hpcc-data), then look for it in the replicate location (hpcc-mirror).

Note: it is recommended after such an event to run the backupnode utility to restore the data. Additionally, best practice is to have it run nightly via cron.
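A sketch of such a cron entry (the start_backupnode wrapper path, the Thor cluster name "mythor" and the log location are assumptions based on a default install, so verify them against your own system):

  # /etc/cron.d entry: run backupnode for the mythor cluster every night at 02:00
  0 2 * * * hpcc /opt/HPCCSystems/bin/start_backupnode mythor >> /var/log/HPCCSystems/backupnode.log 2>&1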

2. Is it mandatory that we salvage the sprayed data manually by despraying it?
I just wanted to understand how the data gets lost: we are only changing the configuration and nothing changes in the folders, so how would the sprayed files be deleted?

Resizing or redefining the **width** of the Thor cluster will effectively break your dataset, as it was originally defined to have 8 parts. The metadata will not know where to find the missing data.

To clean up the **bad data**, you should be able to delete it via the ECL Watch interface.

Alternatively, you can bring up a clean Dali (basically losing all the metadata about files, workunits run, etc.) by renaming or deleting the "hpcc-data/dali" directory. The data left on disk can then be deleted using the "XREF" utility in the ECL Watch interface; the parts will show up as files on disk that are not part of the metadata in the Dali system store. Or you can simply delete the hpcc-data directory on all the Thor slaves; the directories will get recreated once Thor restarts.

If you pick any of the **delete** options, the system must be down.
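A rough sketch of that clean-Dali route, assuming the default /var/lib/HPCCSystems data paths mentioned above (verify the actual paths in your environment.xml before deleting anything):

  # Stop the whole cluster first
  sudo /opt/HPCCSystems/sbin/hpcc-run.sh -a hpcc-init stop

  # On the Dali node: move the Dali store aside so Dali starts clean (all file/workunit metadata is lost)
  sudo mv /var/lib/HPCCSystems/hpcc-data/dali /var/lib/HPCCSystems/hpcc-data/dali.old

  # On each Thor slave: optionally remove the now-orphaned data; the directory is recreated on restart
  sudo rm -rf /var/lib/HPCCSystems/hpcc-data/thor

  # Bring the cluster back up
  sudo /opt/HPCCSystems/sbin/hpcc-run.sh -a hpcc-init start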

Hpcc_Mirror:

By default (and as the recommended setting), the write to the replicate location (hpcc-mirror) happens asynchronously, i.e. after the write has completed to the primary location (hpcc-data) directory.
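The switches controlling this live on the Thor component in environment.xml; a quick way to check them (the attribute names replicateOutputs and replicateAsync are what I would expect in a default configuration, so treat them as assumptions and confirm against your own file):

  # Show the Thor replication settings in the active environment (default path assumed)
  grep -o 'replicateOutputs="[^"]*"\|replicateAsync="[^"]*"' /etc/HPCCSystems/environment.xml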
fernando
 
Posts: 5
Joined: Thu Jun 19, 2014 1:29 pm

Wed Dec 07, 2016 2:14 pm

Thanks for the detailed explanation.
I have one more question.

Is there any detailed document that explains the steps below in a better way? I have referred to the reference document, but it does not explain them in much detail. Are there any criteria for deciding the number of nodes and slave nodes for Roxie and Thor?

Enter number of support nodes - what is it referring to as support components?
Number of nodes for roxie cluster -
Number of slave nodes for thor cluster -
Number of thor slaves per node -
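For context, these same prompts map to the envgen command-line parameters used to generate an environment.xml; a sketch under assumed defaults (the install path, an ips.txt listing the machines, and the node counts are all placeholders), where the support nodes are the ones hosting components such as Dali, DFU Server, ESP/ECL Watch and Sasha rather than Thor or Roxie slaves:

  # Generate an environment with 1 support node, 2 Roxie nodes, 4 Thor slave nodes, 1 slave per node
  sudo /opt/HPCCSystems/sbin/envgen -env /etc/HPCCSystems/source/new_environment.xml \
       -ipfile /home/hpcc/ips.txt \
       -supportnodes 1 -roxienodes 2 -thornodes 4 -slavesPerNode 1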

Regards Nawaz
nawazkhan
 
Posts: 9
Joined: Fri Nov 25, 2016 11:20 am

Thu Dec 22, 2016 7:03 am

Hi

Can someone help me find the existing environment details, such as how many support nodes, Thor slave nodes, and Thor slaves per node are configured?
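(For anyone looking at the same thing: the active configuration lives in /etc/HPCCSystems/environment.xml, so one rough way to inspect it is sketched below; the element and attribute names ThorSlaveProcess, RoxieServerProcess and slavesPerNode are assumptions to verify against your own file.)

  # Count Thor slave and Roxie server entries, and show slaves-per-node, in the active environment
  grep -c '<ThorSlaveProcess' /etc/HPCCSystems/environment.xml
  grep -o 'slavesPerNode="[^"]*"' /etc/HPCCSystems/environment.xml
  grep -c '<RoxieServerProcess' /etc/HPCCSystems/environment.xml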

Thanks.

Regards Nawaz
nawazkhan
 
Posts: 9
Joined: Fri Nov 25, 2016 11:20 am

