Testing Cisco Hyperflex Auto Healing

Let’s break the Hyperflex system !

In the lab there is a 5 node Hyperflex cluster up and running. Everything is healthy and the system is almost 60% full, so this is closer to a real-world test than a lab test. There are several ways to break the cluster and I chose the easiest one: just power down one random server and see what happens.

If you want to know more about the Auto Healing feature of Hyperflex :

Summary :

Auto healing kicks in 2 hours after a node goes missing from the cluster. The cluster stays Online, but gets the status “Unhealthy”. How long Auto Healing takes depends on the size of your cluster and the amount of data on it. In my case, with a cluster that is about 50% full, the repair time is about 3 hours. During these hours the system can still be used without any problems !
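The timeline from the summary can be written out as simple arithmetic. This is only a minimal sketch using the two numbers from this test (a 2 hour delay and a repair of about 3 hours). The function name and the power-off time are made up for illustration, and the repair time on your cluster will differ.

```python
from datetime import datetime, timedelta

HEAL_DELAY = timedelta(hours=2)   # wait before Auto Healing starts (from this post)
REPAIR_TIME = timedelta(hours=3)  # approximate repair time seen in this test


def healing_timeline(node_lost_at, repair_time=REPAIR_TIME):
    """Return (healing_start, healthy_again) for a node lost at node_lost_at."""
    healing_start = node_lost_at + HEAL_DELAY
    healthy_again = healing_start + repair_time
    return healing_start, healthy_again


if __name__ == "__main__":
    lost = datetime(2017, 1, 1, 9, 0)  # hypothetical power-off time
    start, done = healing_timeline(lost)
    print("Auto Healing starts:", start)
    print("Cluster healthy again (approx.):", done)
```

So with these numbers you are looking at roughly 5 hours between pulling the plug and a Healthy cluster.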

Proof that the system is Healthy :

Before I break it, let me show you that the cluster is healthy. There are about 200 VMs running on it with HDParm. See other post.

Although the UI at https://<HX Cluster IP Address>/ui is still in Tech Preview, I am using it to show some things in an easy way.

 

Hyperflex is Online AND Healthy !

I can see that none of the 5 servers has any problems.

Now that you have seen that the system is working like a normal cluster, the next step will be :

And now we break the Hyperflex Cluster :

This can be done in various ways: the nice way, or just powering the node off. Not gracefully, of course, otherwise it isn’t a test.

 

Unhealthy, isn’t it ?

Shutting the server down ungracefully makes the HX Cluster unhealthy right away.

In vCenter you can also see that the cluster is unhealthy.

Also on the Dashboard of the UI you see a big red cross which indicates that something isn’t right… right ?

The unhealthy state also shows up in the Alarms.

And we can’t get any details of the HX5 server, because it’s powered off.

In the events we see the time when the cluster became Unhealthy.

Waiting for 2 hours :

The system is still online, but unhealthy. Autohealing is turned on, but it only kicks in after the node has been missing for 2 hours. After all, the node could just be down for maintenance or something. So we will wait patiently.
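The waiting rule above can be sketched as a tiny decision function. This is a simplified model for illustration, not how Hyperflex implements it: the function name is made up, and I am assuming healing starts once the node has been gone for 2 hours or more and is skipped if the node comes back first.

```python
from datetime import timedelta

HEAL_DELAY = timedelta(hours=2)  # the 2 hour wait described above


def should_start_healing(missing_for, node_returned=False):
    """Simplified rule: heal only if the node is still gone after the delay."""
    if node_returned:
        # Assumption: a node that comes back in time (e.g. after maintenance)
        # does not trigger Auto Healing for a missing node.
        return False
    return missing_for >= HEAL_DELAY


if __name__ == "__main__":
    print(should_start_healing(timedelta(minutes=30)))         # still waiting
    print(should_start_healing(timedelta(hours=2, minutes=5)))  # healing starts
```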

Autohealing is in progress :

Yeah. You will see it’s in progress. The user won’t notice a thing. The system is up and running all the time with all the data on it.

If you click the blue “I” icon you will see : 

You can see how far along the process is. “Time remaining before current healing operations finished” is a misleading label ! It actually shows how long the system has been busy with the Auto Healing process.

 

With UCS Performance Manager I can see all the bandwidth that’s being used on the links to and from the HX systems. When Auto Healing is busy, yeah, there is some more traffic, because we copy all the blocks instead of regenerating them.

 

Cluster is healthy again :

Just sit back and be patient. The Hyperflex system can take care of itself and after some hours, the system is Healthy again. You will notice in my example that we still have 1 node failure tolerance left !

In the Events you will see the time when the system is healthy again. 

Although we do have an Online and Healthy system again, the system is pretty full, so there is no room for Auto Healing if another node fails.

The best way to solve this is to expand the cluster with another node. 
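To see why there is no room left, here is a rough back-of-the-envelope sketch. It assumes all nodes have equal capacity, that the same amount of data is simply spread over the remaining nodes, and that the cluster started at about 60% full as mentioned at the top of this post. The function names are made up, and a real cluster needs headroom well before it hits 100%, so treat this as an optimistic model.

```python
def used_percent_after(used_percent, total_nodes, active_nodes):
    """Fill level when the same data is spread over active_nodes equal nodes."""
    if active_nodes <= 0:
        raise ValueError("need at least one active node")
    return used_percent * total_nodes / active_nodes


def has_room_to_heal(used_percent, total_nodes, active_nodes):
    """True if the data would still fit after losing one more active node."""
    if active_nodes <= 1:
        return False
    return used_percent_after(used_percent, total_nodes, active_nodes - 1) < 100


if __name__ == "__main__":
    # 5 node cluster, about 60% full, one node lost and healed onto 4 nodes
    print(used_percent_after(60, 5, 4))  # 75.0
    # A second failure would need everything to fit on 3 nodes
    print(used_percent_after(60, 5, 3))  # 100.0
    print(has_room_to_heal(60, 5, 4))    # False
```

Expanding back to 5 nodes brings the fill level back to about 60% in this model, and then there is room to heal from a node failure again.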

 

 

Pretty Cool, huh ? If you have any comments or questions, please ask.
