vSAN Health Check: “After 1 additional host failure” – how is it calculated?

This Health Check is very useful because it predicts how the vSAN cluster will behave from the storage and component utilization perspective after one host fails.

I have always wondered how it is calculated.

3-node cluster

Let’s take a 3-node cluster as an example. In this case, there is no additional host on which to rebuild data, so I guess the Health Check will just decrease the size of the vsanDatastore.

My hosts have two disk groups, each with a capacity of 2×3,64 TB (TiB actually), so each host contributes 14,56 TB to the vsanDatastore.

The total raw disk capacity is 3×14,56 TB = 43,68 TB. File system overhead is 54,66 GB (it should be no more than 3% of the cluster’s capacity). The vsanDatastore capacity is therefore 43,68 TB - 54,66 GB = 43,62 TB.

When one host is offline, the capacity will be 2/3 of 43,62 TB = 29,08 TB, exactly as shown in the report below.
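The capacity arithmetic above can be sketched in a few lines of Python. This is only my reasoning written down as code, not what the Health Check actually runs. The function name is made up for illustration, and I am assuming the overhead is converted from GB to TB with a factor of 1024 (binary units). The last digit may differ from the figures in the text because of rounding.

```python
# Figures taken from the 3-node example above.
DISK_TB = 3.64          # size of one capacity disk
DISKS_PER_GROUP = 2
GROUPS_PER_HOST = 2
HOSTS = 3
OVERHEAD_GB = 54.66     # file system overhead reported for the cluster


def datastore_capacity_tb(hosts, per_host_tb, overhead_gb):
    """Raw capacity of all hosts minus file system overhead, in TB.

    Assumption: 1 TB = 1024 GB (binary units).
    """
    return hosts * per_host_tb - overhead_gb / 1024


per_host_tb = DISK_TB * DISKS_PER_GROUP * GROUPS_PER_HOST
capacity_tb = datastore_capacity_tb(HOSTS, per_host_tb, OVERHEAD_GB)

# With one host offline, only (HOSTS - 1) / HOSTS of the capacity remains.
after_failure_tb = capacity_tb * (HOSTS - 1) / HOSTS

print(f"per host:               {per_host_tb:.2f} TB")
print(f"vsanDatastore:          {capacity_tb:.2f} TB")
print(f"after one host failure: {after_failure_tb:.2f} TB")
```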

What about the data? If I put one of the hosts into maintenance mode (MM), a full data migration is not an option. I can choose Ensure Accessibility, which migrates my FTT=0 VMDKs before the host enters MM. But when the same host goes offline unexpectedly, I will lose access to those VMDKs. There will be no option to migrate or rebuild.

It looks like the second option is taken into consideration. Hosts can have slightly different disk utilisation, and we do not know which host might fail and therefore which one this Health Check should use as its example. This is a 6.7 U1 environment; in 6.7 U3 this is more straightforward, because the Health Check shows what happens when the most utilized host fails.

After one randomly chosen host fails, we will have roughly 2/3 of the data available. Out of 43,62 TB of capacity, we have 15,71 TB of free space, so 43,62 TB - 15,71 TB = 27,91 TB is used. Roughly 2/3 × 27,91 TB = 18,60 TB of data remains written.

18,60/29,08 = ~64%.
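Here is a minimal sketch of that estimate, assuming data and capacity are spread evenly across the hosts and nothing can be rebuilt. The function name is hypothetical, and this is my guess at the logic rather than the Health Check's real implementation.

```python
def usage_after_failure_no_rebuild(capacity_tb, free_tb, hosts):
    """Predicted usage when one host is lost and nothing can be rebuilt.

    Assumption: data and capacity are spread evenly, so both shrink by
    the same factor (hosts - 1) / hosts.
    """
    surviving = (hosts - 1) / hosts
    used_tb = capacity_tb - free_tb
    remaining_data_tb = used_tb * surviving
    remaining_capacity_tb = capacity_tb * surviving
    ratio = remaining_data_tb / remaining_capacity_tb
    return remaining_data_tb, remaining_capacity_tb, ratio


data_tb, cap_tb, ratio = usage_after_failure_no_rebuild(43.62, 15.71, 3)
print(f"{data_tb:.2f} TB written on {cap_tb:.2f} TB -> {ratio:.0%}")
```

With evenly spread data, the predicted ratio is the same as the current utilization (27,91/43,62), because capacity and data shrink by the same factor. If a real report shows a different figure, that would presumably come from hosts not being equally utilized.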

4+ node cluster

For a larger number of hosts, this Health Check calculates disk space utilization after a host failure and after the data has been rebuilt.

Let’s say we have a 20 TB vsanDatastore on 4 identical ESXi hosts. This means each host contributes 5 TB of storage. If one host goes offline, the vsanDatastore size will drop from 20 TB to 15 TB.

Imagine we have 14 TB of data written on those 4 hosts, roughly 3,5 TB per host. If one of the servers is offline, the others have to take over its 3,5 TB, roughly +1,17 TB each.

So after one host failure, the vsanDatastore will be 15 TB and 3 × (3,5 TB + 1,17 TB) = 14,01 TB will be physically written. Our Health Check will report 14,01/15 = 93% datastore usage.

In this case the Health Check is clearly predicting a potential issue. Any datastore usage above 90%, even a hypothetical one, always requires the admin’s attention.
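The 4-host example can be written down the same way. Again, this is a sketch of my own reasoning under the assumption of identical hosts and evenly spread data. The function name is made up, and the 90% threshold is just the rule of thumb from this post, not a value taken from vSAN itself.

```python
ATTENTION_THRESHOLD = 0.90  # rule of thumb from this post, not a vSAN setting


def usage_after_failure_with_rebuild(capacity_tb, written_tb, hosts):
    """Predicted usage after one host fails and its data is rebuilt.

    Assumption: identical hosts, data spread evenly, and the failed
    host's data is rebuilt evenly across the surviving hosts.
    """
    surviving_hosts = hosts - 1
    per_host_capacity = capacity_tb / hosts
    per_host_data = written_tb / hosts
    # Each survivor takes an equal share of the failed host's data.
    extra_per_host = per_host_data / surviving_hosts
    capacity_after = per_host_capacity * surviving_hosts
    written_after = surviving_hosts * (per_host_data + extra_per_host)
    return capacity_after, written_after, written_after / capacity_after


cap, written, ratio = usage_after_failure_with_rebuild(20, 14, 4)
needs_attention = ratio > ATTENTION_THRESHOLD
print(f"{written:.2f} TB on {cap:.2f} TB -> {ratio:.0%}, attention: {needs_attention}")
```

Without intermediate rounding the written amount stays at 14 TB, since all data is preserved after the rebuild. The 14,01 TB in the text comes from rounding the per-host share to 1,17 TB. The result is about 93% either way.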