Maintenance Mode of ESXi hosts with Zerto VRAs installed in a public cloud

For Zerto users, maintenance work on ESXi hosts has always been a challenge. Zerto’s Virtual Replication Appliances (VRAs) are pinned to their dedicated hosts, so when we put a host into Maintenance Mode, VRAs can’t be auto-evacuated. VRAs shouldn’t even be evacuated, as they can only work on their dedicated hosts. If the vSphere cluster is on-prem, we have many options to address this issue. We can power off VRAs, delete them, force-migrate them, etc. We can stop and resume replications to make sure our data is consistent.

But in a public cloud with a shared responsibility model, the cloud provider is responsible for all maintenance work on ESXi hosts, yet they don’t have access to your Zerto application. Imagine a situation where a cloud provider needs to replace or upgrade an ESXi node that is still operational and wants to evacuate this host. With a VRA pinned to the host, this evacuation won’t work for them. The cloud provider probably also won’t want to power off a VRA appliance, because they know it will break your replications. This situation can seriously delay any maintenance work on an ESXi host in a public cloud.

What can you do as a Zerto admin if you are using a public cloud as your replication target?

It turns out Zerto offers a very nice feature called Workload Automation. You can enable it in the Site Settings of Zerto Virtual Manager (ZVM).

Workload Automation can detect when a host is entering Maintenance Mode (MM) and can “evacuate” (= power off) the VRA in such situations. It can also detect when a host exits MM and bring the VRA back into an operational state. Thanks to this feature, a cloud provider can perform any maintenance work on your hosts and it won’t break your Zerto setup.

There are also other very useful options. When a new node is added to a cluster (due to an auto-scaling policy or a node replacement), Zerto will detect this and install a VRA there. When a node is removed from a cluster, Zerto will remove it from its inventory.

I ran a simple test to check how it works, using a 4-node vSAN cluster. When I put one of the hosts, esxi-793, into MM, I noticed Zerto shut down the VRA appliance on this host.

A new alert was raised in the ZVM UI console that one of its VRA appliances had been powered off.

I also noticed Zerto powered off not only the VRA but also a helper appliance: VRAH.

When I took ESXi esxi-793 out of MM, Zerto detected it correctly and powered the appliances back on.

It seems Zerto Workload Automation is a must-have option to keep ON when you are running Zerto in a public cloud and you don’t want to delay maintenance work on your hosts.

vSAN Disk Fault Injection

Remote Proof of Concept testing seems to be gaining popularity recently. The major difference between on-site and remote testing is access to the hardware, for example to test a drive unplug or a physical network failure. What I use for disk failure testing in a vSAN cluster is the vSAN Disk Fault Injection script that is available on ESXi. There is no need to download anything, it is there by default; check your /usr/lib/vmware/vsan/bin path, but use the script for POC/homelab purposes only.

We need a device id to run the script; we can test a cache or capacity drive per chosen disk group. In the example below I picked mpx.vmhba2:C0:T0:L0, which was a cache drive (Is Capacity Tier: false).

You can use esxcli vsan storage list for that:
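If you want to narrow that output down to just the device names and their tier, a simple filter should do (a minimal sketch; it assumes the busybox grep on ESXi and the Device / Is Capacity Tier field names of the esxcli vsan storage list output):

esxcli vsan storage list | grep -E "Device:|Is Capacity Tier"

Devices showing Is Capacity Tier: false are the cache drives, the rest are capacity drives.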

Or check in the vCenter console under Storage Devices:

Or under Disk Management:

python vsanDiskFaultInjection.pyc has the following options:

I am using -u for injecting a hot unplug.
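Putting it together, the hot unplug injection should look roughly like this (a hedged example: it assumes the script takes the device id via -d, and uses the cache drive picked earlier):

python /usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc -u -d mpx.vmhba2:C0:T0:L0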

/var/log/vmkernel.log is the place where you can verify the disk status:
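For example, a quick grep for the injected device should surface the relevant entries (assuming the device id picked earlier):

grep -i mpx.vmhba2:C0:T0:L0 /var/log/vmkernel.log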

vSAN-> Disk Management will also show what is going on with a disk group that faced a drive failure.

And now we can observe the status of the data and the process of resyncing objects due to “compliance”.
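If you prefer the command line over the UI, recent vSAN releases also expose the resync status via esxcli (hedged: the esxcli vsan debug namespace is available from vSAN 6.7 onwards):

esxcli vsan debug resync summary get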

After we are done with the testing, a simple rescan for new storage devices on the host will solve the issue.
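This can be done from the host’s Storage Devices view in vCenter, or with a rescan from the command line (a minimal sketch using the standard esxcli storage namespace):

esxcli storage core adapter rescan --all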

Disabling vSAN kernel module

When you work with nested vSAN homelab installations that constantly suffer power losses and network issues, you get to know tons of useful troubleshooting tricks ;-). vSAN data seems to survive all of these unexpected failures; it is just the cluster services that sometimes need a little help. But remember: feel free to explore new tricks in your homelab, but always consult Technical Support when you are not sure about the results of a command you want to use in your production environment.

Recently I ran into an issue in my lab and I wanted to see if it was vSAN-related. There is an option in ESXi to boot a host with selected modules disabled. When the host boots, you have to press Shift+O to be able to edit the boot options and disable the modules.

Here is how I disabled the vSAN module:

jumpstart.disable=vsan,lsom,plog,virsto,cmmds

And how to verify if the module is loaded:

esxcli system module list
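To check just the vSAN-related modules in that list, a simple filter helps (assuming the same module names that were disabled at boot):

esxcli system module list | grep -E "vsan|lsom|plog|virsto|cmmds"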

vCenter recognized that the host in the cluster does not have its vSAN service enabled.

How do you make the host load the vSAN module back? Simply by restarting it. Although this host did not have the vSAN module loaded at that moment, it was still in my vSAN cluster. The nice thing is that I got an additional notification from vCenter that I had a partition in my cluster before the restart. Good to know…?

esxcli vsan cluster unicastagent list

This may often happen in nested vSAN environments in our home labs. We play with networking, remove vSAN VMkernel ports, put vCenter down, remove hosts from the vSAN cluster… and there is this one step too far that results in all our objects becoming inaccessible, including vCenter. To be able to access the data (it is stored safely on the disk groups) we need to re-create our cluster again.

How can this be done without vCenter? vSAN works fine when vCenter is down, but what happens when vCenter IS actually down and the cluster is broken or needs to be reconfigured?

vSAN Health in the ESXi web interface is a good place to start assessing the “damage”. If all of the hosts are isolated, each of them will be the master of its own single-node vSAN cluster. If we do not see any other hosts in the Hosts tab, it means the host does not see any of its neighbors on the vSAN network.

What we could do next is SSH to all ESXi hosts and check the cluster status with the command: esxcli vsan cluster get.

This will confirm that hosts are isolated or will help us to determine how the cluster is partitioned.
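The fields worth checking in that output are the Local Node UUID (we will need it later when rebuilding the unicastagent lists), the Local Node State and the Sub-Cluster Member Count. A quick way to pull just those out (assuming the standard field names of esxcli vsan cluster get):

esxcli vsan cluster get | grep -E "Local Node UUID|Local Node State|Sub-Cluster Member Count"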

vmkping -I vmkX x.x.x.x will always help us check if this is a network problem of the nested hosts. In this scenario we assume the network works fine: pings are successful, but the nested hosts somehow cannot form the cluster.

It is vCenter’s role to inform hosts about their vSAN neighbors when we form the cluster, but in this case we need to do it manually.

We need to “inform” the hosts about their neighborhood (vSAN uses unicast). On the screen below we see 4 vSAN 7.0 hosts with vmk2 tagged for vSAN traffic.

Every host should have a list of the other hosts in the cluster. We can check it using esxcli vsan cluster unicastagent list.

If the cluster runs fine, this command shows the complete list of neighbors from the single host’s perspective. Here we can see esxi-13 seeing all three other hosts on the vSAN network on vmk2.

On the screen below we can see that the host esxi-10 sees only esxi-11 and esxi-12.

Assuming the network is fine and vCenter is down and won’t help us with this issue, we need to fill the gaps in the unicastagent lists manually. Just remember: never add the IP of the host whose table is being configured. Here is the command we have to use:

esxcli vsan cluster unicastagent add -t node -u <Host_UUID> -U true -a <Host_VSAN_IP> -p 12321

For esxi-10 we have to have esxi-11, 12 and 13 on the list; for esxi-11 it will be esxi-10, 12 and 13, and so on.
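As an illustration only, adding the missing esxi-13 entry to esxi-10’s list could look like this (the UUID and IP below are hypothetical placeholders; take the real values from esxi-13’s esxcli vsan cluster get output and its vSAN vmk2 configuration):

# hypothetical values - replace with the neighbor's Local Node UUID and vSAN IP
esxcli vsan cluster unicastagent add -t node -u 5e9d1a7c-1234-abcd-5678-001b21a4f0e2 -U true -a 192.168.2.13 -p 12321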

If the lists are complete, the cluster should instantly be re-formed and the objects available again. Check out the sub-cluster member count – it was 1 and now it is 4.

The cluster is formed back again and vCenter should be starting.

vSphere and vSAN 7.0 are GA!

Brand new vSphere and vSAN 7.0 binaries are available to download on my.vmware.com.

Check out the small sneak peek of 7.0, freshly installed on 4 vSAN all-flash hosts. We say goodbye to the old Flash-based web client, and we welcome VM hardware version 17 with a watchdog timer (resetting the VM if the guest OS is no longer responding) and support for Precision Time Protocol, the new, re-written workload-centric DRS with scalable shares, vSAN memory consumption dashboards and many more…