vSAN 7.0 U1 – new features spotted (part 1)

While browsing the familiar vCenter UI right after an upgrade from 6.7 U3 to 7.0 U1, I noticed some small but nice changes that I wanted to share.

RAID_D for integrity

When we put a host in Maintenance Mode with the Ensure Accessibility option, we will see a new RAID type: RAID_D. It guarantees that all new write operations have two copies of data (for data integrity). In the example below esxi-59 is in MM, so all new writes go to esxi-74 (as a backup for esxi-59) and to esxi-79. Note that the witness component in this example sits on the esxi-74 node, the same node as the RAID_D component, because this is a 3-node vSAN cluster.
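The same Maintenance Mode with Ensure Accessibility can also be entered straight from the ESXi shell, which is handy when scripting or when the UI is slow (a quick sketch using standard esxcli, run on the host that is being put in MM):

esxcli system maintenanceMode set -e true -m ensureObjectAccessibility

and to bring the host back afterwards:

esxcli system maintenanceMode set -e false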

Compression ONLY

If the workload is not dedupe-friendly but can benefit from compression, there is now an option to enable compression only on a vSAN all-flash datastore. The bonus (in comparison to dedupe & compression) is that in case of a capacity disk failure, the whole disk group will NOT be unmounted.

Datastore Sharing aka HCI Mesh

If you have two or more clusters under the same vCenter, you can mount a vSAN datastore from a remote cluster and use it as spare capacity for VMs.

IOINSIGHT

The famous VMware fling IOInsight is now integrated into the vCenter UI Performance tab. We can run an instance of it on selected target hosts and monitor the I/O performance of VMs in detail.

I/O Performance comparison between VMs

We can now pick up to 10 VMs and compare I/O performance between them for a more detailed analysis of how a certain VM behaves in comparison to others, either on a shared chart or on separate charts.

vSAN upgrade to 7.0 U1 – format change

When you upgrade your cluster to 7.0 U1, you get a new on-disk format version: 13.0. And as always, if it is an upgrade from a recent vSAN version like 6.7, no data evacuation is required; it is a metadata-only change.
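If you want to confirm the on-disk format version of each disk from the ESXi shell, esxcli can show it (I am grepping loosely here because the exact field label can vary slightly between builds):

esxcli vsan storage list | grep -i "format version"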

I was sure my update would be as quick as the previous ones, but not this time ;-). I was surprised by a new activity in the Resyncing Objects dashboard called “Format change”.

It turns out that for 7.0 U1, objects larger than 255 GB have to be rewritten to a new format. My cluster had many such objects (like VMs with 1 TB VMDKs), so it took some time to change their format.

There is also a new Skyline Health test that shows how many objects need a new format:

Why is the format change needed? It is a kind of internal optimisation that allows vSAN to change policies or rebuild with less than the 25-30% free space previously required. So slack space can be smaller from now on, which leaves more capacity for workloads.
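To put a rough number on it: on a 100 TB raw cluster, the old 25-30% rule of thumb meant keeping about 25-30 TB free at all times just for rebuilds and policy changes; whatever the new format shaves off that reserve becomes usable capacity for VMs.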

Basic HCX diagnostics

HCX is more than just one component, but the main one is called HCX Manager. It is deployed first, and it is the one you can log in to at https://FQDN_OR_IP:9443. The web UI is always a good first step in troubleshooting because you can quickly check or restart services and, most importantly, start the SSH service to get to the console.

> ccli

Welcome to HCX Central CLI

A few simple commands that you can run are:

> list

To list the connected service appliances:

> go 0

to select a specific appliance

> hc -d

to run a detailed healthcheck on the selected appliance, like this one… it is in pretty bad shape:

> ssh

to connect to the selected appliance (no username or password required) to check networking, routing etc., but also to view the logs. For the Interconnect appliance (HCX-WAN-IX), /var/log/vmware/hbrsrv.log and /var/log/vmware/mobilityagent.log are the most valuable for troubleshooting.

To leave ccli just type > exit.

On HCX Manager, the best destinations for log analysis are /common/logs/admin/app.log, /common/logs/admin/job.log and /common/logs/admin/web.log.
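A simple way to follow these logs while reproducing a failing operation is plain Linux tooling on the HCX Manager console, nothing HCX-specific:

> tail -f /common/logs/admin/app.log

and to quickly scan all three for recent errors:

> grep -i error /common/logs/admin/app.log /common/logs/admin/job.log /common/logs/admin/web.log | tail -n 50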

The most common issues that occur during setup are networking ones around the interconnect between sites: the Management Network, Uplink Network and vMotion Network.

And the HCX plugin in vCenter will show the following: tunnel status down

We can go through the very long list of required ports (https://ports.vmware.com/home/VMware-HCX), checking them with > ping and > netcat -vz.
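For example, to spot-check a single entry from that list (x.x.x.x being the peer or appliance IP; TCP checks are conclusive, UDP ones much less so, and flag support depends on the netcat build on the appliance):

> ping x.x.x.x

> netcat -vz x.x.x.x 443

for a TCP port, and

> netcat -vzu x.x.x.x 4500

for a UDP port like 4500.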

We can take a shortcut as well (not sure if this is a supported method, but I believe we are good to go if we only want to check something) and look into the HCX Mongo DB:

> mongo hybridity

> show collections

will list all the collections in the database. The collection that is worth checking is the following (from what I checked, it works on the HCX Connector / on-prem side, where you can RUN DIAGNOSTICS on the service mesh):

> db.ServiceMeshDiags.find().pretty()

Look for entries:

"message" : "Diagnostics completed. There are 7 failed probes.",
"status" : "FAILED",

------------------------------

"status" : "FAILURE",
"error" : {
"output" : "",
"message" : "Failed to reach destination"
}
}
],
"status" : "ERROR",
"message" : "HCX-NET-EXT is unable to reach HCX-NET-EXT-PEER on the ports 4500. Please ensure firewall is not blocking the ports or routing is correctly configured."

------------------------------

{
"type" : "REACHABILITY_HTTPS_CONNECT",
"source" : "x.x.x.x",
"destination" : "x.x.x.x",
"sourcePort" : 0,
"destPort" : 443,
"destType" : "HCX-WAN-IX",
"protocol" : "TCP",
"status" : "FAILURE",
"error" : {
"output" : "",
"message" : "Failed to connect to target"
},

status" : "ERROR",
"message" : "HCX is unable to reach HCX-WAN-IX on the ports 443. Please ensure firewall is not blocking the ports or routing is correctly configured."

-----------------------------

"type" : "REACHABILITY_TCP_CONNECT",
"source" : "x.x.x.x",
"destination" : "x.x.x.x",
"sourcePort" : 0,
"destPort" : 8000,
"destType" : "Deployment_HostSystem",
"protocol" : "TCP",
"status" : "FAILURE",
"error" : {
"output" : "dial tcp x.x.x.x:8000: connect: no route to host",
"message" : ""
},

This collection is a real time-saver!
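If the collection grows over time, a quick way to pull only the most recent diagnostics run is to sort by _id (standard mongo shell syntax; assuming the documents use default ObjectIds, which sort roughly by creation time):

> db.ServiceMeshDiags.find().sort({_id:-1}).limit(1).pretty()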

How to check your vSAN Health history?

Skyline/vSAN Health is the first place we go when troubleshooting a vSAN cluster. But if the problem is one that appears only from time to time, it can happen that a freshly retested vSAN Health report will not indicate any problems.

One of the simplest ways to check historical vSAN Health data is to look at your vCenter’s log file:

/var/log/vmware/vsan-health/vmware-vsan-health-summary-result.log
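To scan it for past failures you can start with less and then narrow it down with grep (the exact keywords depend on how the health states are written in your version of the log, so treat them as a starting point):

less /var/log/vmware/vsan-health/vmware-vsan-health-summary-result.log

grep -i "failed" /var/log/vmware/vsan-health/vmware-vsan-health-summary-result.log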

Is there always just one witness component for a vSAN object with mirror policy?

Some time ago we looked into a rare case where a vSAN object (VMDK) with FTT-1 mirror SPBM policy didn’t require a witness component.

So usually, with an FTT-1 mirror policy, we have two components of the object plus a witness component. What do you think happens with the witness component when an FTT-2 SPBM policy is assigned to an object instead of FTT-1? FTT-2 means our object can survive the failure of two of its components. For a VMDK object it means there will be three copies of this VMDK, and two of them can be inaccessible without affecting the VM’s I/O traffic.

For FTT-2 there will be more than just one witness component…

A VMDK will have three copies + 2 witness components. That is why the FTT-2 policy requires a minimum of 2n+1 = 5 nodes / ESXi hosts, and any two out of the 5 can be inaccessible without affecting the service.

VMDK object with SPBM: FTT-2 mirror

How about FTT-3? FTT-3 means our object can survive the failure of three of its components. For a VMDK object it means there will be four copies of this VMDK, and three of them can be inaccessible without affecting the VM’s I/O traffic.

A VMDK will have four copies + 3 witness components. That is why the FTT-3 policy requires a minimum of 2n+1 = 7 nodes / ESXi hosts, and any three out of the 7 can be inaccessible without affecting the service.

VMDK object with SPBM: FTT-3 mirror
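To sum up the mirror examples above in one place (keeping in mind the rare FTT-1 exception mentioned at the beginning of this section), for FTT = n a mirrored object has n+1 data copies and n witnesses, so it needs a minimum of 2n+1 hosts:

FTT-1 mirror: 2 copies + 1 witness = 3 components, minimum 3 hosts
FTT-2 mirror: 3 copies + 2 witnesses = 5 components, minimum 5 hosts
FTT-3 mirror: 4 copies + 3 witnesses = 7 components, minimum 7 hosts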

esxcli vsan debug object list is a command that can be used to list the components of an object; here is how it looks for a VMDK with the FTT-3 SPBM policy:

esxcli vsan debug object list command
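The full listing can get long on a busy cluster, so it helps to page through it or filter for the object you care about (a sketch; in the builds I have seen the listing includes the object path, so grepping on the VMDK name is usually enough, with my-vm.vmdk as a placeholder for your disk's name):

esxcli vsan debug object list | less

esxcli vsan debug object list | grep -A 40 "my-vm.vmdk"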