Basic HCX diagnostics

HCX is more than just one component, but the main one is called HCX Manager. It is deployed first, and it is the one you can log in to at https://FQDN_OR_IP:9443. The web UI is always the first step in troubleshooting because you can quickly check or restart services and, most importantly, start the SSH service to get to the console.

> ccli

Welcome to HCX Central CLI

A few simple commands that you can run are:

> list

to list the connected service appliances

> go 0

to select a specific appliance

> hc -d

to run a detailed health check on the selected appliance, like this one… it is in pretty bad shape:

> ssh

to connect to the selected appliance (no username or password required) to check networking, routing, etc., and also to view the logs. For the Interconnect appliance (HCX-WAN-IX), /var/log/vmware/hbrsrv.log and /var/log/vmware/mobilityagent.log are the most valuable for troubleshooting.

To leave ccli just type > exit.

On HCX Manager, the best logs to analyze are /common/logs/admin/app.log, /common/logs/admin/job.log and /common/logs/admin/web.log.

The most common issues during setup are networking ones affecting the interconnect between sites: the Management Network, the Uplink Network and the vMotion Network.

In that case the HCX Plugin in vCenter will show the tunnel status as down.

We can go through the very long list of required ports at https://ports.vmware.com/home/VMware-HCX and check them one by one with > ping and > netcat -vz.
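When there are many host/port pairs to verify, the TCP part of that check can be scripted. Below is a minimal Python sketch that behaves roughly like netcat -vz for TCP only. The function names check_tcp_port and check_many and the sample address are my own placeholders, not part of HCX, and UDP ports (such as 4500) cannot be verified with a plain connect like this.

```python
import socket


def check_tcp_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable hosts and DNS errors.
        return False


def check_many(targets, timeout=3.0):
    """Check a list of (host, port) pairs and return {(host, port): bool}."""
    return {(host, port): check_tcp_port(host, port, timeout)
            for host, port in targets}


if __name__ == "__main__":
    # Hypothetical target; replace with the addresses and ports of your sites.
    for (host, port), ok in check_many([("192.0.2.10", 443)], timeout=1.0).items():
        print(f"{host}:{port} {'open' if ok else 'unreachable'}")
```

Run it from the same network segment as the appliance you are testing, otherwise the result says little about what the appliance itself can reach.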

We can take a shortcut as well (I am not sure if this is a supported method, but I believe we are good to go as long as we only read and do not edit anything) and check the HCX MongoDB:

> mongo hybridity

> show collections

will list all the collections (the MongoDB equivalent of tables) in the database. The one that is worth checking is the following (from what I checked, it works on the HCX Cloud connector/on-prem site, where you can RUN DIAGNOSTICS on a service mesh):

> db.ServiceMeshDiags.find().pretty()

Look for entries:

"message" : "Diagnostics completed. There are 7 failed probes.",
"status" : "FAILED",

------------------------------

"status" : "FAILURE",
"error" : {
"output" : "",
"message" : "Failed to reach destination"
}
}
],
"status" : "ERROR",
"message" : "HCX-NET-EXT is unable to reach HCX-NET-EXT-PEER on the ports 4500. Please ensure firewall is not blocking the ports or routing is correctly configured."

------------------------------

{
"type" : "REACHABILITY_HTTPS_CONNECT",
"source" : "x.x.x.x",
"destination" : "x.x.x.x",
"sourcePort" : 0,
"destPort" : 443,
"destType" : "HCX-WAN-IX",
"protocol" : "TCP",
"status" : "FAILURE",
"error" : {
"output" : "",
"message" : "Failed to connect to target"
},

"status" : "ERROR",
"message" : "HCX is unable to reach HCX-WAN-IX on the ports 443. Please ensure firewall is not blocking the ports or routing is correctly configured."

-----------------------------

"type" : "REACHABILITY_TCP_CONNECT",
"source" : "x.x.x.x",
"destination" : "x.x.x.x",
"sourcePort" : 0,
"destPort" : 8000,
"destType" : "Deployment_HostSystem",
"protocol" : "TCP",
"status" : "FAILURE",
"error" : {
"output" : "dial tcp x.x.x.x:8000: connect: no route to host",
"message" : ""
},

This table is a real time-saver!
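If the output is long, the failed probes can be pulled out with a few lines of Python. This is only a sketch: it assumes the document has already been exported and loaded as a Python dict, and it relies solely on the field names visible in the excerpts above ("status", "destPort", "error" and so on). The helper names and the way the probes are nested are my assumptions, which is why the search is recursive.

```python
def find_failed_probes(node):
    """Recursively collect every dict with status FAILURE and a destPort key."""
    found = []
    if isinstance(node, dict):
        if node.get("status") == "FAILURE" and "destPort" in node:
            found.append(node)
        for value in node.values():
            found.extend(find_failed_probes(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_failed_probes(item))
    return found


def summarize(probe):
    """Build a one-line summary from the fields shown in the excerpts."""
    error = probe.get("error", {})
    # Some probes fill "message", others only "output".
    reason = error.get("message") or error.get("output") or "no details"
    return (f'{probe.get("type")} {probe.get("source")} -> '
            f'{probe.get("destination")}:{probe.get("destPort")}'
            f'/{probe.get("protocol")}: {reason}')
```

Calling summarize() on each item returned by find_failed_probes() gives one line per blocked port, which is easier to hand over to a networking team than the raw document.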

How to check your vSAN Health history?

Skyline/vSAN Health is the first place we go when we troubleshoot a vSAN cluster. But in case we have a problem that appears from time to time, it can happen that an updated vSAN Health report will not indicate any problems.

One of the simplest ways to check the historical vSAN Health data is to look in this log file on your vCenter:

/var/log/vmware/vsan-health/vmware-vsan-health-summary-result.log

Is there always just one witness component for a vSAN object with mirror policy?

Some time ago we looked into a rare case where a vSAN object (VMDK) with FTT-1 mirror SPBM policy didn’t require a witness component.

So usually with an FTT-1 mirror policy we have two components of the object and a witness component. What do you think happens with the witness component when, instead of FTT-1, an FTT-2 SPBM policy is assigned to an object? FTT-2 means our object can survive a failure of two of its components. For a VMDK object it means there will be three copies of this VMDK and two can be inaccessible without affecting the VM’s I/O traffic.

For FTT-2 there will be more than just one witness component…

A VMDK will have three copies + 2 witness components. That is why the FTT-2 policy requires a minimum of 2n+1 = 5 nodes / ESXi hosts, and any two out of the 5 can be inaccessible without affecting the service.

VMDK object with SPBM: FTT-2 mirror

How about FTT-3? FTT-3 means our object can survive a failure of its three components. For a VMDK object it means there will be four copies of this VMDK and three can be inaccessible without affecting VM’s I/O traffic.

A VMDK will have four copies + 3 witness components. That is why the FTT-3 policy requires a minimum of 2n+1 = 7 nodes / ESXi hosts, and any three out of the 7 can be inaccessible without affecting the service.

VMDK object with SPBM: FTT-3 mirror
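The pattern behind these numbers is easy to write down. The sketch below simply encodes the counts used in this post for a mirror policy with FTT = n: n+1 copies, n witness components and 2n+1 hosts. The function name mirror_layout is mine, and the witness count follows the examples here rather than any official formula.

```python
def mirror_layout(ftt):
    """Component counts for a mirror policy, following the numbers in this post."""
    if ftt < 1:
        raise ValueError("ftt must be at least 1")
    return {
        "copies": ftt + 1,         # data replicas of the VMDK
        "witnesses": ftt,          # witness components
        "min_hosts": 2 * ftt + 1,  # the 2n+1 rule
    }


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, mirror_layout(n))
```

Note that copies + witnesses always equals 2n+1, which matches the minimum host count: each component lands on a different host.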

esxcli vsan debug object list is a command that can be used to list the components of an object. Here is how it looks for a VMDK with an FTT-3 SPBM policy:

esxcli vsan debug object list command

vSAN disk removal & evacuation throughput

How quickly can vSAN evacuate data from my disk? Great question, and of course the answer is “it depends”: on disk performance, on network performance, and also on the current I/O load of the cluster.

It seems there are a lot of dependencies, but vSAN has it covered. Resync dashboards, resync throttling (use with care), the fairness scheduler, performance dashboards, pre-checks and esxtop: all of those are there to make data evacuation easier, regardless of the reason for the evacuation. It can be a migration to a larger disk, a disk group rebuild or a move from a hybrid to an all-flash cluster.

Let’s check those tools.

The fairness scheduler is a mechanism that keeps a healthy balance between frontend I/O (VM I/O) and backend I/O (policy changes, evacuations, rebalancing, repairs). It is very important because the disk group throughput has to be shared between many types of I/O traffic: on one hand we want the resync to complete as fast as possible, on the other, it cannot affect VM I/O.

vSAN can distinguish traffic types, prioritize the traffic and adapt to the current condition of the cluster.

It works like in the picture below. If there is no contention, resync I/O or VM I/O can use “full speed”. If VMs need more of the disk throughput, resync I/O is throttled for that period of time to a maximum of 20% of the disk throughput.
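To make the idea concrete, here is a toy model in Python. Only the 20% figure comes from the description above; the arbitration logic, the function name share_throughput and its parameters are my own simplification and not how vSAN is actually implemented.

```python
def share_throughput(total, vm_demand, resync_demand, resync_floor=0.20):
    """Toy model: split disk group throughput between VM I/O and resync I/O.

    All values use the same unit (for example MB/s).
    Returns a (vm_share, resync_share) tuple.
    """
    if vm_demand + resync_demand <= total:
        # No contention: both traffic types get what they ask for.
        return vm_demand, resync_demand
    # Contention: resync keeps whatever the VMs leave unused, but when the
    # VMs want everything it is squeezed down to resync_floor * total.
    resync = min(resync_demand, max(total - vm_demand, resync_floor * total))
    vm = min(vm_demand, total - resync)
    return vm, resync
```

With a total of 100, VMs asking for 30 and resync asking for 50, both get their full demand. When both ask for 100, the model gives roughly 80 to the VMs and 20 to resync.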

The pre-check evacuation option helps you determine which components of an object can potentially be affected by the disk evacuation. You can select the preferred ‘vSAN data migration’ option: a full data migration, ensuring the accessibility of the object, or no data migration at all.

The potential status of the components can vary. The pre-check will also let you know if you have enough space to move the data. A Non-Compliant status means the object will be accessible but will not be compliant with its vSAN Storage Policy, at least until the rebuild is finished.

You can use the resyncing objects dashboard to monitor the status of the vSAN background activities. For each activity you can check the reason for it. It can be decommissioning, like on the screen below, when we remove a disk and evacuate data.

The full list of the reasons (from the Resyncing objects tab):

  • Decommissioning: The component is created and is resyncing after evacuation of disk group or host, to ensure accessibility and full data evacuation.
  • Compliance: The component is created and is resyncing to repair a bad component, or after vSAN object was resized, or its policy was changed.
  • Disk evacuation: The component is being moved out because a disk is going to fail.
  • Rebalance: The component is created and is resyncing for cluster rebalancing.
  • Stale: Repairing components which are out of sync with their replicas.
  • Concatenation merge: Clean up concatenations by merging them. A concatenation can be added to support the increased size requirement when an object grows.

You can also set Resync Throttling if you wish to control the bandwidth manually. Usually, with the vSAN fairness scheduler, we do not need to modify anything, but it is good to know we can control it in case the need arises.

Frontend and backend traffic can be monitored with Performance Dashboard. It is very detailed. We can observe the traffic on the host level, disk group level or even on the disk level. All the traffic types are presented there.

If the 5-minute granularity offered by vCenter is not enough, we can always SSH to an ESXi host. esxtop + x will show near real-time vSAN stats per host. To observe rebuild traffic, we need to use option ‘f’ and select ‘D’ for ‘recovery write stats’.

vSAN Health can also be used when we evacuate data from a disk. vSAN Object Health will show us how many objects are being rebuilt. And vSAN Health is always a good place to check the overall status of the cluster.

Traffic Filtering as another great tool for remote POC

Traffic Filtering and Marking is a vSphere Distributed Switch feature that protects virtualized systems from unwanted traffic and lets you apply QoS tags to certain types of traffic.

Distributed Switch requires a vSphere Enterprise Plus license. But if you have vSAN, vDS is included in the vSAN license, so it will work with a vSphere Standard license as well.

This tool is also great for a remote POC or your home lab. It will help you to invoke a controlled vSAN host partition or isolation and learn the behavior of an HA or vSAN cluster, or verify that your HA configuration is correct. It is especially useful for testing vSAN stretched cluster behaviour.

Sometimes creating a full site partition requires involving a networking team that may not be available on demand. This tool can help a virtualization team to test various scenarios on their own.

In the example below I used two DROP rules to discard the traffic coming to esx12 and esx13 in my 4-host cluster.

This way I created three cluster partitions: esx12, esx13 and the rest of the cluster (esx10 and esx11).
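The resulting partitions can be predicted before touching the switch. The sketch below is a simplified model I wrote for illustration: it assumes every host with a DROP rule on its incoming traffic ends up alone, while all remaining hosts stay in one partition. The function name partitions is hypothetical and real partitioning depends on the exact rules you define.

```python
def partitions(hosts, dropped):
    """Hosts with a DROP rule end up alone; the remaining hosts stay together."""
    connected = sorted(h for h in hosts if h not in dropped)
    isolated = [[h] for h in sorted(hosts) if h in dropped]
    return ([connected] if connected else []) + isolated


if __name__ == "__main__":
    cluster = ["esx10", "esx11", "esx12", "esx13"]
    for group in partitions(cluster, {"esx12", "esx13"}):
        print(group)
```

For the 4-host cluster above it returns three groups: esx10 with esx11, then esx12 and esx13 on their own.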

Right after I disabled the rules, the cluster got back to normal.