How to run basic performance tests for HCX uplink interface

I believe the built-in HCX perftest tool should be used for every freshly deployed HCX Service Mesh before we start migrating VMs between sites. Although the test is just a benchmark (it uses iperf3 and is single-threaded), it gives us an idea of how fast VM migration will be and what can be expected in production. Testing with the HCX perftest tool is easier than with native iperf3 because we don’t have to provide or remember the IP addresses of the appliances on-prem and in the cloud ;-).

To start the test, we have to SSH to the HCX Manager as admin and select the IX appliance we want to test:

>ccli

>list

> go x -> select your service mesh appliance

> perftest -> to check available options:

Available Commands:
  all           perftest uplink, ipsec, wanopt and site in one command
  ipsec         iperf3 perf testing against ipsec tunnels
  perf          iperf3 perf testing
  reachability  Ping remote peers to test reachability.
  site          iperf3 perf testing between sites
  status        Query the test status.
  uplink        iperf3 perf testing against uplink
  wanopt        tcpperf testing against WANOPT tunnels

Available flags are:

Flags:
  -h, --help                help for uplink
  -i, --interval uint32     Interval in second to report. Default is 1 second. (default 1)
  -m, --msgsize uint32      TCP maximum segment size to send.
  -P, --parallel uint32     Number of parallel streams. Default is 1. (default 1)
  -p, --port uint32         Listen port on server side. Default is 4500. -p 22 also allowed. (default 4500)
  -T, --runtimeout uint32   Individual test duration in second. Default is 1 minute. (default 60)
  -t, --timeout uint32      Total timeout in seconds. Default 10 min. (default 600)
  -v, --verbose             Show details during testing if set
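
Just to illustrate combining these flags (the values below are arbitrary, not a recommendation): a 2-minute uplink run with four parallel streams and a 5-second reporting interval would be started like this:

> perftest uplink -P 4 -T 120 -i 5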

PERFTEST SITE: GENERAL TUNNEL CHECK

>perftest site
++++++++++ StartTest ++++++++++

---------- Site-0 [192.0.2.33 >>> 192.0.2.34] ----------
Duration Transfer Bandwidth Retransmit
server workload started
[ 4] 0.00-30.00 sec 13.8 GBytes 3.96 Gbits/sec 365 sender
[ 4] 0.00-30.00 sec 13.8 GBytes 3.95 Gbits/sec receiver
Done

---------- Site-0 [192.0.2.33 <<< 192.0.2.34] ----------
Duration Transfer Bandwidth Retransmit
[ 4] 0.00-30.00 sec 14.8 GBytes 4.24 Gbits/sec 167 sender
[ 4] 0.00-30.00 sec 14.8 GBytes 4.23 Gbits/sec receiver
Done

The native iperf3 commands used for this test with default values:

iperf3 -c 192.0.2.34 -i 1 -p 9000 -P 1 -t 30

iperf3 -s -p 9000 -B 192.0.2.33

PERFTEST IPSEC: TEST INSIDE IPSEC

> perftest ipsec
++++++++++ StartTest ++++++++++

---------- Ipsec-0 [t_0, 192.0.2.37 >>> 192.0.2.45] ----------
Duration Transfer Bandwidth Retransmit
server workload started
[ 4] 0.00-30.00 sec 3.40 GBytes 973 Mbits/sec 0 sender
[ 4] 0.00-30.00 sec 3.39 GBytes 972 Mbits/sec receiver
Done

---------- Ipsec-0 [t_0, 192.0.2.37 <<< 192.0.2.45] ----------
Duration Transfer Bandwidth Retransmit
[ 4] 0.00-30.00 sec 3.40 GBytes 974 Mbits/sec 0 sender
[ 4] 0.00-30.00 sec 3.40 GBytes 973 Mbits/sec receiver
Done

---------- Ipsec-1 [t_1, 192.0.2.38 >>> 192.0.2.46] ----------
Duration Transfer Bandwidth Retransmit
server workload started
[ 4] 0.00-30.00 sec 3.40 GBytes 973 Mbits/sec 0 sender
[ 4] 0.00-30.00 sec 3.40 GBytes 973 Mbits/sec receiver
Done

---------- Ipsec-1 [t_1, 192.0.2.38 <<< 192.0.2.46] ----------
Duration Transfer Bandwidth Retransmit
[ 4] 0.00-30.00 sec 3.40 GBytes 974 Mbits/sec 0 sender
[ 4] 0.00-30.00 sec 3.40 GBytes 973 Mbits/sec receiver
Done

---------- Ipsec-2 [t_2, 192.0.2.39 >>> 192.0.2.47] ----------
Duration Transfer Bandwidth Retransmit
server workload started
[ 4] 0.00-30.00 sec 3.39 GBytes 971 Mbits/sec 0 sender
[ 4] 0.00-30.00 sec 3.39 GBytes 970 Mbits/sec receiver
Done

---------- Ipsec-2 [t_2, 192.0.2.39 <<< 192.0.2.47] ----------
Duration Transfer Bandwidth Retransmit
[ 4] 0.00-30.00 sec 3.39 GBytes 971 Mbits/sec 1181 sender
[ 4] 0.00-30.00 sec 3.39 GBytes 970 Mbits/sec receiver
Done

The native iperf3 commands used for this test with default values:

iperf3 -c 192.0.2.45 -i 1 -p 9000 -P 1 -t 30

iperf3 -s -p 9000 -B 192.0.2.37

PERFTEST UPLINK: UPLINK INTERFACE CHECK

> perftest uplink

Testing uplink reachability…
Uplink-0 round trip time:
rtt min/avg/max/mdev = 66.734/67.081/68.135/0.578 ms

Uplink native throughput test is initiated from LOCAL site.
++++++++++ StartTest ++++++++++

---------- Uplink-0 [te_0, a.a.a.a >>> b.b.b.b] ----------
Duration Transfer Bandwidth Retransmit
server workload started
[ 4] 0.00-60.00 sec 5.20 GBytes 745 Mbits/sec 5116 sender
[ 4] 0.00-60.00 sec 5.20 GBytes 744 Mbits/sec receiver
Done
---------- Uplink-0 [te_0, a.a.a.a <<< b.b.b.b] ----------
Duration Transfer Bandwidth Retransmit
server workload started
[ 4] 0.00-60.00 sec 4.55 GBytes 652 Mbits/sec 6961 sender
[ 4] 0.00-60.00 sec 4.55 GBytes 651 Mbits/sec receiver
Done

The native iperf3 command used for this test with default values:

iperf3 -c a.a.a.a -i 1 -p 4500 -P 1 -B b.b.b.b -t 60

Keep in mind that this is the only test that uses TCP port 4500 by default. If you have only UDP port 4500 open (which is the standard HCX Uplink requirement), the test will fail and you will probably see something like this:

"Command error occurs: Error calling peer [a.a.a.a.a:9445]: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp b.b.b.b:9445: connect: connection refused"

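Before running perftest uplink you can quickly verify up front whether the peer accepts TCP on that port, using the same netcat check mentioned in the diagnostics section below (b.b.b.b being the remote uplink address placeholder):

> netcat -vz b.b.b.b 4500
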
PERFTEST ALL: ALL TESTS COMBINED

This test will run iperf3 for uplink, ipsec, wanopt and site.

>perftest all
========== PERFTEST ALL STARTING ==========
== WanOpt is Present ==
== TOTAL # of TESTs : 11 ==
== ESTIMATED TEST DURATION : 12 minutes ==
-T option to change individual test duration [default 60 sec]
-k option to skip 'perftest uplink' if tcp port 4500 or 22 not opened
== Are you ready to start ?? [y/n]:
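
So, for example, to shorten each individual run and skip the uplink test when TCP 4500/22 is closed (assuming -k is a plain switch, which is all the prompt above tells us), you could start it like this:

> perftest all -T 30 -k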

USEFUL FLAGS

You can use more parallel streams (-P) to saturate the pipe, but keep in mind the test still runs as a single thread.

>perftest site -P 2
++++++++++ StartTest ++++++++++

---------- Site-0 [ 192.0.2.33 >>> 192.0.2.34] ----------
Duration Transfer Bandwidth Retransmit
server workload started
[ 4] 0.00-60.00 sec 16.8 GBytes 2.40 Gbits/sec 1498 sender
[ 4] 0.00-60.00 sec 16.8 GBytes 2.40 Gbits/sec receiver
[ 6] 0.00-60.00 sec 16.4 GBytes 2.35 Gbits/sec 1815 sender
[ 6] 0.00-60.00 sec 16.4 GBytes 2.35 Gbits/sec receiver
[SUM] 0.00-60.00 sec 33.2 GBytes 4.76 Gbits/sec 3313 sender
[SUM] 0.00-60.00 sec 33.2 GBytes 4.75 Gbits/sec receiver
Done
---------- Site-0 [ 192.0.2.33 <<< 192.0.2.34] ----------
Duration Transfer Bandwidth Retransmit
[ 4] 0.00-60.00 sec 19.0 GBytes 2.72 Gbits/sec 937 sender
[ 4] 0.00-60.00 sec 19.0 GBytes 2.72 Gbits/sec receiver
[ 6] 0.00-60.00 sec 19.5 GBytes 2.80 Gbits/sec 806 sender
[ 6] 0.00-60.00 sec 19.5 GBytes 2.79 Gbits/sec receiver
[SUM] 0.00-60.00 sec 38.5 GBytes 5.52 Gbits/sec 1743 sender
[SUM] 0.00-60.00 sec 38.5 GBytes 5.51 Gbits/sec receiver
Done

>perftest site -P 4
++++++++++ StartTest ++++++++++

---------- Site-0 [ 192.0.2.33 >>> 192.0.2.34] ----------
Duration Transfer Bandwidth Retransmit
server workload started
[ 4] 0.00-60.00 sec 9.22 GBytes 1.32 Gbits/sec 2108 sender
[ 4] 0.00-60.00 sec 9.21 GBytes 1.32 Gbits/sec receiver
[ 6] 0.00-60.00 sec 9.13 GBytes 1.31 Gbits/sec 2194 sender
[ 6] 0.00-60.00 sec 9.12 GBytes 1.31 Gbits/sec receiver
[ 8] 0.00-60.00 sec 9.20 GBytes 1.32 Gbits/sec 2288 sender
[ 8] 0.00-60.00 sec 9.19 GBytes 1.32 Gbits/sec receiver
[ 10] 0.00-60.00 sec 8.71 GBytes 1.25 Gbits/sec 2396 sender
[ 10] 0.00-60.00 sec 8.70 GBytes 1.25 Gbits/sec receiver
[SUM] 0.00-60.00 sec 36.3 GBytes 5.19 Gbits/sec 8986 sender
[SUM] 0.00-60.00 sec 36.2 GBytes 5.19 Gbits/sec receiver
Done
---------- Site-0 [ 192.0.2.33 <<< 192.0.2.34] ----------
Duration Transfer Bandwidth Retransmit
[ 4] 0.00-60.00 sec 10.2 GBytes 1.45 Gbits/sec 2071 sender
[ 4] 0.00-60.00 sec 10.1 GBytes 1.45 Gbits/sec receiver
[ 6] 0.00-60.00 sec 10.0 GBytes 1.43 Gbits/sec 1932 sender
[ 6] 0.00-60.00 sec 10.0 GBytes 1.43 Gbits/sec receiver
[ 8] 0.00-60.00 sec 10.2 GBytes 1.47 Gbits/sec 2149 sender
[ 8] 0.00-60.00 sec 10.2 GBytes 1.47 Gbits/sec receiver
[ 10] 0.00-60.00 sec 10.3 GBytes 1.47 Gbits/sec 2366 sender
[ 10] 0.00-60.00 sec 10.3 GBytes 1.47 Gbits/sec receiver
[SUM] 0.00-60.00 sec 40.7 GBytes 5.83 Gbits/sec 8518 sender
[SUM] 0.00-60.00 sec 40.7 GBytes 5.82 Gbits/sec receiver
Done

You can change the segment size (-m sets the TCP maximum segment size, as listed in the flags above) to find the value that works best and to identify any MTU mismatch issues. You can also modify the MTU settings in the HCX Network Profile used for the Uplink profile.

> perftest site -m 1390
++++++++++ StartTest ++++++++++

---------- Site-0 [ 192.0.2.33 >>> 192.0.2.34] ----------
Duration Transfer Bandwidth Retransmit
server workload started
[ 4] 0.00-60.00 sec 30.6 GBytes 4.37 Gbits/sec 518 sender
[ 4] 0.00-60.00 sec 30.5 GBytes 4.37 Gbits/sec receiver
Done
---------- Site-0 [192.0.2.33 <<< 192.0.2.34] ----------
Duration Transfer Bandwidth Retransmit
[ 4] 0.00-60.00 sec 31.1 GBytes 4.46 Gbits/sec 270 sender
[ 4] 0.00-60.00 sec 31.1 GBytes 4.45 Gbits/sec receiver
Done

> perftest site -m 9000
++++++++++ StartTest ++++++++++

---------- Site-0 [ 192.0.2.33 >>> 192.0.2.34] ----------
Duration Transfer Bandwidth Retransmit
server workload started
[ 4] 0.00-60.00 sec 29.4 GBytes 4.21 Gbits/sec 341 sender
[ 4] 0.00-60.00 sec 29.4 GBytes 4.20 Gbits/sec receiver
Done
---------- Site-0 [ 192.0.2.33 <<< 192.0.2.34] ----------
Duration Transfer Bandwidth Retransmit
[ 4] 0.00-60.00 sec 29.3 GBytes 4.19 Gbits/sec 307 sender
[ 4] 0.00-60.00 sec 29.2 GBytes 4.19 Gbits/sec receiver
Done

Extending a network with HCX Network Extension

Network Extension (NE) is an HCX service mesh appliance that extends an L2 network between two sites. It is used to provide network accessibility when migrating VMs between sites. The most popular use case is to use NE when migrating VMs (via HCX or other methods) from an on-prem site to the cloud and back. It is also a little bit overused: because the configuration is so easy and fast, we may want it to stay there forever ;-). If that is the case, it is worth mentioning that the Mobility Optimised Networking (MON) feature of NE would be needed for latency-sensitive production workloads. MON provides routing based on the locality of the source and destination VMs and prevents L2 extension tromboning: with MON, a VM in site B (remote) can communicate with VMs in other segments without its traffic hairpinning back to site A, where its gateway is located.

For my step-by-step demo I am using two locations: site A (on-prem), where the network segment aga_test 10.99.99.1/24 is originally configured, and site B (cloud), where aga_test will be extended. Site A uses NSX-T and DHCP is configured for my segment, but NSX-T is not required; it can be any vSphere Distributed Switch VLAN/tagged network.

HCX-5 (site A, connector role) and HCX-1 (site B, manager role) are paired, and NE service mesh appliances are deployed in both locations. The NEs create an unmanaged Encrypted Transport Tunnel between the sites over the network link defined in the Uplink Network Profile.

The goal is to enable L2 communication between vm1 in site A and vm2 in site B. Bonus points for making DHCP work on the extended network.

1. aga_test is an NSX-T 3.0 segment: 10.99.99.1/24 with DHCP enabled.
2. An HCX service mesh with a Network Extension appliance is deployed between hcx-5 (site A) and hcx-1 (site B).
3. Once the NE appliance is deployed, we can create a Network Extension. Take a look at the description, “the default gateway for the network extension only exist at the origin site”; that is why MON may be useful.
4. We pick a network to extend from the list: aga_test.
5. This is the moment when we can enable MON (it is included in the HCX Enterprise license). We provide the gateway address and the NE appliance that we want to use.
6. The network extension is ready in just a few minutes.
7. The Service Mesh view provides more details on the extended network: L2E_aga_test.
8. vCenter on site B shows the extended network L2E_aga_test in the Network tab.
9. The extended segment is visible in the Segments view in NSX-T on site B. The default Segment Security profile doesn’t allow DHCP, so for L2E_aga_test it has to be allowed.
10. Creating a DHCP_Allow_Sec profile that allows VMs on the extended network to receive DHCP traffic.
11. vm1 is deployed on site A in the aga_test network and has the address 10.99.99.107.
12. vm2 is deployed on site B in the L2E_aga_test extended network and got the address 10.99.99.131.
13. vm1 pinging vm2 and vm2 pinging vm1 (see the example after this list).
14. The connectivity between vm1 and vm2 can also be verified using the NSX-T Traceflow feature.
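
For reference, the connectivity check in step 13 is just a plain ping in both directions (assuming Linux guests; the addresses are the ones my VMs received):

vm1$ ping -c 4 10.99.99.131
vm2$ ping -c 4 10.99.99.107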

Basic HCX diagnostics

HCX is more than just one component, but the main one is called HCX Manager. It is deployed first and it is the one you can log in to at https://FQDN_OR_IP:9443. The web UI is always a good first step in troubleshooting because you can quickly check or restart services and, most importantly, start the SSH service to get to the console.

> ccli

Welcome to HCX Central CLI

A few simple commands that you can run are:

> list

To list the connected service appliances:

> go 0

to select a specific appliance

> hc -d

to run a detailed health check on the selected appliance (the one in my lab was in a pretty bad shape).

> ssh

to connect to the selected appliance (no username or password required) to check networking, routing, etc., but also to view the logs. For the Interconnect appliance (HCX-WAN-IX), /var/log/vmware/hbrsrv.log and /var/log/vmware/mobilityagent.log are the most valuable in troubleshooting.

To leave ccli just type > exit.

On the HCX Manager the best destinations for log analysis are: /common/logs/admin/app.log, /common/logs/admin/job.log and /common/logs/admin/web.log.
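
A quick way to watch these logs while reproducing a problem is plain tail/grep from the Manager shell (standard Linux tools, nothing HCX-specific assumed):

tail -f /common/logs/admin/app.log
grep -i error /common/logs/admin/job.log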

The most common issues that occur during setup are networking ones around the interconnect between sites: the Management Network, the Uplink Network and the vMotion Network.

And the HCX plugin in vCenter will show the following: tunnel status down.

We can go through the very long list of required ports (https://ports.vmware.com/home/VMware-HCX), checking them with > ping and > netcat -vz.
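
For example, a quick reachability check towards a peer appliance on TCP 443 (x.x.x.x is a placeholder, as in the diagnostics output below) looks like this:

> ping x.x.x.x
> netcat -vz x.x.x.x 443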

We can take a shortcut as well (not sure if this is a supported method, but I believe we are good to go as long as we only look things up) and check the HCX Mongo DB:

> mongo hybridity

> show collections

will list all the collections in the database. The collection worth checking is the following (from what I checked, it works on the HCX Connector/on-prem site, where you can RUN DIAGNOSTICS on the service mesh):

> db.ServiceMeshDiags.find().pretty()
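
If the output gets long, you can try filtering on the top-level status field; this assumes the document layout shown in the samples below and is untested on other versions:

> db.ServiceMeshDiags.find({ "status" : "FAILED" }).pretty()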

Look for entries:

"message" : "Diagnostics completed. There are 7 failed probes.",
"status" : "FAILED",

------------------------------

"status" : "FAILURE",
"error" : {
"output" : "",
"message" : "Failed to reach destination"
}
}
],
"status" : "ERROR",
"message" : "HCX-NET-EXT is unable to reach HCX-NET-EXT-PEER on the ports 4500. Please ensure firewall is not blocking the ports or routing is correctly configured."

------------------------------

{
"type" : "REACHABILITY_HTTPS_CONNECT",
"source" : "x.x.x.x",
"destination" : "x.x.x.x",
"sourcePort" : 0,
"destPort" : 443,
"destType" : "HCX-WAN-IX",
"protocol" : "TCP",
"status" : "FAILURE",
"error" : {
"output" : "",
"message" : "Failed to connect to target"
},

"status" : "ERROR",
"message" : "HCX is unable to reach HCX-WAN-IX on the ports 443. Please ensure firewall is not blocking the ports or routing is correctly configured."

-----------------------------

"type" : "REACHABILITY_TCP_CONNECT",
"source" : "x.x.x.x",
"destination" : "x.x.x.x",
"sourcePort" : 0,
"destPort" : 8000,
"destType" : "Deployment_HostSystem",
"protocol" : "TCP",
"status" : "FAILURE",
"error" : {
"output" : "dial tcp x.x.x.x:8000: connect: no route to host",
"message" : ""
},

This collection is a real time-saver!