Refer to the Index page for the basic principles, HCX command-line basics, and links to the other parts of the series.
This part of the troubleshooting series is divided into the following sections.
1. Network Profiles
Refer to the user guide for HCX Network Profile concepts and considerations. Below are some issues typically seen during network profile creation, with potential fixes.
Issue The Management and vMotion portgroups do not show up in the list of networks available for the network profile
Resolution The Management and vMotion portgroups are created as ‘VMkernel’-type portgroups, which cannot be used for virtual machines (HCX appliances are deployed as VMs). Create a new Virtual Machine portgroup under the same virtual switch. This is a simple step and does not need any additional changes.
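If the hosts use a standard vSwitch, the new portgroup can also be created from the ESXi shell. A minimal sketch, assuming a standard vSwitch named vSwitch0 and hypothetical portgroup/VLAN values (on a Distributed Switch, create the portgroup in vCenter instead):

```shell
# Hypothetical names: vSwitch0 is the existing switch backing the VMkernel
# portgroups; HCX-Mgmt-VM is the new Virtual Machine portgroup for appliances.
esxcli network vswitch standard portgroup add \
    --vswitch-name=vSwitch0 --portgroup-name=HCX-Mgmt-VM
# Tag it with the same VLAN as the management VMkernel portgroup (example: 100)
esxcli network vswitch standard portgroup set \
    --portgroup-name=HCX-Mgmt-VM --vlan-id=100
```

Repeat on every host in the deployment cluster (or use Host Profiles) so the portgroup is available wherever the appliances may be placed.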

2. Compute Profiles
Compute profiles are generally straightforward and rarely cause issues during creation. The key aspect is to correctly define the ‘Service Cluster’ that contains the target workloads to be migrated, and the ‘Deployment Cluster’ in which the Service Mesh appliances will be deployed. This is well explained in the User Guide.
3. Service Mesh Creation
Service Mesh creation is when the Interconnect (migration), WAN Optimization and Network Extension appliances are deployed. The images are downloaded from the HCX image depot on the Internet. The traffic flow involved in downloading the images is shown below –

Possible Causes If the download itself fails, check whether the HCX Connector has Internet access (outbound HTTPS), either directly or via a proxy.
The mesh creation process also involves importing the images into vSphere, which requires access to the ESXi management IPs. In short, the internal (green) flows should be possible from the HCX Connector.

If there are issues during this step, confirm that the ESXi management IP is reachable from the Connector:
[admin@hcx-manager-enterprise ~]$ telnet esx-11 443
Trying 10.yy.xx.11...
Connected to esx-11.
Escape character is '^]'.
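If the telnet client is not available, a plain bash TCP check works as well. A small sketch (the hostname and port simply mirror the telnet example above):

```shell
# check_port: open a TCP connection using bash's /dev/tcp pseudo-device.
# No data is sent; success only means the port accepted the connection.
check_port() {
  timeout 3 bash -c ">/dev/tcp/$1/$2" 2>/dev/null \
    && echo "$1:$2 open" || echo "$1:$2 closed/unreachable"
}
check_port esx-11 443
```

The same helper can be used for any of the flows in the diagram, e.g. outbound 443 to the HCX image depot.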
Symptoms Even when the port is open and the HCX Connector is in the same subnet as the management hosts, the service mesh deployment can still fail. A few examples below.
“Service Mesh creation failed. Deploy and Configuration of Interconnect Appliances Failed. Interconnect Service Workflow DeployAppliance failed. Error: [“Interconnect Service Workflow OvfUpload failed”]“
“Service Mesh creation failed. Interconnect Service Workflow GenerateAndPostConfig failed. Error: [“File Upload is unsuccessful”,”File Upload is unsuccessful”]“
Cause The primary reason for these errors is that the HCX Connector is configured to use a proxy for all HTTPS connections and therefore goes through the proxy for local traffic as well.
Resolution Add proxy exclusions as described in Part 1 of the series.
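The exclusions themselves are configured in the HCX appliance admin UI (the :9443 interface), not from a shell. Purely as an illustration (this is not HCX code), the sketch below mimics common no_proxy-style matching – exact hostnames/IPs plus leading-dot domain suffixes – to show why every internal host must be covered by an exclusion entry:

```shell
# bypasses_proxy HOST EXCLUSIONS
# Returns success if HOST matches an entry in the comma-separated exclusion
# list: either exactly, or as a domain suffix for entries starting with '.'.
bypasses_proxy() {
  local host=$1 exclusions=$2 e list
  IFS=',' read -ra list <<< "$exclusions"
  for e in "${list[@]}"; do
    # exact match, or suffix match for leading-dot entries
    if [[ $host == "$e" ]] || [[ $e == .* && $host == *"$e" ]]; then
      return 0
    fi
  done
  return 1
}
bypasses_proxy esx-11.corp.local ".corp.local" && echo direct || echo "via proxy"
```

With no matching entry, the ESXi upload traffic is sent to the proxy, which typically cannot reach the internal management network – hence the OvfUpload/File Upload failures above.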
There can be additional errors for which KB articles are available:
Refer to KB79003 for “Mobility Agent deployment fails with vCenter certificate management set to ‘custom’ mode”
Refer to KB83303 for “Failed to authenticate with the guest operating system using the supplied credentials”
4. HCX Tunnels
Below is the traffic flow for HCX tunnels

Get the local and remote IPs from the Service Mesh > Appliances page to check the firewall logs. Appliances are deployed in pairs and connect with each other; for example, IX Initiator 1 (I1) on-prem connects to IX Receiver 1 (R1) in the cloud.

Symptoms Appliances in the mesh are deployed, but the tunnels don’t come up.
Possible Causes The number one reason for HCX tunnels not coming up is security policy: the firewall rules are either implemented incorrectly, or the traffic is not allowed at all.
How to troubleshoot? The HCX IX and NE appliances continuously send IPsec traffic on UDP port 4500, and this should be visible in the firewall logs. If the firewall does not see this traffic, check the routing.
In a working system, requests and responses are seen from both the local and remote IPs. In the example below, no response is seen from the remote IP.
[admin@hcx-manager-enterprise ~]$ ccli
Welcome to HCX Central CLI
[admin@hcx-manager-enterprise] list
|----------------------------------------------------------------------------------|
| Id | Node | Address | State | Selected |
|----------------------------------------------------------------------------------|
| 0 | OnPrem-AVS-IX-I1 | 10.xx.yy.50:9443 | Connected |
| 1 | OnPrem-AVS-NE-I1 | 10.xx.yy.54:9443 | Connected |
| 2 | OnPrem-Test-IX-I1 | 10.xx.yy.52:9443 | Connected |
| 3 | OnPrem-Test-NE-I1 | 10.xx.yy.53:9443 | Connected |
[admin@hcx-manager-enterprise] go 2
Switched to node 2.
[admin@hcx-manager-enterprise] ssh
Welcome to HCX Central CLI
[root@OnPrem-Test-IX-I1 ~]# tcpdump -c 10 -ni vNic_0 udp and port 4500
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vNic_0, link-type EN10MB (Ethernet), capture size 262144 bytes
08:48:48.126597 IP 10.xx.yy.52.4500 > 10.xx.yy.36.4500: UDP-encap: ESP(spi=0x45000054,seq=0x1ee04000), length 84
..
08:48:54.648210 IP 10.xx.yy.52.4500 > 10.xx.yy.36.4500: UDP-encap: ESP(spi=0x45000054,seq=0x30814000), length 84
08:48:55.148946 IP 10.xx.yy.52.4500 > 10.xx.yy.36.4500: UDP-encap: ESP(spi=0x45000054,seq=0x311c4000), length 84
08:48:55.649733 IP 10.xx.yy.52.4500 > 10.xx.yy.36.4500: UDP-encap: ESP(spi=0x45000054,seq=0x31b84000), length 84
10 packets captured
10 packets received by filter
0 packets dropped by kernel
[root@OnPrem-Test-IX-I1 ~]# exit
logout
Communication is seen both ways in a working system –
[admin@hcx-manager-enterprise ~]$ ccli
Welcome to HCX Central CLI
[admin@hcx-manager-enterprise] go 0
Switched to node 0.
[admin@hcx-manager-enterprise] ssh
Welcome to HCX Central CLI
[root@OnPrem-AVS-IX-I1 ~]# tcpdump -c 10 -ni vNic_0 udp and port 4500
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vNic_0, link-type EN10MB (Ethernet), capture size 262144 bytes
08:53:59.829110 IP 10.xx.yy.2.4500 > 10.xx.yy.50.4500: UDP-encap: ESP(spi=0x4500009c,seq=0x52180000), length 156
08:54:00.503192 IP 10.xx.yy.2.4500 > 10.xx.yy.50.4500: UDP-encap: ESP(spi=0x4500009c,seq=0x54090000), length 156
08:54:00.503428 IP 10.xx.yy.50.4500 > 10.xx.yy.2.4500: UDP-encap: ESP(spi=0x4500009c,seq=0xa03c0000), length 156
08:54:00.503195 IP 10.xx.yy.2.4500 > 10.xx.yy.50.4500: UDP-encap: ESP(spi=0x4500009c,seq=0x1be0000), length 156
..
08:54:00.545396 IP 10.xx.yy.50.4500 > 10.xx.yy.2.4500: UDP-encap: ESP(spi=0x4500009c,seq=0x3ca60000), length 156
10 packets captured
10 packets received by filter
0 packets dropped by kernel
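A quick way to compare the two captures is to count packets per source IP in the tcpdump output. This small helper is a sketch (not an HCX tool); pipe tcpdump's line output into it with the local and remote tunnel IPs:

```shell
# count_directions LOCAL_IP REMOTE_IP
# Reads tcpdump one-line output on stdin; field 3 is "src-ip.port".
# tx = packets sourced by the local IP, rx = packets sourced by the remote IP.
# The trailing "." prevents one IP matching as a prefix of another.
count_directions() {
  awk -v l="$1" -v r="$2" '
    index($3, l".") == 1 { tx++ }
    index($3, r".") == 1 { rx++ }
    END { printf "tx=%d rx=%d\n", tx, rx }'
}
# Example (on the IX appliance) – rx=0 indicates one-way traffic:
# tcpdump -c 20 -lni vNic_0 udp and port 4500 | count_directions 10.xx.yy.52 10.xx.yy.36
```

rx staying at 0 means the remote side's packets never arrive, which points at the firewall or the return routing rather than the appliance itself.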
Firewall quirks A few customers had to change the firewall policy from an application-based rule (‘IPsec’) to explicitly allowing the port (UDP 4500) to bring the tunnels up. In some cases the inspection policy also had to be modified – so double-check the firewall configuration as well.
Thanks to the VMware HCX team, field teams, PMs, support, and customers for all of the above inputs.