Refer to the Index page for basic principles, HCX command line basics, and links to the other parts in the series.

This part of the troubleshooting series is sub-divided into the following sections.

  1. Network Profiles
  2. Compute Profiles
  3. Service Mesh Creation
  4. HCX Tunnels

1. Network Profiles

Refer to the user guide for HCX network profile concepts and considerations. Below are some of the issues typically seen during network profile creation, with potential fixes.

Issue: Management and vMotion portgroups do not show up in the list of networks available for the network profile.

Resolution: The management and vMotion portgroups are created as ‘VMkernel’ type portgroups, which cannot be used by virtual machines (HCX appliances are deployed as VMs). Create a new ‘Virtual Machine’ type portgroup on the same virtual switch and use that in the network profile. This is a simple step and does not need any other changes.

(Figure: VMs use the Virtual Machine portgroup type, not VMkernel)
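If the hosts use a standard vSwitch, the portgroup can also be created from the ESXi shell. A minimal sketch, assuming a standard vSwitch named vSwitch0; the portgroup name and VLAN ID are illustrative, and for a Distributed Switch the portgroup should be created in the vSphere Client instead:

# On each ESXi host (standard vSwitch only) - create a VM portgroup on the
# same virtual switch that carries the management/vMotion VMkernel ports.
# 'HCX-Mgmt-VM' and 'vSwitch0' are placeholder names; substitute your own.
esxcli network vswitch standard portgroup add --portgroup-name=HCX-Mgmt-VM --vswitch-name=vSwitch0

# Optionally set the VLAN to match the management VMkernel portgroup.
esxcli network vswitch standard portgroup set --portgroup-name=HCX-Mgmt-VM --vlan-id=100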

2. Compute Profiles

Compute profiles are generally straightforward and rarely cause issues during creation. One key aspect is to correctly define the ‘Service Cluster’ that contains the target workloads to be migrated, and the ‘Deployment Cluster’ in which the service mesh appliances will be deployed. Both are described well in the User Guide.

3. Service Mesh Creation

Service mesh creation is when the Interconnect (migration), WanOpt and Network Extension appliances are deployed. The appliance images are downloaded from the HCX image depot on the Internet. The traffic flow involved in downloading the images is shown below:

(Figure: HCX downloading mesh appliances from hybridity-depot)

Possible Causes: If the download itself fails, check whether the HCX Connector has Internet access (outbound HTTPS), either directly or via a proxy.
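A quick way to verify this from the Connector shell; a sketch only, assuming curl is available on the appliance, with a placeholder proxy address (check the HCX documentation for the current list of required URLs):

# Check outbound HTTPS, directly and via proxy.
# proxy.example.com:3128 is a placeholder; use your proxy if one is configured.
[admin@hcx-manager-enterprise ~]$ curl -sS -o /dev/null -w "%{http_code}\n" https://connect.hcx.vmware.com
[admin@hcx-manager-enterprise ~]$ curl -sS -o /dev/null -w "%{http_code}\n" --proxy http://proxy.example.com:3128 https://connect.hcx.vmware.com

Any HTTP status code in the output means the endpoint is reachable; a timeout or connection error means it is not.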

The mesh creation process also involves importing the images into vSphere, which needs access to the ESXi management IPs. In short, the internal flows (shown in green in the diagram below) must be possible from the HCX Connector.

(Figure: HCX OVF import)

If there are issues during this step, confirm that the ESXi management IP is reachable from the Connector on port 443:

[admin@hcx-manager-enterprise ~]$ telnet esx-11 443
Trying 10.yy.xx.11...
Connected to esx-11.
Escape character is '^]'.

Symptoms: Even when the port is open and the HCX Connector is in the same subnet as the management hosts, the service mesh deployment still fails. A few examples below.

Service Mesh creation failed. Deploy and Configuration of Interconnect Appliances Failed. Interconnect Service Workflow DeployAppliance failed. Error: [“Interconnect Service Workflow OvfUpload failed”]

Service Mesh creation failed. Interconnect Service Workflow GenerateAndPostConfig failed. Error: [“File Upload is unsuccessful”,”File Upload is unsuccessful”]

Cause: The primary reason for these failures is that the HCX Connector is configured to use a proxy for all HTTPS connections, so it goes through the proxy even for local traffic.

Resolution: Add proxy exclusions for the local addresses, as described in Part 1 of the series. A sketch of typical exclusion entries follows.
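As an illustration only (the values below are placeholders, and Part 1 covers the details), the exclusion list in the HCX Manager proxy configuration typically needs the local management networks and the vCenter address:

# Example proxy exclusion entries (illustrative values only):
10.yy.xx.0/24          # local management subnet (ESXi hosts)
vcenter.corp.local     # vCenter FQDN, if registered by name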

There could be additional errors for which KB articles are available:

Refer to KB79003 for “Mobility Agent deployment fails with vCenter certificate management set to ‘custom’ mode”.

Refer to KB83303 for “Failed to authenticate with the guest operating system using the supplied credentials”.

4. HCX Tunnels

Below is the traffic flow for HCX tunnels:

(Figure: HCX Tunnels)

Get the local and remote IPs from the Service Mesh > Appliances page to check against the firewall logs. Appliances are deployed in pairs that connect to each other; for example, the IX Initiator (I1) on-prem connects to the IX Receiver (R1) in the cloud.

(Figure: HCX appliances local and remote IPs)

Symptoms: Appliances in the mesh are deployed, but the tunnels don’t come up.

Possible Causes: The number one reason for HCX tunnels not coming up is security policy: firewall rules are either implemented incorrectly, or the traffic is not allowed at all.

How to troubleshoot? The HCX IX and NE appliances continuously send IPsec traffic on UDP port 4500, and this should be visible in the firewall logs. If the firewall does not see this traffic at all, check the routing.

In a working system, requests and responses are seen from both the local and remote IPs. In the example below, no response is seen from the remote IP:

[admin@hcx-manager-enterprise ~]$ ccli
Welcome to HCX Central CLI

[admin@hcx-manager-enterprise] list
|----------------------------------------------------------------------------------|
| Id | Node                              | Address          | State     | Selected |
|----------------------------------------------------------------------------------|
| 0  | OnPrem-AVS-IX-I1                  | 10.xx.yy.50:9443 | Connected |          |
| 1  | OnPrem-AVS-NE-I1                  | 10.xx.yy.54:9443 | Connected |          |
| 2  | OnPrem-Test-IX-I1                 | 10.xx.yy.52:9443 | Connected |          |
| 3  | OnPrem-Test-NE-I1                 | 10.xx.yy.53:9443 | Connected |          |
|----------------------------------------------------------------------------------|

[admin@hcx-manager-enterprise] go 2
Switched to node 2.
[admin@hcx-manager-enterprise] ssh
Welcome to HCX Central CLI

[root@OnPrem-Test-IX-I1 ~]# tcpdump -c 10 -ni vNic_0 udp and port 4500
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vNic_0, link-type EN10MB (Ethernet), capture size 262144 bytes
08:48:48.126597 IP 10.xx.yy.52.4500 > 10.xx.yy.36.4500: UDP-encap: ESP(spi=0x45000054,seq=0x1ee04000), length 84
..
08:48:54.648210 IP 10.xx.yy.52.4500 > 10.xx.yy.36.4500: UDP-encap: ESP(spi=0x45000054,seq=0x30814000), length 84
08:48:55.148946 IP 10.xx.yy.52.4500 > 10.xx.yy.36.4500: UDP-encap: ESP(spi=0x45000054,seq=0x311c4000), length 84
08:48:55.649733 IP 10.xx.yy.52.4500 > 10.xx.yy.36.4500: UDP-encap: ESP(spi=0x45000054,seq=0x31b84000), length 84
10 packets captured
10 packets received by filter
0 packets dropped by kernel

[root@OnPrem-Test-IX-I1 ~]# exit
logout
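
Since only outbound packets are seen above, it is also worth confirming the appliance's uplink routing before blaming the firewall. A quick sketch from the same appliance shell, assuming the standard Linux networking tools are present (the remote IP is the sample address from above):

[root@OnPrem-Test-IX-I1 ~]# ip route              # verify the default route points at the uplink gateway
[root@OnPrem-Test-IX-I1 ~]# ping -c 3 10.xx.yy.36 # is the remote tunnel IP reachable at all?

Note that ICMP may be blocked even when UDP 4500 is allowed, so a failed ping is not conclusive on its own.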

In a working system, communication is seen both ways:

[admin@hcx-manager-enterprise ~]$ ccli
Welcome to HCX Central CLI
[admin@hcx-manager-enterprise] go 0
Switched to node 0.
[admin@hcx-manager-enterprise] ssh
Welcome to HCX Central CLI

[root@OnPrem-AVS-IX-I1 ~]# tcpdump -c 10 -ni vNic_0 udp and port 4500
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vNic_0, link-type EN10MB (Ethernet), capture size 262144 bytes
08:53:59.829110 IP 10.xx.yy.2.4500 > 10.xx.yy.50.4500: UDP-encap: ESP(spi=0x4500009c,seq=0x52180000), length 156
08:54:00.503192 IP 10.xx.yy.2.4500 > 10.xx.yy.50.4500: UDP-encap: ESP(spi=0x4500009c,seq=0x54090000), length 156
08:54:00.503428 IP 10.xx.yy.50.4500 > 10.xx.yy.2.4500: UDP-encap: ESP(spi=0x4500009c,seq=0xa03c0000), length 156
08:54:00.503195 IP 10.xx.yy.2.4500 > 10.xx.yy.50.4500: UDP-encap: ESP(spi=0x4500009c,seq=0x1be0000), length 156
..
08:54:00.545396 IP 10.xx.yy.50.4500 > 10.xx.yy.2.4500: UDP-encap: ESP(spi=0x4500009c,seq=0x3ca60000), length 156
10 packets captured
10 packets received by filter
0 packets dropped by kernel

Firewall quirks: A few customers had to change the firewall policy from an application-based rule (‘IPsec’) to one explicitly allowing the port (UDP 4500) to bring the tunnels up. In some cases the inspection policy also had to be modified, so double-check the firewall configuration as well. An illustration of the rule shape is shown below.
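As an illustration only, using Linux iptables syntax as a stand-in (real vendor syntax differs; the IPs are the sample uplink addresses from the captures above):

# Allow IPsec NAT-T (UDP 4500) between the HCX uplink IPs, both directions.
iptables -A FORWARD -p udp -s 10.xx.yy.52 -d 10.xx.yy.36 --dport 4500 -j ACCEPT
iptables -A FORWARD -p udp -s 10.xx.yy.36 -d 10.xx.yy.52 --dport 4500 -j ACCEPT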

Thanks to the VMware HCX team, field teams, PMs, support, and customers for all of the above inputs.
