If you’re seeing ESXi hosts disconnect from vCenter every ~60 seconds for a very brief period and then immediately reconnect, with no impact on VM network traffic (i.e., nothing else is going offline or dropping packets), read on. Chances are your issues relate to vCenter not receiving heartbeats from those hosts.
ESXi hosts send UDP heartbeats to vCenter (destination port 902) every 10 seconds. By default, if vCenter doesn’t see a heartbeat from a host within 60 seconds (roughly six consecutive missed heartbeats), the host goes into a Disconnected state, though it typically recovers almost immediately unless there are other network issues.
Confirming the issue
We can verify that this is the root cause of our disconnections in multiple ways, but these two are my preferred:
- vCenter Web UI
- tcpdump
Verifying ESXi host disconnections via vCenter Web UI
Go to vCenter > Monitor > Events and filter for "not responding" in the Description field. You should see the host(s) in question disconnecting roughly every 60 seconds.
Verifying ESXi host disconnections via tcpdump
SSH into vCenter (you will need shell access enabled) and run:
tcpdump -n udp dst portrange 902-902
This shows every UDP packet vCenter receives with destination port 902, which is the heartbeat traffic.
In this case we have two groups of hosts: the first on 10.250.6.0/24, which are working as expected, and the second on 10.250.7.0/24, which are experiencing the disconnection issues. Let’s have a look at the output. In this example our vCenter appliance is at 10.250.10.25.
root@vc1 [ ~ ]# tcpdump -n udp dst portrange 902-902
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
16:25:20.831491 IP 10.250.6.12.42007 > 10.250.10.25.902: UDP, length 334
16:25:21.073519 IP 10.250.6.11.16964 > 10.250.10.25.902: UDP, length 334
16:25:30.837316 IP 10.250.6.12.11110 > 10.250.10.25.902: UDP, length 334
16:25:31.075792 IP 10.250.6.11.34471 > 10.250.10.25.902: UDP, length 334
We can see vCenter is receiving heartbeat packets from two hosts on the 10.250.6.0/24 network, but none from 10.250.7.0/24.
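On a busy vCenter the raw capture scrolls quickly, so it can help to collapse it into a per-host count. The following is a minimal sketch, assuming the tcpdump text format shown above; the function name heartbeat_sources is made up for illustration and is not part of any VMware tooling.

```shell
# Summarize heartbeat senders from tcpdump text output read on stdin.
# Assumes lines shaped like:
#   16:25:20.831491 IP 10.250.6.12.42007 > 10.250.10.25.902: UDP, length 334
heartbeat_sources() {
    awk '$2 == "IP" && $4 == ">" {
        src = $3
        sub(/\.[0-9]+$/, "", src)   # drop the trailing source port
        count[src]++
    }
    END { for (ip in count) print ip, count[ip] }' | sort
}
```

For example, capturing a fixed number of packets with tcpdump's -c option and piping the result through the function (tcpdump -n -c 20 udp dst portrange 902-902 | heartbeat_sources) should list one line per sending host with its packet count. Any host missing from that list is a candidate for the heartbeat problem.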
Next let’s verify our ESXi hosts are sending the heartbeat packets (and that they’re being sent to the correct IP).
Verifying ESXi hosts are sending heartbeat packets
SSH into the ESXi host(s) and once again let’s run tcpdump:
tcpdump-uw -n udp dst portrange 902-902
[root@esxi1:~] tcpdump-uw -n udp dst portrange 902-902
tcpdump-uw: verbose output suppressed, use -v or -vv for full protocol decode
listening on vmk0, link-type EN10MB (Ethernet), capture size 262144 bytes
06:16:42.248122 IP 10.250.7.11.36024 > 10.250.10.25.902: UDP, length 332
06:16:52.254367 IP 10.250.7.11.31275 > 10.250.10.25.902: UDP, length 332
So we can definitely see the packets are being sent out and are going to the correct IP. However, they’re not being received by vCenter.
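If you also want to confirm the host is sending on the expected ~10 second cadence, the timestamps in the capture can be compared. This is only a sketch under a few assumptions: the function name heartbeat_max_gap is hypothetical, the input is tcpdump text in the format shown above, and the capture does not cross midnight (timestamps carry no date).

```shell
# Print the largest gap, in seconds, between consecutive heartbeats per source IP.
# Hosts that appear only once in the capture produce no output line.
heartbeat_max_gap() {
    awk '$2 == "IP" && $4 == ">" {
        src = $3
        sub(/\.[0-9]+$/, "", src)              # drop the trailing source port
        split($1, t, ":")                       # timestamp is HH:MM:SS.micro
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (src in last) {
            gap = now - last[src]
            if (gap > max[src]) max[src] = gap
        }
        last[src] = now
    }
    END { for (ip in max) printf "%s %.1f\n", ip, max[ip] }' | sort
}
```

Fed the two packets from the capture above, this reports a gap of about 10 seconds for 10.250.7.11. Run against a capture taken on vCenter instead, a gap approaching 60 seconds for a host would line up with the disconnect events.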
At this point the issue was resolved by an update to the firewall rules between the ESXi hosts and vCenter. Your root cause may be different, but either way you’ll be able to identify whether missing heartbeats are the cause of the disconnects.
Confirming the issue is resolved
Once the root cause has been identified and fixed in your environment, we can once again look at a packet dump on vCenter and confirm the packets are being received (and check that the logs no longer show disconnects):
root@vc1 [ ~ ]# tcpdump -n udp dst portrange 902-902
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
16:49:11.519034 IP 10.250.6.11.11804 > 10.250.10.25.902: UDP, length 334
16:49:11.560495 IP 10.250.6.12.14014 > 10.250.10.25.902: UDP, length 334
16:49:12.272670 IP 10.250.7.12.27974 > 10.250.10.25.902: UDP, length 332
16:49:12.345234 IP 10.250.7.11.15032 > 10.250.10.25.902: UDP, length 332
There we have it, heartbeats being received from hosts on both ESXi networks.
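With more than a handful of hosts, eyeballing the capture gets tedious. As a rough sketch, the check can be turned around: give it the list of host IPs you expect and have it print any that never show up. The function name missing_hosts is invented for this example, and it assumes the same tcpdump text format as above.

```shell
# Print each expected host IP (passed as arguments) that never appears as a
# heartbeat source in the tcpdump text read from stdin.
missing_hosts() {
    seen=$(awk '$2 == "IP" && $4 == ">" {
        src = $3
        sub(/\.[0-9]+$/, "", src)   # drop the trailing source port
        print src
    }' | sort -u)
    for ip in "$@"; do
        # -x: match the whole line, -F: treat the IP as a literal string
        printf '%s\n' "$seen" | grep -qxF "$ip" || printf '%s\n' "$ip"
    done
}
```

For the environment in this post, piping a capture of a few dozen packets into missing_hosts 10.250.6.11 10.250.6.12 10.250.7.11 10.250.7.12 should print nothing once the fix is in place. Before the fix, it would be expected to list the 10.250.7.0/24 hosts.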