🔹 1. vSAN Troubleshooting (High Latency, Disk Failures, APD, PDL)
Scenario 1: vSAN Cluster Latency & Slow Performance
🔹 Symptoms:
- High read/write latency in vSAN.
- ESXi logs show “LSOM congestion detected”.
- VMs are experiencing slow disk performance.
🔹 Troubleshooting Steps:
1️⃣ Check vSAN Cluster Health
```bash
esxcli vsan health cluster get
```
- Look for Congestion, Resync Backlog, or Component Limits warnings.
2️⃣ Check if vSAN Disks Are Congested
```bash
esxcli vsan debug disk list
```
- If `Congestion` is above 60%, the cache tier is overloaded.
3️⃣ Analyze vSAN Disk Latency in esxtop
```bash
esxtop
```
- Press `d` for disk stats.
- Check `DAVG/cmd` (device latency) and `KAVG/cmd` (kernel latency).
4️⃣ Enable Adaptive Resync if High Resync Traffic Is Present
```bash
esxcli vsan perf stats get
```
- If resync traffic is excessive, enable Adaptive Resync:
```bash
esxcfg-advcfg -s 1 /VSAN/ResyncTrafficThrottling
```
5️⃣ Verify Storage Policy Compliance
```bash
esxcli vsan policy get
```
- Ensure the cluster still has enough healthy hosts and fault domains to satisfy the RAID-1 or RAID-5/6 policy after failures.
✅ Solution: If congestion is high:
- Add more cache disks to the cluster.
- Ensure vSAN Disk Groups are balanced across hosts.
- Reduce VM storage IOPS demand (use Storage I/O Control).
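The congestion check from step 2 can be scripted. This is a minimal sketch only: `parse_congestion` is a hypothetical helper, the `Device:`/`Congestion:` line format is an assumption about the `esxcli vsan debug disk list` output on your build, and the 60% threshold mirrors the rule of thumb above.

```shell
# Hypothetical helper: flag vSAN disks whose congestion exceeds 60%.
# The "Device:"/"Congestion:" line layout is assumed; adjust the awk
# patterns to match the actual esxcli output on your host.
parse_congestion() {
  awk -v max=60 '
    /^Device:/     { dev = $2 }
    /^Congestion:/ { if ($2 + 0 > max) printf "%s congestion=%s\n", dev, $2 }
  '
}

# On a host you would pipe the live output in:
#   esxcli vsan debug disk list | parse_congestion
# Sample run against canned output (assumed format):
parse_congestion <<'EOF'
Device: naa.6000c2901111
Congestion: 72
Device: naa.6000c2902222
Congestion: 12
EOF
```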
Scenario 2: vSAN Disks in “Absent” or “Degraded” State
🔹 Symptoms:
- vSAN objects show Absent/Degraded in UI.
- VMs on affected disks may freeze or crash.
🔹 Troubleshooting Steps:
1️⃣ Check vSAN Cluster Component State
```bash
esxcli vsan debug object list
```
- Look for `State: ABSENT` or `State: DEGRADED`.
2️⃣ Find Affected vSAN Hosts and Disks
```bash
esxcli vsan storage list
```
3️⃣ Check if the Disk is Still Reachable
```bash
esxcli storage core device list | grep naa.XXXXXXXX
```
4️⃣ Manually Evacuate Data from Faulty Disk
```bash
esxcli vsan storage remove -s naa.XXXXXXXX
```
5️⃣ Re-add the Disk to vSAN
```bash
esxcli vsan storage add -d naa.XXXXXXXX
```
✅ Solution:
- If the disk is permanently failed, replace it and let vSAN rebalance automatically.
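The remove/re-add steps above can be wrapped in a small dry-run script. A sketch under stated assumptions: `replace_vsan_disk` is a hypothetical helper, the naa ID is the same placeholder used above, and `RUN` defaults to `echo` so the commands are only printed until you set `RUN=` on a real host.

```shell
# Hypothetical dry-run wrapper around the evacuation steps above.
# RUN defaults to "echo", so the esxcli commands are printed, not run.
replace_vsan_disk() {
  disk="$1"
  run="${RUN:-echo}"
  $run esxcli vsan storage remove -s "$disk"
  $run esxcli vsan storage add -d "$disk"
}

replace_vsan_disk naa.XXXXXXXX
```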
🔹 2. NVMe Storage Bottleneck (High Latency, Queue Depth Issues)
Scenario 3: NVMe Datastore Has High Latency and Queue Depth Issues
🔹 Symptoms:
- NVMe datastore latency >20ms.
- ESXi logs show “Nvme Queue Depth Exceeded”.
🔹 Troubleshooting Steps:
1️⃣ Check NVMe Drive Latency
```bash
esxcli storage core device stats get -d nvmeX
```
2️⃣ Check Queue Depth Settings
```bash
esxcli storage core device list -d nvmeX | grep "Queue Depth"
```
3️⃣ Increase NVMe Queue Depth (If Needed)
```bash
esxcli system settings advanced set -o /Disk/NVMe/QueueDepth -i 128
```
4️⃣ Enable NVMe Polling for Lower Latency
```bash
esxcli system settings advanced set -o /Disk/NVMe/Polling -i 1
```
✅ Solution:
- Increase queue depth for better NVMe performance.
- Consider upgrading the storage controller firmware.
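The queue-depth check from step 2 can be turned into a quick threshold test. A hedged sketch: `check_queue_depth` is a hypothetical helper, and the `Queue Depth: N` line format mirrors the grep above but is still an assumption about the esxcli output on your build.

```shell
# Hypothetical check: warn when an NVMe device's queue depth is below
# a target value. The "Queue Depth: N" line layout is assumed.
check_queue_depth() {
  target="$1"
  awk -v want="$target" '
    /Queue Depth:/ {
      if ($3 + 0 < want) print "below target: " $3 " < " want
      else               print "ok: " $3
    }
  '
}

# On a host: esxcli storage core device list -d nvme0 | check_queue_depth 128
check_queue_depth 128 <<'EOF'
Queue Depth: 64
EOF
```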
🔹 3. iSCSI Disconnection Issues
Scenario 4: iSCSI LUNs Frequently Disconnect (APD/PDL Errors)
🔹 Symptoms:
- All Paths Down (APD) or Permanent Device Loss (PDL) errors.
- VMs freeze when accessing the iSCSI datastore.
🔹 Troubleshooting Steps:
1️⃣ Check iSCSI Session Status
```bash
esxcli iscsi session list
```
2️⃣ Verify Path State
```bash
esxcli storage core path list
```
3️⃣ Manually Restart iSCSI Service
```bash
/etc/init.d/iscsi restart
```
4️⃣ Change iSCSI Timeout Settings to Avoid APD
```bash
esxcli system settings advanced set -o /Misc/APDTimeout -i 180
```
✅ Solution:
- Ensure iSCSI multipathing is configured correctly.
- Increase APD timeout to allow for brief storage disconnections.
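When checking multipathing, a quick tally of path states helps spot a failing path before it turns into an APD event. A sketch only: `count_path_states` is a hypothetical helper and the `State: xxx` line format is an assumption about the `esxcli storage core path list` output.

```shell
# Hypothetical summary of path states; "State: active"/"State: dead"
# line layout is assumed from esxcli path list output.
count_path_states() {
  awk '
    /^ *State: *active/ { active++ }
    /^ *State: *dead/   { dead++ }
    END { printf "active=%d dead=%d\n", active, dead }
  '
}

# On a host: esxcli storage core path list | count_path_states
count_path_states <<'EOF'
State: active
State: dead
State: active
EOF
```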
🔹 4. Fibre Channel (FC) Errors
Scenario 5: FC LUNs Not Showing in ESXi
🔹 Symptoms:
- FC storage is not detected in ESXi.
- ESXi logs show “No FC paths available”.
🔹 Troubleshooting Steps:
1️⃣ Check if the FC Adapter is Recognized
```bash
esxcli storage san fc list
```
2️⃣ Check WWPN Visibility
```bash
esxcli storage san fc wwn list
```
3️⃣ Force Rescan of FC Adapters
```bash
esxcli storage core adapter rescan --adapter=vmhbaX
```
4️⃣ Check If FC LUNs Are Masked
```bash
esxcli storage core device world list
```
✅ Solution:
- Ensure FC zoning and masking are configured correctly on the storage array.
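The rescan in step 3 often needs to be repeated across every FC HBA. A hedged dry-run sketch: `rescan_fc_adapters` is a hypothetical helper that reads vmhba names from stdin, and `RUN` defaults to `echo` so the commands are printed rather than executed.

```shell
# Hypothetical dry-run loop: rescan every FC adapter named on stdin.
# RUN defaults to "echo"; clear it on a real host to actually rescan.
rescan_fc_adapters() {
  run="${RUN:-echo}"
  while read -r hba; do
    [ -n "$hba" ] && $run esxcli storage core adapter rescan --adapter="$hba"
  done
}

rescan_fc_adapters <<'EOF'
vmhba2
vmhba3
EOF
```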
🔹 5. RDMA (NVMe-oF, RoCE) Performance Issues
Scenario 6: RoCE/NVMe-oF Performance Degradation
🔹 Symptoms:
- High latency despite using NVMe over Fabric (NVMe-oF) with RoCE.
🔹 Troubleshooting Steps:
1️⃣ Verify RoCE Is Enabled
```bash
esxcli system module list | grep roce
```
2️⃣ Check RDMA Bandwidth Utilization
```bash
esxtop
```
- Press `n` for network stats and check RDMA adapter usage.
3️⃣ Optimize RoCE Performance with ECN and PFC
- Enable Explicit Congestion Notification (ECN).
- Enable Priority Flow Control (PFC) on switches.
✅ Solution:
- Ensure RoCE network switches have lossless fabric enabled.
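The module check in step 1 can be reduced to a simple pass/fail report. A sketch only: `roce_loaded` is a hypothetical helper, and RDMA/RoCE module names vary by NIC vendor, so the sample names below are illustrative.

```shell
# Hypothetical check: report whether any RoCE-related module appears
# in the module list fed on stdin (names vary by NIC vendor).
roce_loaded() {
  grep -qi roce && echo "roce module present" || echo "roce module missing"
}

# On a host: esxcli system module list | roce_loaded
roce_loaded <<'EOF'
nrdma
nmlx5_rdma_roce
EOF
```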