🔹 1. vSAN Troubleshooting (High Latency, Disk Failures, APD, PDL)
Scenario 1: vSAN Cluster Latency & Slow Performance
🔹 Symptoms:
- High read/write latency in vSAN.
- ESXi logs show “LSOM congestion detected”.
- VMs are experiencing slow disk performance.
🔹 Troubleshooting Steps:
1️⃣ Check vSAN Cluster Health
```bash
esxcli vsan health cluster get
```
- Look for Congestion, Resync Backlog, or Component Limits warnings.
2️⃣ Check if vSAN Disks Are Congested
```bash
esxcli vsan debug disk list
```
- If `Congestion` is above 60%, the cache tier is overloaded.
3️⃣ Analyze vSAN Disk Latency in esxtop
```bash
esxtop
```
- Press `d` for disk stats.
- Check `DAVG/cmd` (device latency) and `KAVG/cmd` (kernel latency).
4️⃣ Enable Adaptive Resync if High Resync Traffic Is Present
```bash
esxcli vsan perf stats get
```
- If resync traffic is excessive, enable Adaptive Resync:
```bash
esxcfg-advcfg -s 1 /VSAN/ResyncTrafficThrottling
```
5️⃣ Verify Storage Policy Compliance
```bash
esxcli vsan policy get
```
- Ensure the cluster still has enough healthy hosts and fault domains to satisfy the RAID-1 or RAID-5/6 policy after failures.
✅ Solution: If congestion is high:
- Add more cache disks to the cluster.
- Ensure vSAN Disk Groups are balanced across hosts.
- Reduce VM storage IOPS demand (use Storage I/O Control).
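The congestion check from step 2 can be scripted. This is a minimal sketch only: `parse_congestion` is a hypothetical helper, the `Device:`/`Congestion:` line format is an assumption about the `esxcli vsan debug disk list` output on your build, and the 60% threshold mirrors the rule of thumb above.

```shell
# Hypothetical helper: flag vSAN disks whose congestion exceeds 60%.
# The "Device:"/"Congestion:" line layout is assumed; adjust the awk
# patterns to match the actual esxcli output on your host.
parse_congestion() {
  awk -v max=60 '
    /^Device:/     { dev = $2 }
    /^Congestion:/ { if ($2 + 0 > max) printf "%s congestion=%s\n", dev, $2 }
  '
}

# On a host you would pipe the live output in:
#   esxcli vsan debug disk list | parse_congestion
# Sample run against canned output (assumed format):
parse_congestion <<'EOF'
Device: naa.6000c2901111
Congestion: 72
Device: naa.6000c2902222
Congestion: 12
EOF
```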
Scenario 2: vSAN Disks in “Absent” or “Degraded” State
🔹 Symptoms:
- vSAN objects show Absent/Degraded in UI.
- VMs on affected disks may freeze or crash.
🔹 Troubleshooting Steps:
1️⃣ Check vSAN Cluster Component State
```bash
esxcli vsan debug object list
```
- Look for `State: ABSENT` or `State: DEGRADED`.
2️⃣ Find Affected vSAN Hosts and Disks
```bash
esxcli vsan storage list
```
3️⃣ Check if the Disk is Still Reachable
```bash
esxcli storage core device list | grep naa.XXXXXXXX
```
4️⃣ Manually Evacuate Data from Faulty Disk
```bash
esxcli vsan storage remove -s naa.XXXXXXXX
```
5️⃣ Re-add the Disk to vSAN
```bash
esxcli vsan storage add -d naa.XXXXXXXX
```
✅ Solution:
- If the disk is permanently failed, replace it and let vSAN rebalance automatically.
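The remove/re-add steps above can be wrapped in a small dry-run script. A sketch under stated assumptions: `replace_vsan_disk` is a hypothetical helper, the naa ID is the same placeholder used above, and `RUN` defaults to `echo` so the commands are only printed until you set `RUN=` on a real host.

```shell
# Hypothetical dry-run wrapper around the evacuation steps above.
# RUN defaults to "echo", so the esxcli commands are printed, not run.
replace_vsan_disk() {
  disk="$1"
  run="${RUN:-echo}"
  $run esxcli vsan storage remove -s "$disk"
  $run esxcli vsan storage add -d "$disk"
}

replace_vsan_disk naa.XXXXXXXX
```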
🔹 2. NVMe Storage Bottleneck (High Latency, Queue Depth Issues)
Scenario 3: NVMe Datastore Has High Latency and Queue Depth Issues
🔹 Symptoms:
- NVMe datastore latency >20ms.
- ESXi logs show “Nvme Queue Depth Exceeded”.
🔹 Troubleshooting Steps:
1️⃣ Check NVMe Drive Latency
```bash
esxcli storage core device stats get -d nvmeX
```
2️⃣ Check Queue Depth Settings
```bash
esxcli storage core device list -d nvmeX | grep "Queue Depth"
```
3️⃣ Increase NVMe Queue Depth (If Needed)
```bash
esxcli system settings advanced set -o /Disk/NVMe/QueueDepth -i 128
```
4️⃣ Enable NVMe Polling for Lower Latency
```bash
esxcli system settings advanced set -o /Disk/NVMe/Polling -i 1
```
✅ Solution:
- Increase queue depth for better NVMe performance.
- Consider upgrading the storage controller firmware.
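The queue-depth check from step 2 can be turned into a quick threshold test. A hedged sketch: `check_queue_depth` is a hypothetical helper, and the `Queue Depth: N` line format mirrors the grep above but is still an assumption about the esxcli output on your build.

```shell
# Hypothetical check: warn when an NVMe device's queue depth is below
# a target value. The "Queue Depth: N" line layout is assumed.
check_queue_depth() {
  target="$1"
  awk -v want="$target" '
    /Queue Depth:/ {
      if ($3 + 0 < want) print "below target: " $3 " < " want
      else               print "ok: " $3
    }
  '
}

# On a host: esxcli storage core device list -d nvme0 | check_queue_depth 128
check_queue_depth 128 <<'EOF'
Queue Depth: 64
EOF
```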
🔹 3. iSCSI Disconnection Issues
Scenario 4: iSCSI LUNs Frequently Disconnect (APD/PDL Errors)
🔹 Symptoms:
- All Paths Down (APD) or Permanent Device Loss (PDL) errors.
- VMs freeze when accessing the iSCSI datastore.
🔹 Troubleshooting Steps:
1️⃣ Check iSCSI Session Status
```bash
esxcli iscsi session list
```
2️⃣ Verify Path State
```bash
esxcli storage core path list
```
3️⃣ Manually Restart iSCSI Service
```bash
/etc/init.d/iscsi restart
```
4️⃣ Change iSCSI Timeout Settings to Avoid APD
```bash
esxcli system settings advanced set -o /Misc/APDTimeout -i 180
```
✅ Solution:
- Ensure iSCSI multipathing is configured correctly.
- Increase APD timeout to allow for brief storage disconnections.
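When checking multipathing, a quick tally of path states helps spot a failing path before it turns into an APD event. A sketch only: `count_path_states` is a hypothetical helper and the `State: xxx` line format is an assumption about the `esxcli storage core path list` output.

```shell
# Hypothetical summary of path states; "State: active"/"State: dead"
# line layout is assumed from esxcli path list output.
count_path_states() {
  awk '
    /^ *State: *active/ { active++ }
    /^ *State: *dead/   { dead++ }
    END { printf "active=%d dead=%d\n", active, dead }
  '
}

# On a host: esxcli storage core path list | count_path_states
count_path_states <<'EOF'
State: active
State: dead
State: active
EOF
```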
🔹 4. Fibre Channel (FC) Errors
Scenario 5: FC LUNs Not Showing in ESXi
🔹 Symptoms:
- FC storage is not detected in ESXi.
- ESXi logs show “No FC paths available”.
🔹 Troubleshooting Steps:
1️⃣ Check if the FC Adapter is Recognized
```bash
esxcli storage san fc list
```
2️⃣ Check WWPN Visibility
```bash
esxcli storage san fc wwn list
```
3️⃣ Force Rescan of FC Adapters
```bash
esxcli storage core adapter rescan --adapter=vmhbaX
```
4️⃣ Check If FC LUNs Are Masked
```bash
esxcli storage core device world list
```
✅ Solution:
- Ensure FC zoning and masking are configured correctly on the storage array.
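The rescan in step 3 often needs to be repeated across every FC HBA. A hedged dry-run sketch: `rescan_fc_adapters` is a hypothetical helper that reads vmhba names from stdin, and `RUN` defaults to `echo` so the commands are printed rather than executed.

```shell
# Hypothetical dry-run loop: rescan every FC adapter named on stdin.
# RUN defaults to "echo"; clear it on a real host to actually rescan.
rescan_fc_adapters() {
  run="${RUN:-echo}"
  while read -r hba; do
    [ -n "$hba" ] && $run esxcli storage core adapter rescan --adapter="$hba"
  done
}

rescan_fc_adapters <<'EOF'
vmhba2
vmhba3
EOF
```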
🔹 5. RDMA (NVMe-oF, RoCE) Performance Issues
Scenario 6: RoCE/NVMe-oF Performance Degradation
🔹 Symptoms:
- High latency despite using NVMe over Fabric (NVMe-oF) with RoCE.
🔹 Troubleshooting Steps:
1️⃣ Verify RoCE Is Enabled
```bash
esxcli system module list | grep roce
```
2️⃣ Check RDMA Bandwidth Utilization
```bash
esxtop
```
- Press `n` for network stats and check RDMA adapter usage.
3️⃣ Optimize RoCE Performance with ECN and PFC
- Enable Explicit Congestion Notification (ECN).
- Enable Priority Flow Control (PFC) on switches.
✅ Solution:
- Ensure RoCE network switches have lossless fabric enabled.
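The module check in step 1 can be reduced to a simple pass/fail report. A sketch only: `roce_loaded` is a hypothetical helper, and RDMA/RoCE module names vary by NIC vendor, so the sample names below are illustrative.

```shell
# Hypothetical check: report whether any RoCE-related module appears
# in the module list fed on stdin (names vary by NIC vendor).
roce_loaded() {
  grep -qi roce && echo "roce module present" || echo "roce module missing"
}

# On a host: esxcli system module list | roce_loaded
roce_loaded <<'EOF'
nrdma
nmlx5_rdma_roce
EOF
```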