Remote Proof of Concept testing seems to have been gaining in popularity recently. The major difference between on-site and remote testing is access to the hardware: on site you can physically unplug a drive or cause a network failure. For disk failure testing in a vSAN cluster, I use the vSAN Disk Fault Injection script that ships with ESXi. There is no need to download anything; it is there by default under `/usr/lib/vmware/vsan/bin`. Use the script for POC/homelab only.
![](https://www.softwaredefinedblog.com/wp-content/uploads/2020/05/image-5-1024x461.png)
We need a device ID to run the script. We can test a cache or a capacity drive in a chosen disk group. In the example below I picked `mpx.vmhba2:C0:T0:L0`, which was a cache drive (`Is Capacity Tier: false`).
You can use `esxcli vsan storage list` for that:
![](https://www.softwaredefinedblog.com/wp-content/uploads/2020/05/Screen-Shot-2020-05-21-at-3.39.22-PM-790x1024.png)
Or check in the vCenter console under Storage Devices:
![](https://www.softwaredefinedblog.com/wp-content/uploads/2020/05/Screen-Shot-2020-05-21-at-3.40.46-PM-1024x279.png)
Or under Disk Management:
![](https://www.softwaredefinedblog.com/wp-content/uploads/2020/05/Screen-Shot-2020-05-21-at-3.41.57-PM-1024x631.png)
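If you prefer to script the `esxcli` lookup, here is a minimal Python sketch that pulls the device ID and tier out of `esxcli vsan storage list` text. It assumes the output consists of indented `Key: value` lines including `Device:` and `Is Capacity Tier:`. The `SAMPLE_OUTPUT` text is an abbreviated stand-in, and the second device in it (`mpx.vmhba2:C0:T1:L0`) is hypothetical; the function names are my own.

```python
# Sketch: map vSAN device IDs to their tier from `esxcli vsan storage list` text.
# SAMPLE_OUTPUT is an abbreviated stand-in for real output; the second device
# (mpx.vmhba2:C0:T1:L0) is hypothetical and only there for illustration.

SAMPLE_OUTPUT = """\
mpx.vmhba2:C0:T0:L0
   Device: mpx.vmhba2:C0:T0:L0
   Is SSD: true
   Is Capacity Tier: false

mpx.vmhba2:C0:T1:L0
   Device: mpx.vmhba2:C0:T1:L0
   Is SSD: true
   Is Capacity Tier: true
"""


def parse_vsan_devices(text):
    """Return {device_id: is_capacity_tier} from 'Key: value' lines."""
    tiers = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Device:"):
            # Split only on the first colon; device IDs contain colons too.
            current = stripped.split(":", 1)[1].strip()
        elif stripped.startswith("Is Capacity Tier:") and current is not None:
            value = stripped.split(":", 1)[1].strip().lower()
            tiers[current] = (value == "true")
    return tiers


def cache_devices(text):
    """Return the device IDs that are NOT capacity tier, i.e. cache drives."""
    return [dev for dev, is_capacity in parse_vsan_devices(text).items()
            if not is_capacity]


if __name__ == "__main__":
    for device in cache_devices(SAMPLE_OUTPUT):
        print("cache drive:", device)
```

On a real host you would feed the function the actual command output instead of the sample text.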
`python vsanDiskFaultInjection.pyc` has the following options:
![](https://www.softwaredefinedblog.com/wp-content/uploads/2020/05/Screen-Shot-2020-05-21-at-3.47.24-PM-1024x429.png)
I am using `-u` to inject a hot unplug.
![](https://www.softwaredefinedblog.com/wp-content/uploads/2020/05/Screen-Shot-2020-05-21-at-3.54.56-PM-1024x85.png)
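For repeatable POC runs, the invocation can be wrapped in a small helper. The sketch below only builds and prints the command by default (a dry run). It assumes the device is passed with `-d`, so check the script's own usage screen on your build before relying on it. The helper names are my own, and this should only ever be pointed at a POC/homelab host.

```python
import subprocess

# Location and script name as described in this post.
SCRIPT_PATH = "/usr/lib/vmware/vsan/bin/vsanDiskFaultInjection.pyc"


def hot_unplug_command(device_id):
    """Build the argv list for a hot-unplug injection on one device."""
    if not device_id or any(ch.isspace() for ch in device_id):
        raise ValueError("device_id must be a non-empty ID without spaces")
    # -u injects a hot unplug; -d (assumed here) names the target device.
    return ["python", SCRIPT_PATH, "-u", "-d", device_id]


def inject_hot_unplug(device_id, dry_run=True):
    """Print the command (dry run) or actually execute it on the ESXi host."""
    cmd = hot_unplug_command(device_id)
    if dry_run:
        print("Would run:", " ".join(cmd))
        return cmd
    subprocess.check_call(cmd)  # raises CalledProcessError on a non-zero exit
    return cmd


if __name__ == "__main__":
    inject_hot_unplug("mpx.vmhba2:C0:T0:L0")  # dry run by default
```

Keeping `dry_run=True` as the default makes it harder to fail a disk by accident while you are still picking the right device ID.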
`/var/log/vmkernel.log` is the place where you can verify the disk status:
![](https://www.softwaredefinedblog.com/wp-content/uploads/2020/05/Screen-Shot-2020-05-21-at-3.55.15-PM.png)
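To avoid scrolling through the whole log, you can filter it for the device you just failed. This is a rough sketch that assumes the relevant messages mention the device ID verbatim; the function names and the `last` parameter are my own.

```python
def lines_mentioning(lines, device_id):
    """Keep only the lines that contain the device ID, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if device_id in line]


def device_log_lines(device_id, log_path="/var/log/vmkernel.log", last=20):
    """Return up to `last` (a positive number) most recent matching log lines."""
    with open(log_path, "r", errors="replace") as handle:
        matches = lines_mentioning(handle, device_id)
    return matches[-last:]


if __name__ == "__main__":
    # Run on the ESXi host after injecting the fault.
    for entry in device_log_lines("mpx.vmhba2:C0:T0:L0"):
        print(entry)
```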
vSAN -> Disk Management will also show what is going on with a disk group that has experienced a drive failure.
![](https://www.softwaredefinedblog.com/wp-content/uploads/2020/05/Screen-Shot-2020-05-21-at-3.55.51-PM-1024x522.png)
And now we can observe the status of the data and watch vSAN resync objects to bring them back into “compliance” with their storage policy.
![](https://www.softwaredefinedblog.com/wp-content/uploads/2020/05/Screen-Shot-2020-05-21-at-4.15.24-PM-1024x527.png)
After we are done with the testing, a simple rescan for new storage devices on the host will resolve the issue.
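The rescan can be done from the vSphere Client, or from the host shell with `esxcli storage core adapter rescan --all`. If you are already scripting the test, a sketch of a matching cleanup helper could look like this (the function name and the dry-run default are my own choices):

```python
import subprocess

# Rescan all storage adapters on the host for new or returned devices.
RESCAN_COMMAND = ["esxcli", "storage", "core", "adapter", "rescan", "--all"]


def rescan_storage(dry_run=True):
    """Print the rescan command (dry run) or execute it on the ESXi host."""
    cmd = list(RESCAN_COMMAND)
    if dry_run:
        print("Would run:", " ".join(cmd))
        return cmd
    subprocess.check_call(cmd)  # raises CalledProcessError on a non-zero exit
    return cmd


if __name__ == "__main__":
    rescan_storage()  # dry run by default
```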