Friday, November 12, 2010

Windows 2008: VSS - Deleting shadow copies results in offline disks

When using a VSS Hardware Provider on Windows 2008, I noticed that under certain circumstances, deleting shadow copies resulted in offline disks in Disk Manager and diskpart.

A "DELETE SHADOWS <xxx>" command in diskshadow.exe would always work, but a subsequent "LIST DISK" in diskpart.exe would show a disk as "Offline". Also, the VDS service would often freak out when this happened: it could just outright crash with an access violation, or it would just complain in the event log.

Manually performing a rescan (using "Rescan Disks" in Disk Manager, or issuing the "rescan" command in diskpart) would always clear out the offline disk.
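
For reference, the manual rescan can also be scripted. The following is only a minimal sketch in Python: it assumes an elevated prompt on the affected Windows machine and relies on diskpart's /s switch to read commands from a file. The helper names build_rescan_script and run_diskpart_rescan are made up for this example.

```python
import os
import subprocess
import tempfile


def build_rescan_script() -> str:
    """Return the text of a diskpart script that rescans and then lists disks."""
    return "rescan\nlist disk\n"


def run_diskpart_rescan() -> str:
    """Run diskpart with the rescan script and return its console output.

    Assumption: this runs on Windows from an elevated prompt.
    """
    # diskpart reads its commands from a script file passed with /s.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(build_rescan_script())
        path = f.name
    try:
        result = subprocess.run(
            ["diskpart", "/s", path],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout
    finally:
        # Clean up the temporary script whether or not diskpart succeeded.
        os.remove(path)
```

Calling run_diskpart_rescan() and looking at the returned "list disk" output is a quick way to check whether the offline disk is gone.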

After an interesting support case involving both the vendor of the VSS Hardware Provider and Microsoft, the explanation turned out to be quite simple.

When deleting shadow copies, it's the job of the hardware provider to instruct the storage array to mask away the LUNs containing the to-be-deleted shadow copy. After the hardware provider is done, Windows will automatically perform a disk rescan to get rid of the now no longer visible LUNs.

When using a storage driver of the Storport model, storport.sys will be involved in the disk rescan. It turned out that storport.sys has an undocumented cooldown on performing these disk rescans: after performing one, storport.sys will ignore any subsequent disk rescans for a period of roughly 30 seconds, in some cases possibly even up to 5 minutes!

So, what happens is (simplified):

  • A shadow copy is deleted, which results in a disk rescan. Everything works fine and the deletion is processed normally.
  • Some seconds later, another shadow copy is deleted. This also results in a disk rescan, but since storport.sys is still in its cooldown period, it'll silently ignore this second rescan.
  • The LUN of this second shadow copy is now no longer visible to the system, but since storport.sys ignored the rescan, Windows still thinks the LUN is there. Since Windows did unmount the volume successfully, the LUN is marked offline. This causes some components to get confused, for example VDS.
  • After the storport.sys cooldown has expired, any disk rescan will clear out the offline LUN.

Note that deleting several shadow copies in one go works just fine, e.g. using the "delete shadows set <xxx>" or "delete shadows all" command in diskshadow.exe. This will not trigger the problem: VSS will process the entire list of shadow copies, and only then is one single disk rescan performed to clear out all LUNs underlying the entire set of deleted shadow copies.
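
The sequence above can be illustrated with a toy model. This is not how storport.sys is actually implemented (the cooldown is undocumented); it is just a sketch that assumes a fixed 30-second window measured from the last rescan that was honored, with the function name simulate_rescans invented for the example.

```python
def simulate_rescans(request_times, cooldown=30.0):
    """Return one boolean per rescan request: True if it is honored.

    Simplified model (an assumption, not documented behavior): an honored
    rescan starts a cooldown window, and any request that arrives inside
    that window is silently dropped.
    """
    honored = []
    last_honored = None
    for t in request_times:
        if last_honored is None or t - last_honored >= cooldown:
            honored.append(True)
            last_honored = t
        else:
            honored.append(False)
    return honored


# Two deletes 10 s apart: the second rescan is dropped, so its LUN
# lingers as an offline disk.
print(simulate_rescans([0, 10]))      # [True, False]

# A manual rescan at t=45 s falls outside the window and goes through.
print(simulate_rescans([0, 10, 45]))  # [True, False, True]
```

In this model a single batched delete corresponds to a single request, which is why "delete shadows set" and "delete shadows all" never run into the cooldown.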



The workaround is simple: leave at least five minutes between consecutive delete operations.
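
If the deletes are driven from a script, the spacing can be enforced there. Below is a minimal sketch in Python, not a tested production tool: delete_with_spacing and the delete_one callback are hypothetical names, and delete_one stands in for whatever actually removes a single shadow copy in your environment (for example, a wrapper around diskshadow.exe).

```python
import time

MIN_GAP_SECONDS = 5 * 60  # the five-minute spacing from the workaround


def delete_with_spacing(shadow_ids, delete_one, min_gap=MIN_GAP_SECONDS,
                        sleep=time.sleep, clock=time.monotonic):
    """Call delete_one(shadow_id) for each ID, at least min_gap seconds apart.

    delete_one is a caller-supplied (hypothetical) function that deletes a
    single shadow copy. sleep and clock are injectable so the spacing logic
    can be exercised without actually waiting.
    """
    last = None
    for shadow_id in shadow_ids:
        if last is not None:
            # Wait out whatever is left of the gap since the previous delete.
            remaining = min_gap - (clock() - last)
            if remaining > 0:
                sleep(remaining)
        delete_one(shadow_id)
        last = clock()
```

The gap is measured from the moment the previous delete finished, so a slow delete does not shorten the wait before the next one. Where possible, deleting the whole set in one go remains the simpler option.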