Veeam Backup of DC-01 Failed
So I was checking on some backups I have running in Veeam Community Edition and noticed the backups for DC-01 have been failing for the past week with the following error:
12/6/2022 1:09:34 PM :: Processing DC-01 Error: Data error (cyclic redundancy check).
Failed to read data from the file [Z:\Hyper-V\Virtual Hard Disks\DC-01.vhdx].
Failed to upload disk. Skipped arguments: [shadowSpec>];
Agent failed to process method {DataTransfer.SyncDisk}.
Exception from server: Data error (cyclic redundancy check).
Failed to read data from the file [Z:\Hyper-V\Virtual Hard Disks\DC-01.vhdx].
Unable to retrieve next block transmission command. Number of already processed blocks: [4093].
I was still waking up when I noticed, and I started to run a chkdsk on DC-01. After about 20 minutes I was a little more awake and realized how foolish that was. The failure to read the VHDX file was occurring on the host, not in the virtual machine itself or on the Veeam backup server. It's failing on HV-01. So I switch over to HV-01 and realize the VM is being stored on a hard drive that has started showing bad, unreadable sectors. Ugh, that's so lame! So now my domain controller with the FSMO roles has no recent backups and is living on a failing hard drive with unreadable sectors. DC-01 is presently running, but I worry whether it will boot back up after a reboot. I try to migrate the virtual machine to another physical disk, but it fails due to a read error. I get the same read error when I try to migrate it to a different host. So what are my options right now?
I could try to take a snapshot, migrate it, and boot up from that on a different host.
I could run a chkdsk on HV-01's hard drive to see if it will fix the read error with DC-01.
I could decommission DC-01 and just build a new domain controller from scratch.
I don't feel like putting time into fixing something that may or may not be fixable. This is one of the strengths of virtualization: when something breaks, you can just recreate the virtual resources. So that's the route that I will go.
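For anyone wanting to confirm this kind of host-side disk trouble before giving up on a drive, here is a rough sketch of what can be checked from PowerShell on the host. This is not necessarily how I spotted the bad sectors; it assumes the drive is visible to Windows as a physical disk, and the reliability counters may come back empty if the disk sits behind a RAID controller that does not expose them.

```powershell
# Run in an elevated session on the Hyper-V host (HV-01 in this post).

# List physical disks with the health Windows reports for them.
Get-PhysicalDisk |
    Select-Object FriendlyName, MediaType, HealthStatus, OperationalStatus

# Pull reliability counters. Read-error counts are only populated
# when the drive and controller expose them.
Get-PhysicalDisk | Get-StorageReliabilityCounter |
    Select-Object DeviceId, ReadErrorsTotal, ReadErrorsUncorrected, Temperature

# Look for recent bad-block events from the disk driver (event ID 7).
# SilentlyContinue keeps the command quiet when no such events exist.
Get-WinEvent -FilterHashtable @{ LogName = 'System'; ProviderName = 'disk'; Id = 7 } `
    -MaxEvents 20 -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Message
```

A run of event ID 7 entries pointing at the disk that holds the VHDX lines up with the CRC errors Veeam reported.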
The first and most important step for the domain is to transfer FSMO roles from DC-01 to DC-02.
I run Get-ADDomain and Get-ADForest on both DCs and confirm they both show DC-01 holding all five FSMO roles. So now I can run the command Move-ADDirectoryServerOperationMasterRole -Identity "DC-02" -OperationMasterRole PDCEmulator,RIDMaster,InfrastructureMaster,SchemaMaster,DomainNamingMaster and I confirm that the move command completes. I run another round of Get-ADDomain and Get-ADForest on both DCs and confirm they both now recognize DC-02 as the holder of all five FSMO roles. Cool. While I worked through moving the FSMO roles, I had a thought that DC-01 might be more inclined to move to a different host if it is turned off and the virtual disk is not in use. So I decide to give it a try, and if it fails, no biggie; I can continue decommissioning it and spinning up a new DC.
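The FSMO check and transfer described above can be sketched roughly as follows. It assumes the ActiveDirectory module is available (RSAT or a session on a DC) and that DC-01 is still online, so this is a graceful transfer and not a seizure. The QueriedDC column is just a label I add for readability.

```powershell
# Requires the ActiveDirectory module (RSAT, or run on a domain controller).
Import-Module ActiveDirectory

# Ask each DC who it believes holds the five FSMO roles.
function Get-FsmoView {
    foreach ($dc in 'DC-01', 'DC-02') {
        $domain = Get-ADDomain -Server $dc
        $forest = Get-ADForest -Server $dc
        [pscustomobject]@{
            QueriedDC            = $dc
            PDCEmulator          = $domain.PDCEmulator
            RIDMaster            = $domain.RIDMaster
            InfrastructureMaster = $domain.InfrastructureMaster
            SchemaMaster         = $forest.SchemaMaster
            DomainNamingMaster   = $forest.DomainNamingMaster
        }
    }
}

Get-FsmoView   # before: every role should point at DC-01

# Graceful transfer of all five roles to DC-02 (prompts for confirmation).
Move-ADDirectoryServerOperationMasterRole -Identity 'DC-02' `
    -OperationMasterRole PDCEmulator, RIDMaster, InfrastructureMaster, SchemaMaster, DomainNamingMaster

Get-FsmoView   # after: every role should point at DC-02
```

If the old role holder were already dead, the same cmdlet with -Force would seize the roles instead, which is a one-way street for that DC.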
I move forward with shutting DC-01 down. I try to move DC-01 to HV-03 but still get the same read error. Oh well, it's best to just work in a new controller. From Windows Admin Center I create a new virtual machine named DC-03 hosted on HV-03. I assign it 2 vCPUs and 4GB of memory. I attach my Windows Server 2022 test ISO and power on the new virtual machine.
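I did this step in Windows Admin Center, but a roughly equivalent PowerShell sketch looks like the following. The VHDX path, ISO path, disk size, and switch name are placeholders I made up for illustration, not my actual values, and the Generation 2 choice is an assumption as well.

```powershell
# Hypothetical values: adjust paths, disk size, and switch name to your host.
$vmName  = 'DC-03'
$hvHost  = 'HV-03'
$vhdPath = 'D:\Hyper-V\Virtual Hard Disks\DC-03.vhdx'
$isoPath = 'D:\ISO\WindowsServer2022.iso'

# Create the VM with 4 GB of startup memory and a new virtual disk.
New-VM -ComputerName $hvHost -Name $vmName -Generation 2 `
    -MemoryStartupBytes 4GB -NewVHDPath $vhdPath -NewVHDSizeBytes 60GB `
    -SwitchName 'External'

# Assign 2 vCPUs.
Set-VMProcessor -ComputerName $hvHost -VMName $vmName -Count 2

# Attach the install ISO and boot from it first.
$dvd = Add-VMDvdDrive -ComputerName $hvHost -VMName $vmName -Path $isoPath -Passthru
Set-VMFirmware -ComputerName $hvHost -VMName $vmName -FirstBootDevice $dvd

# Power on the new virtual machine.
Start-VM -ComputerName $hvHost -Name $vmName
```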
Powering on DC-03 ends up bringing the entire HV-03 host to its knees. Windows Admin Center and Hyper-V Manager are both unresponsive when trying to access HV-03. From PowerShell I try a Get-VM command, which churns for a few seconds but also becomes unresponsive. Ctrl + C does not work. I have to force close the PowerShell window. I try a Stop-VM -Name DC-03 command, which also churns for a few seconds and then becomes unresponsive. It's like the whole hypervisor has crashed, because I notice other virtual machines on HV-03 are also unresponsive. I try to connect to iDRAC to perform a reboot of the host, but iDRAC won't even load. What the heck did I do?
I ended up having to get out of my chair and physically power off HV-03. How 90s of me. Whoa, it is taking an extremely long time to come back up. Extra long, even for a Dell server. Did I kill the whole thing? I walk back to the server to see what it's doing, and while I'm walking from my desk to the server it completes booting up. That's something at this point!
DC-03 still shows as running, but it is acting weird, as are all the virtual machines. I try again to stop DC-03, but I can't even tell if the command is going through; the system is that unresponsive. It seems like it is now stuck in the turning off status but I can't confirm that