A week before Ruxcon, I attempted to log into the virtual machines that we had set up. The first one worked fine; the other two hung during the SSH login.
Out of curiosity, I ran dmesg on the first one to see if it could give me any indication as to why the others were not responding.
And it did. Unfortunately. dmesg showed hard drive errors. Since we did not have proper backups in place, we had to attempt to recover the data from the hard drive. Enter ddrescue.
After the initial 40 gigabytes or so copied from the hard drive, the errors would show up again and the machine had to be rebooted. Rinse, repeat, same results. Around that time, another Ruxcon staff member rang me and suggested I try ddrescue's reverse copy mode to see if that would work.
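For anyone following along, the two passes look roughly like this (the device and file names here are placeholders, not the ones we actually used):

```shell
# Forward pass: copy front-to-back, skipping unreadable areas.
# The map file records progress, so the run resumes where it left
# off after each reboot instead of starting over.
ddrescue /dev/sdb rescued.img rescue.map

# Reverse pass (-R): read the disk back-to-front, which can recover
# data sitting behind a bad region that stalls forward reads.
ddrescue -R /dev/sdb rescued.img rescue.map
```

The map file is what makes the rinse-and-repeat cycle bearable: each reboot only costs you the time to get back to the bad spot, not the 40 gigabytes already copied.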
This recovered an extra 140 gigabytes, leaving roughly 20 gigabytes that could not be rescued. So, initial data recovery done, time to assess the further damage. After using vmfs-fuse to mount the recovered data, I tried to copy the VMDKs off the disk. All three gave I/O errors at some point (due to VMFS corruption).
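A sketch of that mounting step, assuming vmfs-tools is installed (the loop device, partition, and guest paths are illustrative):

```shell
# Attach the rescued image to a loop device, scanning for partitions.
losetup -fP --show rescued.img      # prints the device, e.g. /dev/loop0

# Mount the VMFS volume via FUSE and try to pull the disk images off.
mkdir -p /mnt/vmfs
vmfs-fuse /dev/loop0p1 /mnt/vmfs
cp /mnt/vmfs/guest1/guest1.vmdk /srv/recovered/   # this is where the I/O errors hit
```

vmfs-fuse only needs the volume and a mountpoint; the trouble starts once you read through the corrupted extents.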
So I ddrescue’d the VMDKs to get the data off those, then used qemu-nbd to mount them. Initially I didn’t want to fsck anything before seeing what was available, but everything was a mess. After running fsck on the various VMDKs, I got mixed results.
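The qemu-nbd step goes roughly like this (device numbers, paths, and the ext4 guess are assumptions on my part):

```shell
# Load the network block device driver with partition support.
modprobe nbd max_part=8

# Export the rescued VMDK as /dev/nbd0 and mount its first partition
# read-only, so poking around can't make things worse.
qemu-nbd --connect=/dev/nbd0 guest1.vmdk
mount -o ro /dev/nbd0p1 /mnt/guest1

# Once you've copied off what's readable, let fsck have a go
# (assuming an ext4 guest filesystem here):
umount /mnt/guest1
fsck.ext4 -y /dev/nbd0p1

# Detach when done.
qemu-nbd --disconnect /dev/nbd0
```

Mounting read-only first is what let me look at what was available before committing to fsck rewriting things.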
One machine looked completely fine, except there was nothing where the levels should have been, and nothing in /lost+found. Another had everything in /lost+found, scattered in a variety of mixed-up places.
The last machine was mostly alright. My first impression was that everything was fine: I copied the data off, started to rebuild things, and… errors. It turned out a lot of files contained nothing but NULs, and some directories were missing. Luckily, I was able to piece together a mostly up-to-date version from /lost+found.
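The NUL-filled files are easy to spot once you know to look for them. A quick sketch of that kind of sweep (the directory is a stand-in, and stripping NULs with tr is just one way to do it):

```shell
# List files whose contents are entirely NUL bytes -- the inode and
# its size survived, but the data blocks behind it didn't.
DIR=${DIR:-/mnt/recovered}
find "$DIR" -type f -size +0c 2>/dev/null | while read -r f; do
    if [ "$(tr -d '\0' < "$f" | wc -c)" -eq 0 ]; then
        echo "all-NUL: $f"
    fi
done
```

Running something like this before declaring victory would have saved me the false start of rebuilding on top of hollow files.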
All that effort could have been avoided had I kept proper backups and deployed directly to the VMs instead of working on them.