I have spent quite some time on several occasions on here praising the awesomeness that is Windows Home Server. So it’s only fair that when the shit hits the fan, I spend a similar amount of time cursing the entire ancestral history of everyone involved with Windows Home Server!
My machine is the fantastic (if terribly named) Tranquil PC T7-HSG with a 500GB hard drive of its own and an attached USB 1TB WD green power hard drive. This setup has been running around 9 months now without a single issue. I have not even needed to log onto the machine since I installed it around 10 months ago, as it quietly kept itself up to date and served my storage and backup needs. During that time we have had probably five or six power outages, and each time it’s just been a case of switching it back on and it resumed its duties faithfully without a murmur of discontent.
BUT, in the last week or two I have had some issues which have begun to snowball a little out of control. Occasionally I would request a file from its storage and it would not be available due to an “I/O error”. After a couple of days of assuming this was a new ‘feature’ of my new Snow Leopard installation, I did the unthinkable and actually went down to the workroom to physically check the machine!
What I noticed was that though the USB drive had its power light on, there seemed to be no disk activity. I cycled the power and, lo and behold, the problem was fixed… for about 24 hours. I continued this monotonous pattern for a few days (just no time to investigate properly) until I started getting messages about file conflicts, as well as the USB drive occasionally disappearing from the storage pool. On more than one occasion I have seen messages about bad blocks and other ugly things, such as extremely slow response on explorer.exe tasks like browsing the filesystem whilst using remote desktop. So something is seriously up with this hitherto flawless machine. Furthermore, the backup service refuses to start and file transfer from the WHS machine to other machines is incredibly slow (
So I needed to make time and start to investigate the root cause. Scandisk on the larger drive goes extremely slowly, so slowly that it hadn’t got past the index scan in 8 hours (bear in mind only about 400GB of the 1TB is actually used). Getting a little panicky, I thought I would start copying the REALLY important data, such as photos and family videos, off the machine, but this failed on many of the files with the same “I/O error” as earlier. I unplugged the USB drive, attached it to another Windows 7 PC and tried to copy the files over directly from the drive… again, more failure. Seeing as these files are so valuable, I am using folder duplication for those shares (meaning a physical copy is kept on both hard drives), which should mean that if one drive has failed I can retrieve the files from the other drive. So I tried to copy them directly from the 500GB internal drive…. failure!!
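For anyone in the same spot, here is a rough sketch of the kind of salvage copy I was attempting by hand. It is not a WHS tool, just a plain Python script of my own invention: the function name `salvage_copy` and both folder arguments are made up for illustration. Instead of stopping at the first “I/O error”, it carries on and keeps a list of the files it could not read, so you at least know what made it across.

```python
import os
import shutil


def salvage_copy(src_root, dst_root):
    """Copy every file under src_root into dst_root, skipping unreadable files.

    Returns (copied, failed): a list of copied source paths, and a list of
    (source path, error message) pairs for files that could not be copied.
    """
    copied, failed = [], []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        # Recreate the same folder layout under the destination.
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            try:
                shutil.copy2(src, dst)  # copies contents plus timestamps
                copied.append(src)
            except OSError as exc:
                # Read errors from a failing disk surface here as OSError.
                # Record the file and keep going with the rest.
                failed.append((src, str(exc)))
    return copied, failed
```

Bear in mind that a copy which fails halfway can leave a partial file behind at the destination, so treat anything in the `failed` list as not rescued, even if a file of that name exists on the other side.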
So at this stage I see the problem as being one of several things:
- One or both hard drives is failing
- One or both of the hard drives is corrupt
- The WHS db or ‘tombstones’ are messed up
- The actual files themselves really are corrupt and unrecoverable!
My best option right now, according to the very knowledgeable folks at the wegotserved forums, is to try a server recovery. This is basically a reinstallation of WHS that doesn’t touch the data. In the process I will end up with a new db and all the tombstones will be rebuilt. This may be enough for me to get my important data off the machine. Then I can start using some disk utilities to try and establish the exact problems with the drives and replace them if necessary. I will embark on this project tonight and cross my fingers!
The thing that has angered me is that there was no early warning of these problems in any logs. OK, I understand that a sudden and total failure of a drive would be impossible to warn about, but this looks like slow corruption, which should be detectable. I should have been receiving big flashy warnings on my connected machines for a few weeks, so that I could have started recovering the data to other machines as early as possible.
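To show what I mean by “detectable”, here is a minimal sketch of a check that could be run on a schedule. To be clear, this is my own assumption about what an early warning might look like, not something WHS does, and `find_unreadable` is a made-up name. It simply reads every file in a folder from start to finish and reports the ones the disk cannot deliver.

```python
import os


def find_unreadable(root, chunk_size=1024 * 1024):
    """Read every file under root end to end; return paths that raise an error."""
    unreadable = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    # Read in chunks so large videos do not fill memory.
                    while handle.read(chunk_size):
                        pass
            except OSError:
                # A bad block typically shows up as an I/O error mid-read.
                unreadable.append(path)
    return unreadable
```

A check like this only catches files that have already become unreadable, so it complements rather than replaces whatever health data the drive itself reports. Still, a non-empty list a few weeks ago would have been exactly the big flashy warning I was missing.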
I’ll keep you posted on my progress and the results in case you should hit a similar problem.