Article written

  • on 24.09.2009
  • at 09:19 AM
  • by Seer

When Home Servers Go Tits-up!

I have spent quite some time on several occasions on here praising the awesomeness that is Windows Home Server. So it’s only fair that when the shit hits the fan, I spend a similar amount of time cursing the entire ancestral history of everyone involved with Windows Home Server!

My machine is the fantastic (if terribly named) Tranquil PC T7-HSG with a 500GB hard drive of its own and an attached USB 1TB WD Green Power hard drive. This setup has been running around 9 months now without a single issue. I have not even needed to log onto the machine since I installed it around 10 months ago, as it quietly kept itself up to date and served my storage and backup needs. During that time we have had probably five or six power outages, and each time it’s just been a case of switching it back on and it resumed its duties faithfully without a murmur of discontent.

BUT, in the last week or two I have had some issues which have begun to snowball a little out of control. Occasionally I would request a file from its storage and it would not be available due to an “I/O error”. After a couple of days of assuming this was a new ‘feature’ of my new Snow Leopard installation, I did the unthinkable and actually went down to the workroom to physically check the machine!

What I noticed was that though the USB drive had its power light on, there seemed to be no disk activity. I cycled the power and lo and behold the problem was fixed… for about 24 hours. I continued this monotonous pattern for a few days (just no time to investigate properly) until I started getting messages about file conflicts, as well as the USB drive occasionally disappearing from the storage pool. On more than one occasion I have seen messages about bad blocks and other ugly things, such as extremely slow response on explorer.exe tasks like browsing the filesystem whilst using remote desktop. So something is seriously up with this hitherto flawless machine. Furthermore the backup service refuses to start and file transfer from the WHS machine to other machines is incredibly slow (

So I needed to make time and start to investigate the root cause. Scandisk on the larger drive seems to go extremely slowly, so slowly it hadn’t got past the index scan in 8 hours (bear in mind only about 400GB of the 1TB is actually used). Getting a little panicky, I thought I would start copying the REALLY important data such as photos and family videos off the machine, but this would fail on many of the files with the same “I/O error” as earlier. I unplugged the USB drive, attached it to another PC running Windows 7 and tried to copy the files directly from the drive… again more failure. Seeing as these files are so valuable, I am using folder duplication for those shares (meaning a physical copy is kept on both hard drives), which should mean that if one drive has failed I can retrieve them from the other drive. So I tried to copy them directly from the 500GB internal drive… failure!!

So at this stage I see the problem as being one of several things:

  • One or both hard drives are failing
  • One or both of the hard drives are corrupt
  • The WHS db or ‘tombstones’ are messed up
  • The actual files themselves really are corrupt and unrecoverable!

My best option right now, according to the very knowledgeable folks at the wegotserved forums, is to try a server recovery. This is basically a reinstallation of WHS that doesn’t touch the data. In the process I will end up with a new db and all the tombstones will be rebuilt. This may be enough for me to get my important data off the machine. Then I can start using some disk utilities to try and establish the exact problems with the drives and replace them if necessary. I will embark on this project tonight and cross my fingers!

The thing that has angered me is that there is no early warning of these problems in any logs. OK, I understand that a sudden and complete failure of a drive would be impossible to warn about, but this seems like a slow corruption, which should be detectable. I should have been receiving big flashy warnings on my connected machines for a few weeks, so that I could have started recovering the data to other machines as early as possible.

I’ll keep you posted on my progress and the results in case you should hit a similar problem. In the meantime I accept the prayers of any religious denomination that my files really are not kaput!

The Solution:

So I followed the advice of the forum residents at the wegotserved forums and performed the server reinstall, which was an ugly mess. The server did reinstall, but couldn’t start the final configurations and scripts, which kept crashing out even when I did hack them into starting. This meant I was potentially in a worse situation than before the reinstall, because the 1TB external drive was now no longer in the storage pool.

But this also ended up being a blessing in disguise, as it forced my hand into removing the drives and attempting to do some real data salvage on them individually in Windows 7. I started with the internal drive and managed to copy the priority shares (Photos and Home Videos) without a problem. This is odd, I thought: why was WHS unable to copy these files to another machine? But I was happy to have the main treasures recovered.

I then set to work on the rest of the shares such as the User shares, Software etc. I found that very few of these files actually resided on the internal drive, but I copied what was there without a problem. Then I plugged in the 1TB drive and the problem became crystal clear. This drive is broken, broken, broken! As soon as it was plugged in, Windows 7 popped up a big dialog box telling me that this drive was failing and should be backed up immediately. I attempted a scandisk /r and got nowhere in 16 hours (10% progress), and the drive was intermittently disappearing from explorer. No doubt this drive has some serious surface damage, hence the bad blocks and unreadable files. But why didn’t WHS tell me this?

It turned out that I was able to retrieve about 50% of the files from the 1TB drive, and those that I couldn’t retrieve were not so critical, so overall I was fairly happy with what I recovered. That drive (a Western Digital Green Power) will be RMA’d, but the lessons learnt have been almost worth the hassle.
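For anyone facing the same kind of salvage job, here is a minimal Python sketch of the "copy what you can, note what you can't" approach. It is not the tool I used (I just dragged folders around in Windows 7); the function name `salvage_tree` and the paths in the usage note are made up for illustration, and it assumes the failing drive is still mounted and readable at the directory level.

```python
import os
import shutil


def salvage_tree(src_root, dst_root):
    """Copy every readable file under src_root into dst_root.

    A file that raises OSError (for example an I/O error on a bad block)
    is recorded and skipped, so one bad file does not abort the whole run.
    Returns two lists: copied source paths and (path, error) pairs.
    """
    copied, failed = [], []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        # Mirror the source directory layout under the destination.
        rel_dir = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel_dir))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            try:
                shutil.copy2(src, dst)  # copies data plus timestamps
                copied.append(src)
            except OSError as exc:
                failed.append((src, str(exc)))
    return copied, failed
```

Something like `copied, failed = salvage_tree("E:/shares", "D:/rescued")` (hypothetical paths) would then give a list of files to chase up later. A copy that dies halfway through can leave a partial file at the destination, so anything in `failed` should be treated as not recovered.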

Lessons Learnt:

Folder duplication – It’s great! Without this feature enabled on my important shares I would have lost a lot of critical data. However, WHS doesn’t use the feature very well itself: whilst the faulty disk was still a part of the storage pool and I was trying to copy files that were duplicated, it was obviously and consistently trying to copy the versions on the 1TB drive and failing, when it had perfectly good copies on the internal drive. Why didn’t WHS automatically fall back to those files, and reduplicate them to the 1TB drive? Why did it always go to the copies that were on the corrupt drive?
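The behaviour I was expecting can be sketched in a few lines of Python. To be clear, this says nothing about how WHS actually picks a copy internally; `read_with_fallback` is a hypothetical helper and it simply assumes you know the path of each duplicate.

```python
def read_with_fallback(replica_paths):
    """Return (data, path_used, errors) from the first replica that reads cleanly.

    Replicas are tried in the order given. Each failure is recorded so the
    caller can flag the bad copy for re-duplication afterwards.
    """
    errors = []
    for path in replica_paths:
        try:
            with open(path, "rb") as fh:
                return fh.read(), path, errors
        except OSError as exc:
            # Typical case here: an I/O error from the failing disk.
            errors.append((path, str(exc)))
    raise OSError(f"all replicas failed: {errors}")
```

If the first path points at the dying drive and the second at the healthy one, the caller still gets the data, plus a non-empty `errors` list that could drive a warning and a fresh duplicate.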

S.M.A.R.T – Didn’t indicate any issues in either the BIOS report before the OS loads or in WHS, yet as soon as the faulty drive was plugged into the Windows 7 box, warnings galore in the form of the dialog that popped up and in the event logs. When I rebooted with the disk in the machine it was reported in the BIOS screens and Windows 7 insisted on scandisk before booting. Maybe this is a fault with the motherboard in the Tranquil T7, but still you would think this kind of monitoring would be a focus of a secure storage solution! If the S.M.A.R.T monitor in WHS wasn’t able to get a status it should warn me about that too!
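The "warn me when you can't get a status" idea is easy to sketch. The Python below is only an illustration and has nothing to do with WHS's own monitor: it assumes the report text comes from a tool such as smartmontools' `smartctl -H`, whose ATA output contains a line like `SMART overall-health self-assessment test result: PASSED`. Both function names are made up.

```python
import re


def classify_smart_health(report_text):
    """Classify a SMART health report as 'ok', 'failing' or 'unknown'.

    Assumes the ATA-style result line printed by `smartctl -H`.
    Any report without that line counts as 'unknown', not as healthy.
    """
    match = re.search(
        r"overall-health self-assessment test result:\s*(\S+)", report_text
    )
    if match is None:
        return "unknown"
    return "ok" if match.group(1).upper() == "PASSED" else "failing"


def should_warn(report_text):
    """Warn on a failing drive AND on a drive whose status could not be read."""
    return classify_smart_health(report_text) != "ok"
```

The design choice that matters is the last line: no status is treated as bad news, which is exactly what I would have wanted from a drive hanging off a USB enclosure.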

“Many Smaller Disks” is better – with the life-saving folder duplication feature, spreading the risk across as many disks as possible is much more secure than storage on fewer large disks, even if it may mean retrieval will be more cumbersome.
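A rough back-of-the-envelope version of that argument, in Python. It rests on a big simplifying assumption that is mine and not anything WHS guarantees: unduplicated data is spread across the pool in proportion to each disk's size. It also ignores the fact that more disks means more chances for one of them to fail. `share_at_risk` is a made-up helper.

```python
def share_at_risk(disk_sizes_gb, failed_index):
    """Fraction of unduplicated data sitting on the disk that failed.

    Assumes data is spread across the pool in proportion to disk size.
    """
    total = sum(disk_sizes_gb)
    return disk_sizes_gb[failed_index] / total


# My pool: a 500GB internal disk plus the 1TB USB disk that died.
two_disks = share_at_risk([500, 1000], failed_index=1)

# The same capacity as three 500GB disks, with one of them failing.
three_disks = share_at_risk([500, 500, 500], failed_index=1)
```

Under that assumption, losing the 1TB disk exposes about two thirds of the unduplicated data, while losing one of three 500GB disks exposes about one third.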

Finally, in 20 years of working with computers I have never, ever had a single hard disk fail, even years and years past its warranty and after a lot of bad treatment. This had lulled me into a false sense of security which very nearly cost me thousands of family pictures and gigabytes of home videos that could never be replaced! Perhaps I will look into some online backup solutions that plug into WHS to provide a further offsite backup of the really crucial stuff. This probably amounts to less than 150GB of the 1.5TB of data stored, so prices can’t be that bad… can they? :)

There are 2 comments for this post

  1. Dean says:

    Thanks for this post! I’ve been getting file conflict reports on my WHS for the past month and I’ve been ignoring them. My complacency stems from the fact that 1. My WHS has been running for the past 6 months and 2. It seems that my PC backups are still operating normally. I should really spend some time to investigate. Thanks again for the story, since it’s encouraged me to be more proactive. The file conflicts should be enough to worry me!

  2. Ron says:

    USB drives and WHS DON’T do well! I have had MAJOR issues and hours of torment. WHS likes to randomly read the drives until they corrupt themselves. Also, many small files don’t work well. I backed up video feeds that had over a million files and it always went bad. I have used internal drives and so far so good. Love to hate WHS.
