WHS: A Series of Unfortunate Events
by Ryan Smith on March 13, 2008 12:00 AM EST
Posted in Ryan's Ramblings
When it rains, it pours, and sometimes you get hit by lightning too, which will really ruin your day.
Since very late last year, Microsoft has been facing an issue with Windows Home Server where under certain conditions files on a server’s shares could become corrupt. The severity of the situation is pretty immense and the situation straightforward: nothing should be getting corrupt on a file server, otherwise it’s a pretty useless file server. Since the initial report Microsoft has been attempting to reproduce the issue in order to fix it, and finally this week they have announced that they have fully identified the problem, its causes, and what needs to be done to fix it.
It just about couldn’t get any worse.
Before we jump too far ahead, it’s probably important to quickly go over what technology makes Windows Home Server unique, as this relates directly to the culprit. As we discussed in our WHS preview, WHS uses a new Microsoft technology called Drive Extender that sits on top of the file system. Drive Extender is responsible for pooling together all the hard drives in a server to function as one contiguous drive, while at the same time providing redundancy through automatic duplication. In practice it’s JBOD + RAID 1 implemented above the file system rather than below it as in a real RAID implementation.
When a file is transferred to a WHS share (and the server has more than 1 hard drive), the Drive Extender software is responsible for relocating it and, if it’s marked for duplication, making sure there are an appropriate number of distributed copies of the file. How this is done gets rather gritty, but for now we’ll leave the explanation at this: it’s a rather ingenious implementation of Microsoft’s shadow file technology, along with the sparse file ability of NTFS, which is manipulated into using sparse files as a way to implement symbolic links on a file system that doesn’t (or rather didn’t at the time) support them.
With that out of the way, why is the situation so grim? We’ve been talking with our sources trying to find out what exactly is going on, with some success. What follows is a mixture of things we know for sure, along with some educated guessing based on what sources will tell us; no one has so far been willing to fully explain the technical details of the problem nor completely confirm our guesses, but with the information we have we believe our guesses to be correct.
We do know for sure that the problem is in the Drive Extender software, which would make sense given that the problem is unique to WHS. What this means is that the corruption issues only extend to situations where the Drive Extender software is used: file shares on a system with more than 1 hard drive. Client backups are unaffected in all cases, and systems with only 1 drive are unaffected because Drive Extender is not used. Furthermore the issue is only with writes and not reads, so reading data is safe.
Finally, the problem is only with so-called “incremental” file writes, that is, rewriting part of an existing file or appending additional data to it (aka file edits); full file writes are also unaffected. This is why in Microsoft’s own notes on the matter, the only applications that trigger the corruption issue are applications that explicitly and frequently use incremental writes; Outlook for example uses a flat-file database to store mail, and Microsoft’s photo gallery application rewrites the metadata of a file in certain situations. This is great news because it means anything that does a full file write is unaffected, such as copying a file to a share or any number of applications (e.g. Notepad) that simply replace a file when saving rather than rewriting any part of it. The situation could have been worse for Microsoft if it affected full file writes too.
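To make the distinction concrete, here is a minimal Python sketch of the write patterns described above. The function names (`full_write`, `incremental_write`, `append_write`) are our own and purely illustrative; this is ordinary file I/O and says nothing about how Drive Extender treats either case internally.

```python
import os
import tempfile


def full_write(path: str, data: bytes) -> None:
    """Replace the whole file, as a file copy or a simple 'save' does."""
    with open(path, "wb") as f:  # "wb" truncates the file, then writes it all
        f.write(data)


def incremental_write(path: str, offset: int, data: bytes) -> None:
    """Overwrite part of an existing file in place (a file edit)."""
    with open(path, "r+b") as f:  # "r+b" keeps the existing contents
        f.seek(offset)
        f.write(data)


def append_write(path: str, data: bytes) -> None:
    """Add data to the end of an existing file (also an incremental write)."""
    with open(path, "ab") as f:
        f.write(data)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "example.bin")
        full_write(path, b"hello world")         # whole file written at once
        incremental_write(path, 6, b"WORLD")     # only bytes 6..10 change
        append_write(path, b"!")                 # file grows by one byte
        with open(path, "rb") as f:
            print(f.read())
```

Only the second and third patterns, where an existing file is modified rather than replaced, match what the article calls incremental writes.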
One thing we have been trying to ascertain is how many users are affected by this corruption bug, with some success. A number of WHS-powered servers that have been sold are the lower-end models with only a single drive, automatically disqualifying/saving them from having issues. Far fewer servers use multiple drives, and then there are an unknown number of OEM copies of the OS sold to enthusiasts with unknown configurations. On systems that are susceptible to the issue, Microsoft seems confident that only a handful of them have or ever will experience the corruption issue, and based on the flaw that causes it this seems a safe bet. Our best guess is that among historic blunders the percentage of users affected is similar to that of the Pentium FDIV bug; far too many to be comfortable but few enough that most people will likely never notice the issue.
So if the estimated number of victims is so low and the problem confined to file edits, why do we still say the situation couldn’t get any worse? Because the flaw is not a bug in the code, it’s a fundamental flaw in the algorithms used by Drive Extender. We have with particular interest been trying to track down a precise explanation of what causes the corruption problem and this is where the guessing starts to come in.
The lynchpin in the problem is the Drive Extender technology, the DEmigrator process in particular. Something is going wrong when an incremental write is being made to a file that DEmigrator is already writing to, resulting in garbage being written instead. At first we suspected it was an issue with how WHS handles file locks, but upon further review this seems unlikely. Instead there appears to be some kind of race condition with the file write itself, in which both writes somehow make it to the OS write cache simultaneously and the resulting bad cache is subsequently written to disk. This explanation would fully account for the low incidence of the problem (DEmigrator already needs to have the victim file open for writing) and Microsoft’s troubles identifying the problem.
The worst part of this however is that this race condition is a fundamental flaw in the Drive Extender software; it’s not simply a bug where someone typed the wrong thing in for a line of code. Microsoft does not completely explain how their Drive Extender technology works so we do not have any inner details on what these algorithms are, but a race condition is particularly worrisome because race conditions are among the hardest issues in computer science to account for and correct. To put things in perspective, we have the following analogy:
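For readers unfamiliar with race conditions, the following Python sketch shows the general shape of one. It is a deterministic simulation of a generic stale write-back, not a reconstruction of Drive Extender: the `run` function, the 8-byte block, and the "migrator" and "client" steps are all hypothetical stand-ins, and the real failure described by our sources (garbage data rather than a lost edit) may look quite different.

```python
def run(interleaved: bool) -> bytes:
    """Simulate a background migrator and a client edit touching one block.

    Everything here is hypothetical; it only illustrates why the order of
    two unsynchronized writers matters.
    """
    block = bytearray(b"AAAAAAAA")  # stand-in for a cached file block

    def migrator_read() -> bytes:
        return bytes(block)  # migrator snapshots the block to move it

    def migrator_write(snapshot: bytes) -> None:
        block[:] = snapshot  # migrator writes its snapshot back

    def client_edit() -> None:
        block[2:4] = b"XY"  # client performs an incremental write

    if interleaved:
        # Unlucky timing: the edit lands between the migrator's read and
        # write, so the stale snapshot overwrites the client's change.
        snapshot = migrator_read()
        client_edit()
        migrator_write(snapshot)
    else:
        # Serialized: the migrator finishes before the client edits.
        snapshot = migrator_read()
        migrator_write(snapshot)
        client_edit()
    return bytes(block)


if __name__ == "__main__":
    print(run(interleaved=False))  # the edit survives
    print(run(interleaved=True))   # the edit is silently lost
```

The same two operations produce different results depending only on timing, which is why problems of this kind are rare in the field and hard to reproduce in a lab.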
If you’re familiar with the concept of sorting in CompSci then you should be familiar with the notion of sort stability. If data has previously been sorted by another field, a new sort with a stable sort will maintain that order among items that compare equal; an unstable sort will not. Ultimately if you are using an unstable sort you cannot just modify it to be stable; you have to replace it with a stable sort. This is the kind of issue Microsoft is facing.
The faulty algorithm in the Drive Extender software is effectively an unstable sort. The WHS development team has to completely rewrite part of the Drive Extender software to use new, safe algorithms that will not suffer from this assumed race condition. This is what makes the situation so grim: there is little worse than having to completely rewrite a piece of shipping software to fix a bug, both for issues of morale and time.
This kind of rewrite takes time to accomplish and the QA process takes even more time. Microsoft has estimated that the fix will not be ready for another 3 months (June), and while it’s unlikely for this to take less time, it can certainly take more if the QA process finds more problems. What little good news there is here is that the dev team has already identified a possible fix and has started testing it, so it is in fact possible to correct the issue and the dev team has a good enough understanding of the issue to create a fix.
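A small Python sketch of the stability idea, for those who want to see it in action. Python's built-in `sorted` is documented as stable; the `selection_sort` helper and the sample records are our own illustrative additions, with selection sort chosen simply because it is a well-known unstable algorithm.

```python
def selection_sort(items, key):
    """Classic swap-based selection sort: correct, but NOT stable."""
    items = list(items)
    for i in range(len(items)):
        smallest = i
        for j in range(i + 1, len(items)):
            if key(items[j]) < key(items[smallest]):
                smallest = j
        # Swapping can jump an item past others that have an equal key.
        items[i], items[smallest] = items[smallest], items[i]
    return items


# Records already sorted by name; now we sort them by the number.
records = [("alice", 2), ("bob", 2), ("carol", 1)]

stable = sorted(records, key=lambda r: r[1])
unstable = selection_sort(records, key=lambda r: r[1])

if __name__ == "__main__":
    print(stable)    # alice still precedes bob among the 2s
    print(unstable)  # bob now precedes alice: the earlier order is lost
```

Both results are correctly ordered by number; only the stable sort also preserves the earlier name ordering among ties.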
Until that fix arrives however, this puts WHS in a perilous position. As a v1 product breaking new ground WHS already has plenty of challenges. The media will eat this up (and we’re just as guilty) and this will tarnish the product’s name for the rest of its life; customers don’t need to understand an issue to understand that a product is imperfect and that they should stay away from it. Yet data corruption is a serious issue that isn’t acceptable and can’t be ignored.
Perhaps the worst bit however is that as an OEM-only product, Microsoft is not exerting any real control over what the OEMs do about the issue until the corruption problem is fixed. As of right now retailers are still selling OEM servers with 2+ drives (making them susceptible to the bug) and computer enthusiast retailers are still selling the OS itself, all with no notice about this bug. WHS is a good product where plenty of functionality can still be used even with the presence of the bug (e.g. backups), but we have serious problems with it still being offered for sale given these problems. WHS is already heavily tarnished due to this bug; there’s no (okay, some) shame in cutting one’s losses and halting all sales of the OS until the bug is fixed, even if it won’t affect most users.
Ultimately it’s a damn shame to see something like this happen; no one is going to be a winner. Windows Home Server will be fixed, but only after a lot of grief for the developers and a lot of concern for server owners. Thankfully current server owners can take steps to prevent the corruption issue entirely, but at a cost of functionality, and we don’t doubt some people will still feel insecure about their data even after taking those steps. For the time being WHS is dead in the water; it’s a promising product that is not suitable for further sale given the potential severity of the bug. It also undermines a great deal of confidence in Microsoft that will take some time to recover.
Finally this also brings into further question just how long of a shelf life WHS will have anyhow. It’s a poorly kept secret that Microsoft is already working on the next version based on the Longhorn kernel (Drive Extender practically begs for prioritized I/O) and we have always expected it to come fairly soon once Windows Server 2008 was completed. WHS v1 may be done for entirely, even once Microsoft fixes the corruption issue and continues/resumes selling it. If by the time the issue is fixed we’re looking at less than a year before the next WHS, it may not be worth buying WHS v1 at all. Then again, Microsoft is undoubtedly carrying much of the WHS technology and software forward for the next version, so this bug could very well push WHS v2 back if the bug was getting carried forward too.
17 Comments
Itsamuppett - Thursday, April 17, 2008
So close, and yet so far.
I bought an OEM copy as soon as it came out. I re-purposed my old gaming machine platform and built it into an HTPC case using quiet fans etc. AMD 64x2 4400+, DFI RD200 MoBo, 3x Samsung 500Gb, DVDRam, 2Gb OCZ400.
I had a big fight on setup as I wanted to use the DFI hardware RAID in order to RAID5 the 3 500Gb discs. Every time I configured it, WHS refused to identify it on installation. If I loaded XP or Vista it would install over the top without a fight, but WHS would not see it. I had to unbundle and load the core OS as JBOD.
As many people say, good n00b friendly interface, easy setup, backup and recovery great feature as there are a lot of laptops around the house nowadays. Remote access and domain hosting very easy to setup and very valuable as I travel a lot (Thank you slingbox as well)
JBOD and "application level software RAID" became annoying as I copied all the local content over to the duplicated folders on the server. The copy speed sucked and the number of times I had to see if the thing was still alive (I thought it had detached and gone to sleep) and found "storage balancing" on the bottom line began to make me frustrated. This thing has enough horsepower and the statement of MS is that it should be capable of running on older retired hardware.
Then it all went furry. I backup a Dell 1710 and 1330 as our two house machines which have important material. We needed to restore the 1330 in order to get a file that was overwritten and it failed badly. I started to look at the web and found an excellent site
http://www.wegotserved.co.uk/
this explained a lot of the issue you very concisely cover in your article. So I looked at a number of my other files as a test (170Gb of personal music uploaded through WMP) and I had added comments and rankings to them. Quite a lot of errors and bad files.
I have continued to use the system, the utility that allows you to prune it back to a single drive and de-mount the other drives works well. I dropped it to a single 500Gb and added the others to the new gaming system as a ICH-9R RAID array and copied all of the files over. It is now the default house server using VistaU64. Let's just agree it is faster, easier and more effective. It does not have the remote access and backup but that costs about £20 in software to add?
I am disgusted by MS stance, in general I think they do a pretty good job and whilst it is easy to knock them (I was an Apple field engineer when the Mac was released and worked for Sun for 3 years) they do deserve respect and I can think of companies that do a lot worse.
So you avoid telling us. You continue to ship the thing. You don't assist your OEM (unless it was on the quiet.....) You don't offer any form of guidance. When the customer does find out he has to dig deep to find out the reality and then gets the wake up call.
I have a software package that does not work. I remember the old joke about MS taking the mickey out of GM for the slow rate of development of the motor car and then GM firing back about the BSOD issue at 90. (lets not mention the work of Mr Nader on an early example) MS just provided a swing axle operating system at this point and do deserve an attitude adjustment.
I am not saying litigation is the answer. I would like to either return the software or be given an SME S2008 licence, they offered me a home server and I don't have one that is trustworthy. If you want to regain respect in a situation like this you should support the customer and offer a positive commitment to improvement. I think a promise to provide V2, an interim "offer" of an alternative if the customer does want to progress it (admittedly the OEM version would probably take this but the pre-builts would have problems) would be a decent gesture.
I did recommend it to many of my friends - when will I learn.
Come on MS pull your finger out and show some respect to people who pay your wages.
WT - Monday, March 17, 2008
OfficeMax was selling WHS with a free 500gbHD, but it looks like the issue is occurring with 2 HD setups, so that free HD would not be a wise addition to the WHS until this bug is corrected (June ?? C'mon MS !!)
Either way, WHS itself is a great idea and the software itself is what makes it powerful, but the users afflicted with this bug are suffering with no end in sight.
cbutters - Sunday, March 16, 2008
Great article, I'm glad I'm not alone seeing this corruption.
I commented about this corruption bug in my article I wrote a few weeks ago on eXoid.com. I can't believe this wasn't caught earlier. You would think this would be a lawsuit waiting to happen.
ajdavis - Sunday, March 16, 2008
You mean a lawsuit like hard drive manufacturers face? Computers are expected to, at some point, lose/corrupt/completely obliterate data. Users are expected to have backups. Period.
FireTech - Sunday, March 16, 2008
Very nice article Ryan, which I'm sure now makes the situation seem a lot less 'scary' for the majority of current WHS users.
ianken - Friday, March 14, 2008
...I have yet to hit this, been running since beta.
Backup - no problems
Restore - no problems
RDP gateway/Remote - no problems
Serving files - no problems
Copying content to the server - no problems
I never edit on the server, even before this was identified.
The fact that the WHS is being so open about this and up front about how hard it is to address is refreshing.
As to drive extender being "overly complicated." Well, it does cool things. Online, on the fly storage aggregation without having to suffer long array build times is not simple. That said, I disable all file duplication and rely on an Areca controller and RAID5 for recovery. File mirroring is a waste of space when you have RAID5 and 6 as viable options.
To be honest my biggest beef: you cannot boot a client off the restore CD if the optical drive is SATA, even if a SATA drive is plugged in and you're booting from a PATA device it will crash. Lame.
However the painless backup/restore process is awesome.
Dsjonz - Sunday, March 16, 2008
My experience matches yours, except for this:
"To be honest my biggest beef: you cannot boot a client off the restore CD if the optical drive is SATA, even if a SATA drive is plugged in and you're booting from a PATA device it will crash. Lame."
I boot one of my client PCs with the WHS restore disk from a SATA DVD-RW with no problem.
System: LG GSA-H26N DVD-RW, Gigabyte P35-DQ6, QX9650, Vista Ultimate
gpaul - Saturday, March 15, 2008
I was at first very concerned about this issue. Then I came to realize that due to the nightly backup most of the files I had always kept on my W3K server so they were safe were now just as safe on my WS. The only reason in a home environment to have a file 'shared' on the server is for multi-user live access needs. I don't know of any in my environment. I keep everything on the WS and let the WHS handle my long term archive, mp3, wav, and backup needs. It does it well. I wish this bug didn't exist but I also know it will be fixed and in the meantime I'm not impacted by it.
TheBeagle - Friday, March 14, 2008
This whole unfortunate situation with WHS reminds me of the TV advertising of Paul Masson (you know, the wine guy) a few years ago. They used to run an ad that said, "We will sell no wine before its time!" Indeed they even got none other than Orson Welles to do that TV piece. I sure wish the WHS folks at Microsoft had paid a little better attention to that ad.
Of all the shortcomings that a new software server product might encounter, data corruption is by far the worst. That type of a bug just destroys public confidence in a product, a large portion of which will likely never be fully regained. It also made a fool out of a number of computer pundits that unequivocally endorsed the WHS product.
Now just in case anyone wonders if I know what I'm talking about, I run a computer business and had planned on selling the heck out of WHS. In fact I still hope to do so. But NOT with WHS Rev.1. It is a plain fact of life that we desperately need a re-born, distinctly new version (Rev 2.0) to sell (and use) in order to overcome the terrible effects of this data corruption fiasco.
Oh, and by the way Microsoft, if you have a collective brain in your head, you darn well ought to do the right thing by offering a free upgrade to Rev.2 for anyone who bought Rev.1. At least that way the significant number of early adopters won't become a mortal enemy to WHS and kill the thing before it ever gets a chance to resurrect itself. That's just some food for thought. But you sure ought to give it some serious consideration.
Ares2600 - Friday, March 14, 2008
I agree race conditions are pretty rough, but it feels like this is a little over-dramatic. Using the sort example is kind of a worst case scenario type thing. Generally they can be solved with some additional synchronization (and hopefully preserve performance). Lock up a resource rather than fight for it (hence the 'race' metaphor).
Since it hits so close to home with respect to the 'great new features' category of WHS I'm sure the dev responsible won't be in a great position, but playing the odds my guess is it will be handled in short order.