Double Disk Failure
If you've worked in IT - especially storage - long enough then you've probably heard the phrase "Double-Disk Failure", referring to the loss of 2 (or more) disks in a redundant RAID array.
The two most common forms of RAID - RAID-1 and RAID-5 - are designed to handle the failure of a single disk, and then run in a degraded (ie, unprotected) state for a short window until the disk is replaced. If a second disk "fails" during that window, it potentially means the loss of all of the data on the array.
During my time working at Sun Microsystems I was involved in assisting with a large number of double-disk failures for our customers - not because it's something that happens frequently, but because I was one of the people called in when it did happen. In all of that time, I don't recall a single case of what I would technically call two disks "failing" within a short period of time. Instead, all of the double-disk failures I was involved with fell into one of two categories.
- One disk fails, followed by another disk failing several weeks/months/years later. This is, quite simply, a failure of monitoring, a failure of process, or both. Having a system run with a failed disk for any more than a day is simply bad practice. Having one running for over a year (and yes, I have seen that multiple times!) is a major failing of the staff and/or the systems monitoring the array, and not something that can really be considered a fault of RAID.
- One disk fails, and during the rebuild a read error occurs on the redundant copy of the data. This is by far the most common form of double-disk failure, and there are many ways to avoid it, and normally ways to recover from it without data loss.
To understand how double-disk failures occur, we first need to understand a little more about error detection and handling in RAID systems.
Hard disks store CRC (Cyclic Redundancy Check) data alongside the data they are storing. CRC is designed to allow the system to detect that the data has changed, but it does not allow the system to recover the data - it simply flags that something is wrong. Data on disk can degrade over time, especially through a process called "bit rot" where a single bit of the data flips, in effect corrupting the entire data block. CRC will detect this corruption the next time the block is read, and instead of returning the data to the host the disk will return a "Read Error" for that block.
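The key point is that a checksum like this can detect corruption, but carries no information about how to repair it. Here's a minimal Python sketch of the idea, using a 32-bit CRC as a stand-in for whatever code the drive actually records with each sector:

```python
import zlib

# Simulate a 4 KiB block with a CRC stored alongside it, as a disk
# does for every sector it writes.
block = bytearray(b"important data! " * 256)   # 4096 bytes
stored_crc = zlib.crc32(block)

# "Bit rot": a single bit flips while the block sits on disk.
block[1000] ^= 0x01

# On the next read, the drive recomputes the CRC and compares it with the
# stored value. A mismatch proves the block is corrupt, but says nothing
# about which bit flipped - so the only possible response is a read error.
if zlib.crc32(block) != stored_crc:
    print("Read Error: CRC mismatch - this copy of the block cannot be trusted")
```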
Upon receiving a read error, the host can do one of a number of things. If the disk the error occurred on is not redundant (ie, no RAID, or RAID-0) then it has no choice but to fail the read, report the error, and give up! In most cases people would (rather needlessly) consider this a failed disk and start the process of having it replaced - and at the same time start looking for last night's backup tapes!
If the disk is protected (RAID-1, RAID-5, etc) then the system can rebuild the errored data - either by reading the other side of the mirror (RAID-1), or by reading the other data in the stripe plus the parity data and rebuilding the errored block from those (RAID-5, RAID-6). What happens next depends on the RAID software in use. Some systems will simply consider the disk that returned the read error as "failed" and offline it - leaving the good copy of the data unprotected until the disk is replaced (or a hot spare kicks in, etc). Other systems will take the good copy of the data they have now obtained, and re-write it over the "failed" block - in effect, correcting the error on that disk. Behind the scenes, the disk itself will often carry out a re-map operation - marking the part of the disk that experienced the error as bad, and instead using one of a number of spare blocks reserved for exactly this purpose.
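To make the RAID-5/6 case concrete, here's a hedged sketch of the parity arithmetic involved: the parity block of a stripe is simply the XOR of its data blocks, so any one missing or errored block can be rebuilt from the survivors. The stripe layout and block contents below are purely illustrative.

```python
def xor_blocks(*blocks):
    """XOR equal-length byte blocks together - the parity math behind RAID-5."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

# A 4-disk RAID-5 stripe: three data blocks plus one parity block.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)          # written when the stripe was created

# The disk holding d1 returns a read error for this stripe. The missing
# block is recovered by XORing the surviving data blocks with the parity.
rebuilt = xor_blocks(d0, d2, parity)
assert rebuilt == d1

# A RAID layer that corrects rather than fails would now write `rebuilt`
# back over the errored block, and the drive would quietly remap the bad
# sector to one of its spares.
```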
In general, the latter option is the better one. It results in a much faster (basically immediate) recovery, requires no hardware to be swapped, and significantly reduces the chances of data loss. Over time the number of re-allocations that occur can be monitored, and if the disk detects an excessive number of them, or runs out of re-allocation space, it will report an error of its own and the disk can be replaced.
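As an example of the sort of monitoring that makes this workable, the sketch below shells out to smartctl (from the smartmontools package) and pulls out the Reallocated_Sector_Ct attribute that most ATA drives report. The device path and the "alert on any remaps" threshold are assumptions for illustration - NVMe devices, for instance, expose different counters.

```python
import subprocess

def reallocated_sectors(device):
    """Return the raw Reallocated_Sector_Ct value reported by smartctl,
    or None if the attribute isn't present. Needs smartmontools and the
    privileges to query the device."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        if "Reallocated_Sector_Ct" in line:
            return int(line.split()[-1])
    return None

count = reallocated_sectors("/dev/sda")    # hypothetical device
if count:
    print(f"/dev/sda has {count} remapped sectors - plan a replacement "
          "before the spare area runs out")
```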
In all cases, the issue that triggers a CRC error actually occurs before the data is read. There may have been a 'bit flip' months ago, but the error will remain undetected until the host actually attempts to read the data. This type of fault is normally referred to as a "latent data error" - the error can remain undetected for months or even years, especially if it is on a part of the disk that doesn't actually contain any host-level data and may never have been written to by the host.
Now, back to double-disk failures.
Consider a simple 2-disk RAID-1 mirror. Both of the disks have been lightly used, so only about 1/4 of the space on the disks is actually in use by data. Over time, both disks have had a single block go bad due to bit rot - on the first disk, this is within the area of space that is in use, whilst on the second disk it's in the 3/4 of the disk that does not contain any system-level data.
The host attempts to read the block which is bad on the first disk, and the disk generates a read error. The system detects this error, and carries out the read of the data from the second disk, which obviously succeeds. But what happens next? Let's say we've got a bad RAID implementation, and the RAID software simply marks the first disk as failed without any attempt to re-write the bad block. The system reports the error, which is (hopefully!) detected by the sysadmin, who arranges for the disk to be replaced.
Once the new disk is installed, the RAID software kicks off the re-mirroring process, which involves reading every byte of data from the still-functioning disk and writing it to the new disk. As the RAID software works at the disk level, it has no way of knowing that only 1/4 of the disk actually contains data, so it sets about copying the entire disk - which works well right up until it hits the latent data fault on the second disk. At this point, we have a double-disk failure. The rebuild of the (new) first disk is incomplete, and the second disk is now reporting an error which is stopping the rebuild. Our problem wasn't that the two bad blocks occurred within a short time of each other - they could have been lying in wait for years. Instead, it's actually the process of doing the rebuild that has caused the second disk to "fail" from a pre-existing fault.
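A small simulation makes the mechanism obvious: with only a quarter of the disk in use, a latent error sitting in the unused space is never touched by day-to-day I/O or by backups, but a block-level rebuild reads every block and trips straight over it. The disk size and usage figures here are made up purely for illustration.

```python
import random

DISK_BLOCKS = 1_000_000
used = range(DISK_BLOCKS // 4)                     # 1/4 of the disk holds data
latent_bad = random.randrange(DISK_BLOCKS // 4,    # a bad block hiding in the
                              DISK_BLOCKS)         # unused 3/4 of the disk

def read(block):
    if block == latent_bad:
        raise IOError(f"read error at block {block}")

# Normal operation and full backups only ever read used blocks,
# so the latent error goes unnoticed for years...
for blk in used:
    read(blk)
print("daily use and backups: no errors seen")

# ...but a block-level mirror rebuild reads *every* block, including the
# unused area - and that is the moment the second disk "fails".
try:
    for blk in range(DISK_BLOCKS):
        read(blk)
except IOError as err:
    print(f"rebuild aborted: {err}")
```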
Recovering from a double-disk failure
In many cases, it's entirely possible to recover from a double-disk failure. In the situation above, the rebuild will fail, but the system will generally stay fully up and running. There is a very good chance that the error actually occurred on a part of the disk that doesn't contain data. Why? Because if it was on a part that contained data then that block would have been read and the error detected earlier, either during general system operation or during a (full) backup that reads every piece of data on the system. (This is not necessarily the case with RAID-5, as the parity data is not read during normal read operations.)
At this point, the data can be migrated at the filesystem level to another disk on the system or to another system, or a backup can be taken to allow recovery after the system has been fixed. As a data-level copy will not attempt to access a failed block that sits in unused space, this copy will most likely complete without error. At worst, it will likely fail for only a single file (the one containing the errored block).
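A sketch of that kind of file-level migration is shown below: it walks the tree file by file and records, rather than aborts on, anything that hits a read error, so a single bad block costs at most one file. The source and destination paths are hypothetical.

```python
import os
import shutil

def copy_tree_best_effort(src_root, dst_root):
    """Copy a directory tree file by file, collecting (rather than dying on)
    any file that hits a read error part-way through."""
    failed = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        os.makedirs(os.path.join(dst_root, rel), exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            try:
                shutil.copy2(src, os.path.join(dst_root, rel, name))
            except OSError as err:        # e.g. EIO from a latent bad block
                failed.append((src, err))
    return failed

# Hypothetical mount points: pull the data off the degraded mirror.
for path, err in copy_tree_best_effort("/mnt/degraded", "/mnt/replacement"):
    print(f"could not copy {path}: {err}")
```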
Preventing double-disk failures
There's no way to completely protect yourself against double-disk failures, but there are a number of things you can do to make them far less likely.
- Monitor! Monitor! Monitor! If a disk fails and you don't detect it, it's only a matter of time until its partner goes as well. Having insufficient (or no!) monitoring of disk failures will all but guarantee that there's a "double-disk failure" in your future. Ideally monitor in multiple ways - watch log files for immediate notification of errors, but also schedule regular checks of the status of your RAID system, and if it's not healthy, raise the alert! (A minimal example for Linux software RAID is sketched after this list.)
- Use RAID software that attempts a re-write after a read error (and/or enable this feature if it's disabled). Not only will this reduce the potential number of double-disk failures, it'll also reduce the number of single-disk "failures" that lead to a disk being needlessly replaced.
- Enable periodic checking of the data in your RAID array. Many RAID systems/software allow you to schedule a regular check of the data on your array. All data (including both copies of the data, parity data, and even data that is unused) will be read, and any errors that are detected will normally be automatically corrected.
- Don't be lazy when creating RAID! Some systems allow you to create a new RAID volume (in particular, a RAID-1 volume) without actually mirroring the initial data, the logic being that the volume starts out "blank", so it doesn't matter if the two sides don't match - every block will be written (syncing the two sides of the mirror) before it's ever read. Whilst this may be a valid choice to make, the initial mirroring has a secondary benefit: it forces a full read and/or write of the data on both disks. By skipping this step you're losing the opportunity for the system to detect (and correct!) any existing latent data errors on either of the disks.
- Use RAID-6. RAID-6 has dual parity, and can thus handle double-disk failure situations - although how it does this is different from what most people expect. In most cases a double-disk failure due to a latent data error with RAID-6 will not cause it to offline the second failed disk and then continue using the additional redundancy. Instead it will use the additional parity information to correct the errored block, and then continue on with the rebuild. The distinction is important, because it means that you can experience the symptoms described above when using RAID-6 and not even notice that it has occurred - the expected behavior is not that a second disk will be offlined during the rebuild, but that the error will (often silently) be corrected and the rebuild will continue. I've had people tell me that RAID-6 has never saved them as they have never had 2 disks fail in a RAID-6 system at once - but the simple fact is that it could have saved them and they simply didn't realize.
- Use Flash! At the cell level, flash is actually MORE susceptible to bit rot than magnetic hard disks, but to handle this the CRC checking used in hard disks has been replaced in flash media with ECC - Error Checking and Correcting. ECC allows the drive not only to detect bit errors (as CRC does), but also to correct single-bit errors and many types of multi-bit errors. Thus even though bit rot is more likely to occur at the cell level, it is almost non-existent at higher levels thanks to ECC. That, along with the generally higher reliability of Flash/SSD due to no moving parts and lower operating heat, results in an almost zero chance of double-disk failures with Flash.
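As promised above, here's a minimal monitoring sketch. It assumes Linux md software RAID, where array health is visible in /proc/mdstat and a failed or missing member shows up as an underscore in the [UU]-style status string; hardware controllers and other RAID stacks need their own equivalent check. Run it from cron or a monitoring agent, and treat any output as an alert.

```python
import re

def degraded_md_arrays(mdstat_path="/proc/mdstat"):
    """Return the names of md arrays whose [UU]-style status string
    shows a missing member ('_') in /proc/mdstat."""
    degraded, current = [], None
    with open(mdstat_path) as f:
        for line in f:
            m = re.match(r"(md\d+)\s*:", line)
            if m:
                current = m.group(1)
            status = re.search(r"\[([U_]+)\]", line)
            if current and status and "_" in status.group(1):
                degraded.append(current)
    return degraded

for name in degraded_md_arrays():
    print(f"ALERT: {name} is running degraded - replace the failed disk now")
```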