What MTBF Really Means

A few weeks ago I commented to a colleague that the disks we use in XtremIO arrays have a manufacturer-stated Mean Time Between Failures (MTBF) of 2 million hours, but that we were actually seeing a much higher number in the field (over the 3 million hour mark last I looked).

Given that 2 million hours is 228 years, and the disks we have in the field are at most 18 months old, he asked how we could know that our MTBF is higher than 228 years.

The answer is that it's all down to statistics: MTBF isn't really a measure of how many years a single disk will last, even though that's how it's normally expressed.

2 million hours is about 228 years (2,000,000 divided by 8,760 hours in a year), which really means that you expect 1 out of every 228 disks to fail every year. The more correct way to put that is that the disks have an AFR (annual failure rate) of 1/228, or about 0.44%, but MTBF is easier to understand.
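To make the arithmetic above concrete, here is a minimal Python sketch of the conversion. The function names are my own for illustration, and it uses the simple approximation AFR = hours per year / MTBF, which is what the 1/228 figure corresponds to.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def mtbf_hours_to_years(mtbf_hours):
    """Express an MTBF given in hours as a number of years."""
    return mtbf_hours / HOURS_PER_YEAR


def afr_from_mtbf(mtbf_hours):
    """Approximate annual failure rate (as a fraction) for a given MTBF in hours."""
    return HOURS_PER_YEAR / mtbf_hours


mtbf_years = mtbf_hours_to_years(2_000_000)
afr = afr_from_mtbf(2_000_000)
print(f"{mtbf_years:.0f} years, AFR {afr:.2%}")  # prints "228 years, AFR 0.44%"
```

Run against a fleet of 1,000 disks, an AFR of 0.44% works out to roughly 4 or 5 expected failures per year.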

To confuse things even further, MTBF is normally only calculated over the first 5 years of a component's life. So it means that on average we expect 1 out of every 228 disks to fail each year for the first 5 years, and then possibly more than that in each subsequent year.

So to calculate the measured MTBF for a product you just work backwards: look at how many drives are installed and how long each has been installed, and see how many have failed. Convert that to a per-year failure rate and you've got the AFR, and from there you can work out the MTBF.
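As a rough sketch of that working-backwards calculation in Python: the fleet numbers below are made up purely for illustration (they are not real XtremIO field data), and the function names are my own. The idea is to add up the drive-years of service across the fleet, divide failures by that to get the AFR, and divide the total hours by failures to get the MTBF.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def field_afr(drive_years, failures):
    """Observed annual failure rate: failures per drive-year of service."""
    return failures / drive_years


def field_mtbf_hours(drive_years, failures):
    """Observed MTBF in hours: total powered-on hours divided by failures."""
    if failures == 0:
        raise ValueError("no failures observed, so MTBF cannot be estimated this way")
    return drive_years * HOURS_PER_YEAR / failures


# Hypothetical fleet: 10,000 drives installed for an average of 1 year each,
# with 29 failures seen in that time. These numbers are invented.
drive_years = 10_000 * 1.0
failures = 29

afr = field_afr(drive_years, failures)
mtbf = field_mtbf_hours(drive_years, failures)
print(f"AFR {afr:.2%}, MTBF about {mtbf / 1_000_000:.2f} million hours")
```

With these invented numbers the result is an AFR of 0.29% and an MTBF of a little over 3 million hours, which shows how a fleet that is only a year or so old can still yield an MTBF of several hundred years.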