Seth Woolley's Blog

Occasional Musings

google publishing research?

I came across a couple of papers on drive failures today.  One, from CMU's Parallel Data Lab and co-authored by Garth Gibson, won best paper at USENIX FAST '07.

Read it here:

That's how you write a good paper.  I used to work at Gibson's company, Panasas.

Now, note that Google submitted a paper too!

Reading this paper was so boring.  I learned nothing except one empirical fact:

Drive failures are correlated with low SMART-reported temperature.

They then suggest that environmental temperature need not be too cold.

Before anybody buys that second conclusion, I think they should actually control for a few more factors:

SMART temperature values aren't very reliable.  Anybody who's ever cold-booted a drive array and read the SMART temps off each drive knows how much they vary.

Given that their temperature readings weren't reliable, and weren't backed up by secondary measurements from something better than a cheap sensor, we can make a few observations:

  • yes, it's true that if you're reading SMART parameters and see a low temp value, your drive might fail sooner.
  • but that's because the temperature might be badly underreported
  • and an underreported temperature can trigger firmware hacks that turn on background verify at lower temperatures
  • which can lead to increased wear as well as increased error detection
  • and it means the drive firmware can't detect overheating in order to initiate drive back-off.
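
To make that chain of reasoning concrete, here is a toy simulation.  Everything in it is an assumption made up for illustration: the 35 C background-verify threshold, the failure probabilities, and the idea that one sensor in four underreports by 10 C.  None of it comes from either paper or from any real firmware.  It only models the background-verify bullets (not the missed-overheating one), and failure never depends on the true temperature at all.

```python
import random

# All constants below are illustrative assumptions, not measured values.
VERIFY_BELOW_C = 35.0      # hypothetical firmware rule: background verify runs below this reported temp
BASE_FAIL = 0.03           # assumed baseline failure probability per drive
VERIFY_EXTRA_FAIL = 0.05   # assumed extra failure probability from background-verify wear
SENSOR_BIASES_C = [0.0, 0.0, 0.0, 10.0]  # assumed: one sensor in four underreports by 10 C


def simulate(n_drives=20000, seed=0):
    """Return a list of (true_temp, reported_temp, failed) tuples."""
    rng = random.Random(seed)
    drives = []
    for _ in range(n_drives):
        true_temp = rng.gauss(40.0, 3.0)  # assumed spread of real drive temperatures
        reported = true_temp - rng.choice(SENSOR_BIASES_C)
        verify_on = reported < VERIFY_BELOW_C  # firmware only sees the reported value
        p_fail = BASE_FAIL + (VERIFY_EXTRA_FAIL if verify_on else 0.0)
        drives.append((true_temp, reported, rng.random() < p_fail))
    return drives


def failure_rate(drives, column, low, high):
    """Failure rate among drives whose temperature in `column` lies in [low, high)."""
    hits = [d[2] for d in drives if low <= d[column] < high]
    return sum(hits) / len(hits)


def gaps(drives):
    """Return (gap by reported temp, gap by true temp): cold-bin rate minus warm-bin rate."""
    inf = float("inf")
    by_reported = failure_rate(drives, 1, -inf, 35.0) - failure_rate(drives, 1, 35.0, inf)
    by_true = failure_rate(drives, 0, -inf, 40.0) - failure_rate(drives, 0, 40.0, inf)
    return by_reported, by_true


if __name__ == "__main__":
    by_reported, by_true = gaps(simulate())
    print(f"cold-minus-warm failure gap by reported temp: {by_reported:.3f}")
    print(f"cold-minus-warm failure gap by true temp:     {by_true:.3f}")
```

In this sketch the "cold" drives fail noticeably more often when you bin by reported temperature, and the gap mostly disappears when you bin by true temperature.  That's the pattern you'd expect if the sensor reading is driving firmware behavior rather than just measuring the environment.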

Where I used to work, we used heat chambers to test drive firmware funkiness with temperature.  We disabled any firmware that tried to do background verify automatically, as that lowers performance.  Our software handled failures automatically and did its own background scrubbing based on real parity checks.

The crux is that Google failed to use independent variables and failed to correlate their data with performance, which the firmware controls to correct overheating, using data from the same sensor they're reading from.  It's simply not an independent variable anymore.

Secondly, drive failure at the beginning of the curve is under-represented.  They make no mention of the initial OS installs or initial burn-in testing that their own operations processes perform on hardware.

Perhaps they never had their operations people review the claim that drive birth defects are not a significant factor in drive failure.  I know for a fact that birth defects are common, having worked on factory processes for a storage vendor that had burn-in testing and kept track of failure statistics.
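
Here's a rough sketch of why burn-in matters for the early part of the curve.  The numbers are invented for illustration: a Weibull lifetime with shape below 1 (so failures are front-loaded), a 14-day burn-in, and a 90-day "early failure" window.  None of these are real drive statistics, and the function name is hypothetical.  It compares the early failure fraction over all drives built against what you'd see if you only counted drives that survived burn-in and started their clock at deployment.

```python
import random

# Illustrative assumptions only, not real drive statistics.
SHAPE = 0.5          # Weibull shape < 1 means the hazard rate falls with age (birth defects)
SCALE_DAYS = 20000.0  # assumed Weibull scale, in days


def early_failure_fractions(n=50000, burn_in_days=14.0, window_days=90.0, seed=1):
    """Return (true_early, observed_early) failure fractions.

    true_early:     fraction of all drives failing within window_days of first power-on.
    observed_early: fraction of burn-in survivors failing within window_days of deployment.
    """
    rng = random.Random(seed)
    # random.weibullvariate(alpha, beta): alpha is the scale, beta is the shape.
    lifetimes = [rng.weibullvariate(SCALE_DAYS, SHAPE) for _ in range(n)]
    true_early = sum(t < window_days for t in lifetimes) / n
    # Drives that die during burn-in never reach the field; survivors' ages restart at zero.
    fielded = [t - burn_in_days for t in lifetimes if t >= burn_in_days]
    observed_early = sum(t < window_days for t in fielded) / len(fielded)
    return true_early, observed_early


if __name__ == "__main__":
    true_early, observed_early = early_failure_fractions()
    print(f"early failures, all drives built:      {true_early:.3f}")
    print(f"early failures, seen after burn-in:    {observed_early:.3f}")
```

Under these assumptions the fielded population shows fewer early failures than the drives actually had, simply because the worst ones were screened out before anybody started counting.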

So, before you listen to Google, study drives.  They appear to not know anything about actual drives, other than to try to read indirect indicators.

And remember, Google's an advertising agency.  What do they know about science?
