Re: SMART disk monitoring

On Tue, 14 Nov 2017, Piotr Dałek wrote:
> On 17-11-14 05:09 AM, Ric Wheeler wrote:
> > On 11/13/2017 05:23 PM, Piotr Dałek wrote:
> > > This may not work anyway, because many controllers (including JBOD
> > > controllers) don't pass through SMART data, or the data they report
> > > don't make sense.
> > 
> > You are right that many controllers don't pass this information without
> > going through their non-open source tools. The libstoragemgmt project -
> > https://github.com/libstorage/libstoragemgmt - has added support for doing
> > some types of access for the physical back end drives. It is worth syncing
> > up with them I think to see how we might be able to extract interesting
> > bits.
> 
> There's another problem - bcache/flashcache/<insert your favorite vendor>
> cache - OSDs often reside on top of some cache device, and accessing SMART
> values for that might not work, or might not return all the required values.

For devicemapper devices at least it is pretty straightforward to work out 
the underlying physical device.
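
For example (just a rough sketch, not actual Ceph code), something like 
this resolves a dm device to its backing devices by walking sysfs; the 
device names are only illustrative:

  # Rough sketch: walk /sys/block/<dev>/slaves to find the physical
  # device(s) backing a device-mapper device.  Assumes the standard
  # Linux sysfs layout; partitions (e.g. sda3) are returned as-is.
  import os

  def physical_slaves(dev):
      slaves_dir = "/sys/block/%s/slaves" % dev
      if not os.path.isdir(slaves_dir):
          return [dev]                          # already a physical device
      devs = []
      for slave in os.listdir(slaves_dir):
          devs.extend(physical_slaves(slave))   # recurse through stacked layers
      return devs

  print(physical_slaves("dm-0"))                # e.g. ['sda', 'sdb']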

I'm sure there will always be some devices and stacks that successfully 
obscure the reliability data, but most deployments will benefit.

> > There is also a lot of information about drive failures (SSD and spinning)
> > at USENIX FAST over many years. Things have improved a lot over the years,
> > especially with modern SSD's and NVME where a lot of hard work has happened
> > to add improved metrics to the data.
> 
> That's my point. That's a lot of statistics to chew through, and most of it
> relies on assumptions that can already be wrong or become wrong some time
> later. All it takes is a brand-new product line with different
> characteristics. SSDs are different - you just measure the number of
> erase/program cycles and (again) make assumptions based on that - that's
> easier and more reliable.
> Still, I would be *very* unhappy to be woken up in the middle of the night
> just to realize that the cluster incorrectly predicted a disk failure, and my
> company (and I'm pretty sure not only my company) wouldn't be happy either if
> the cluster forced it to throw away perfectly good disks because reusing them
> would yield the same result.
> On the other hand, this creates a back door for vendors to force device
> replacement even when it's perfectly fine; some SSD vendors already do this,
> with their devices going into read-only mode even when there's a whole lot of
> p/e cycles left in the flash cells. I don't think we need Ceph to go this way.

OT: I view building good prediction models as an orthogonal problem, and 
one that relies on collecting a large data set.  Patrick McGarry and 
several others are working on a related project to build a public data set 
of SMART and other reliability data so that such models can be built for 
use in open systems.  Current data sets from Backblaze cover only a small 
set of device models, which means only large cloud providers or system 
vendors with large deployments are able to gather enough health metrics 
and failure data to build good models.  The goal of the other project is 
to allow regular users (of systems like Ceph) to opt into sharing 
reliability data so that better models can be built--ones that cover a 
broader range of devices.
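
To make the opt-in concrete, here is a rough sketch (not part of Ceph or 
of that project) of the kind of per-device record a collector could 
gather. It relies only on smartmontools' `smartctl -A` output; the 
surrounding record layout is made up for illustration:

  # Rough sketch: collect a few SMART attributes for one device so they
  # could be reported by an opt-in collector.
  import subprocess

  def smart_attributes(dev):
      # smartctl exit codes encode status bits, so don't treat nonzero as fatal
      out = subprocess.run(["smartctl", "-A", dev],
                           stdout=subprocess.PIPE).stdout.decode()
      attrs = {}
      for line in out.splitlines():
          fields = line.split()
          # ATA attribute rows start with a numeric ID, e.g.
          #   5 Reallocated_Sector_Ct ... RAW_VALUE
          if len(fields) >= 10 and fields[0].isdigit():
              attrs[fields[1]] = fields[9]      # attribute name -> raw value
      return attrs

  record = {"device": "/dev/sda", "smart": smart_attributes("/dev/sda")}
  print(record)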
 
> tl;dr - I'm fine with that feature as long as there'll be a possibility to
> disable it entirely.

Of course!

sage
