On 11/13/2017 05:23 PM, Piotr Dałek wrote:
On 17-11-12 09:16 PM, Sage Weil wrote:
On Sun, 12 Nov 2017, Lars Marowsky-Bree wrote:
On 2017-11-10T22:36:46, Yaarit Hatuka <yaarit@xxxxxxxxx> wrote:
Many thanks! I'm very excited to join Ceph's outstanding community!
I'm looking forward to working on this challenging project, and I'm
very grateful for the opportunity to be guided by Sage.
That's all excellent news!
Can we discuss, though, if/how this belongs in ceph-osd? Given that this
can be (and already is) collected via smartmon, either via prometheus or, I
assume, collectd as well? Does this really need to be added to the OSD
code?
Would the goal be for them to report this to ceph-mgr, or expose
directly as something to be queried via, say, a prometheus exporter
binding? Or are the OSDs supposed to directly act on this information?
The OSD is just a convenient channel, but needn't be the only
one or only option.
Part 1 of the project is to get JSON output out of smartctl so we avoid
one of the many crufty projects floating around to parse its weird output;
that'll be helpful to all consumers, presumably.
That means a new patch to smartctl itself, right?
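To illustrate why JSON output (Part 1) would help consumers, here is a minimal
sketch of what parsing could look like. The thread does not define the JSON
layout, so the field names used here ("device", "attributes", "id", "name",
"raw") and the sample values are assumptions made up for illustration, not
smartctl's actual format.

```python
import json

# Hypothetical JSON shape: the thread does not define one, so every field
# name and value below is an assumption for illustration only.
SAMPLE = """
{
  "device": "/dev/sda",
  "attributes": [
    {"id": 5, "name": "Reallocated_Sector_Ct", "raw": 12},
    {"id": 9, "name": "Power_On_Hours", "raw": 20431},
    {"id": 194, "name": "Temperature_Celsius", "raw": 34}
  ]
}
"""


def parse_smart_json(text):
    """Return (device, {attribute_id: raw_value}) from the assumed JSON shape."""
    doc = json.loads(text)
    # A missing "attributes" key is treated as "no attributes reported".
    attrs = {a["id"]: a["raw"] for a in doc.get("attributes", [])}
    return doc["device"], attrs


device, attrs = parse_smart_json(SAMPLE)
print(device, attrs.get(5))
```

With structured output, a consumer needs only a JSON parser and a lookup by
attribute id, rather than a custom parser for the human-readable table.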
Part 2 is to map OSDs to host:device pairs; that merged already.
Part 3 is to gather the actual data. The prototype has the OSD polling
this because it (1) knows which devices it consumes and (2) is present on
every node. We're contemplating a per-host ceph-volume-agent for
assisting with OSD (de)provisioning (i.e., running ceph-volume); that
could be an option. Or if some other tool is already scraping it and can
be queried, that would work too.
I think the OSD will end up being a necessary path (perhaps among many),
though, because when we are using SPDK I don't think we'll be able to get
the SMART data via smartctl (or any other tool) at all because the OSD
process will be running the NVMe driver.
This may not work anyway, because many controllers (including JBOD
controllers) don't pass through SMART data, or the data don't make sense.
You are right that many controllers don't pass this information without going
through their non-open source tools. The libstoragemgmt project -
https://github.com/libstorage/libstoragemgmt - has added support for doing some
types of access for the physical back end drives. It is worth syncing up with
them I think to see how we might be able to extract interesting bits.
Part 4 is to archive the results. The original thought was to dump it
into RADOS. I hadn't considered prometheus, but that might be a better
fit! I'm generally pretty cautious about introducing dependencies like
this but we're already expecting prometheus to be used for other metrics
for the dashboard. I'm not sure whether prometheus' query interface lends
itself to the failure models, though...
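If prometheus were used for Part 4, the gathered attributes would need to be
exposed in its text exposition format. The following is a rough sketch under
stated assumptions: the metric name smart_attribute_raw and the label names
(host, device, attribute) are hypothetical, and nothing in the thread fixes
them.

```python
def escape_label(value):
    # The text format requires escaping backslash, double quote, and newline
    # inside label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_prometheus(host, device, attrs):
    """Render {attribute_name: raw_value} as exposition-format text.

    The metric and label names are hypothetical, chosen for illustration.
    """
    lines = ["# TYPE smart_attribute_raw gauge"]
    for name in sorted(attrs):
        labels = 'host="%s",device="%s",attribute="%s"' % (
            escape_label(host), escape_label(device), escape_label(name))
        lines.append("smart_attribute_raw{%s} %d" % (labels, attrs[name]))
    return "\n".join(lines) + "\n"


print(to_prometheus("node1", "/dev/sda", {"Reallocated_Sector_Ct": 12}))
```

Keeping host and device as labels preserves the host:device mapping from
Part 2, so a query can select one drive or aggregate across a node.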
Part 5 is to do some basic failure prediction!
SMART is unreliable on spinning disks, and on SSDs it's only as reliable as
firmware goes (and that is often questionable).
Also, many vendors give different meaning to different SMART attributes,
making some of the obvious choices (like power-on hours or power-cycle count)
useless (see https://www.backblaze.com/blog/hard-drive-smart-stats/ for example).
SMART data has been used selectively by major storage vendors for years to help
flag errors. For spinning drives, one traditional red flag was the number of
reallocated sectors (normalized by the number a spinning drive has). When you
start chewing through those, that is a pretty good flag. Seagate and others did
a lot of work (and models) that turned SMART data into a good predictor for
failure on spinning drives, but it is not entirely trivial to do.
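The reallocated-sector red flag described above can be sketched as a simple
ratio check. This is only an illustration: it reads "the number a spinning
drive has" as the drive's pool of spare sectors, assumes the caller knows that
pool size, and uses a 10% threshold that is an arbitrary placeholder rather
than a value from any vendor model.

```python
def reallocated_fraction(reallocated, spare_total):
    """Fraction of the drive's spare sectors already consumed.

    spare_total is assumed to be known to the caller; this sketch does not
    say where that number comes from.
    """
    if spare_total <= 0:
        raise ValueError("spare_total must be positive")
    return reallocated / spare_total


def is_red_flag(reallocated, spare_total, threshold=0.1):
    # threshold=0.1 is a hypothetical placeholder, not a vendor-derived value.
    return reallocated_fraction(reallocated, spare_total) >= threshold


print(is_red_flag(50, 1000))
print(is_red_flag(200, 1000))
```

A real predictor would combine several attributes and their trends over time,
which is part of why the text calls the problem non-trivial.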
For example, at the USENIX Vault conference, there was this presentation which
showed some interesting recent work:
http://sched.co/9WQT
There is also a lot of information about drive failures (SSD and spinning) at
USENIX FAST over many years. Things have improved a lot over the years,
especially with modern SSDs and NVMe, where a lot of hard work has happened to
add improved metrics to the data.
Regards,
Ric
Anyway, we'd love to see this feature be something that can be completely
disabled by a config change and that doesn't introduce any backwards
incompatibility by itself.