Re: SMART disk monitoring

On Mon, 13 Nov 2017, Sage Weil wrote:
> On Mon, 13 Nov 2017, Lars Marowsky-Bree wrote:
> > On 2017-11-13T10:46:25, John Spray <jspray@xxxxxxxxxx> wrote:
> > 
> > > At the risk of stretching the analogy to breaking point, when we build
> > > something "batteries included", it doesn't mean someone can't also
> > > plug it into a mains power supply :-)
> > 
> > Plugging something designed to take 2x AAA cells into a mains power
> > supply is usually considered a bad idea, though ;-)
> > 
> > > My attitude to prometheus is that we should use it (a lot! I'm a total
> > > fan boy) but that it isn't an exclusive relationship: plug prometheus
> > > into Ceph and you get the histories of things, but without prometheus
> > > you should still be able to see all the latest values.
> > 
> > That makes sense, of course. Prometheus scrapes values from various
> > sources, and if it could scrape data directly off the ceph-osd
> > processes, why not.
> > 
> > > In that context, I would wonder if it would be better to initially do
> > > the SMART work with just latest values (for just latest vals we could
> > > persist these in config keys), and any history-based failure
> > > prediction would perhaps depend on the user having a prometheus server
> > > to store the history?
> > 
> > That isn't a bad idea, but would you really want to persist this in a
> > (potentially rather large) map? That'd involve relaying them to the MONs
> > or mgr.
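
A minimal sketch of how small a per-OSD "latest values" payload could be if it
were persisted via config keys; the key name, attribute selection, and values
here are invented for illustration:

import json

# Hypothetical snapshot of the latest SMART values for one device, as it
# might be stored under a config key like "device/<osd_id>/smart_latest".
latest = {
    "device": "/dev/sdb",
    "scraped_at": "2017-11-13T10:46:25Z",
    "attrs": {
        "5":   {"name": "Reallocated_Sector_Ct",  "raw": 0},
        "187": {"name": "Reported_Uncorrect",     "raw": 0},
        "194": {"name": "Temperature_Celsius",    "raw": 34},
        "197": {"name": "Current_Pending_Sector", "raw": 0},
    },
}

blob = json.dumps(latest)
# A handful of attributes per device keeps each entry well under a
# kilobyte, so the resulting map stays modest even for large clusters.
print("%d bytes" % len(blob))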
> > 
> > Wouldn't it make more sense for something that wants to look at this
> > data to contact the relevant daemon? Having it also expose the data in
> > the Prometheus exporter format would be useful (so the values can be
> > ingested directly), of course.
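
For illustration, a minimal sketch (Python; the metric name, labels, and
values are made up) of what serving SMART attributes in the Prometheus text
exposition format could look like, so a scraper can ingest them straight off
the daemon:

# Render SMART attributes in the Prometheus text exposition format:
# one gauge sample per attribute, labelled by device and attribute name.
def render_smart_metrics(device, attrs):
    lines = [
        "# HELP ceph_smart_attr_raw Raw SMART attribute value",
        "# TYPE ceph_smart_attr_raw gauge",
    ]
    for attr_id, (name, raw) in sorted(attrs.items()):
        lines.append(
            'ceph_smart_attr_raw{device="%s",attr_id="%s",attr_name="%s"} %s'
            % (device, attr_id, name, raw)
        )
    return "\n".join(lines) + "\n"

print(render_smart_metrics("/dev/sdb", {
    "5": ("Reallocated_Sector_Ct", 0),
    "194": ("Temperature_Celsius", 34),
}))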
> 
> The decision about preemptive failure should be made by the mgr
> module regardless (so it can consider other factors, like cluster
> health and fullness), so if it's not getting the raw data to apply the
> model, it needs to get a sufficiently meaningful metric (e.g., a
> precision-recall curve or the area under the precision-recall curve [1]).

[1] http://events.linuxfoundation.org/sites/events/files/slides/LF-Vault-2017-aelshimi.pdf
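
To make [1] a bit more concrete, a small sketch of how an approximate area
under the precision-recall curve could be computed from a failure-prediction
model's scored devices; the scores and failure labels below are invented:

def pr_curve(scored):
    # scored: (score, failed) pairs; a higher score means "more likely to fail".
    scored = sorted(scored, key=lambda s: -s[0])
    total_pos = sum(1 for _, failed in scored if failed)
    tp = fp = 0
    points = []
    for _, failed in scored:
        if failed:
            tp += 1
        else:
            fp += 1
        points.append((tp / float(total_pos), tp / float(tp + fp)))  # (recall, precision)
    return points

def average_precision(points):
    # Step-wise integration of precision over recall (approximates AUC-PR).
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap

preds = [(0.95, True), (0.80, True), (0.60, False), (0.40, True), (0.20, False)]
print(average_precision(pr_curve(preds)))  # ~0.92 for this toy data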

