> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> Sent: Wednesday, December 02, 2015 11:04 AM
> To: Gregory Farnum; Vimal
> Cc: ceph-devel
> Subject: Re: Suggestions on tracker 13578
>
> On 12/02/2015 12:23 PM, Gregory Farnum wrote:
> > On Tue, Dec 1, 2015 at 5:23 AM, Vimal <vikumar@xxxxxxxxxx> wrote:
> >> Hello,
> >>
> >> This mail is to discuss the feature request at
> >> http://tracker.ceph.com/issues/13578.
> >>
> >> If done, such a tool should help point out several misconfigurations
> >> that may cause problems in a cluster later.
> >>
> >> Some of the suggestions are:
> >>
> >> a) A check to understand if the MONs and OSD nodes are on the same
> >> machines.
> >>
> >> b) If /var is a separate partition or not, to prevent the root
> >> filesystem from being filled up.
> >>
> >> c) If monitors are deployed in different failure domains or not.
> >>
> >> d) If the OSDs are deployed in different failure domains.
> >>
> >> e) If a journal disk is used for more than six OSDs. Right now, the
> >> documentation suggests up to 6 OSD journals on a single journal disk.
> >>
> >> f) Failure domains depending on the power source.
> >>
> >> There can be several more checks, and it could be a useful tool for
> >> finding problems in an existing cluster or a new installation.
> >>
> >> But I'd like to know how the engineering community sees this, whether
> >> it seems worth pursuing, and what suggestions you have for
> >> improving/adding to it.
> >
> > This is a user experience and support tool; I don't think the
> > engineering community can really judge its value. ;)
> >
> > So sure, sounds good to me. It'll need to get into the hands of users
> > before we find out if it's a good plan or not.
> > I was at the SDI Summit
> > yesterday and was hearing about how some of our choices (like
> > HEALTH_WARN on pg counts) are *really* scary for users who think
> > they're in danger of losing data. I suspect the difficulty of a tool
> > like this will be more in the communication of issues and severity
> > than in what exactly we choose to check.
>
> Frankly, I've never been a big fan of how we report warnings like this
> through the health check. It's important to let users know if they've
> set things up sub-optimally, but I don't think ceph health is the way
> to do it. It's the difference between your doctor telling you that you
> should exercise more and lose a few pounds vs. telling you that you
> have Ebola and are going to suffer an incredibly gruesome and painful
> death in the next 48 hours. :)

Since I was the one at the SDI Summit who took issue with some of these warnings, I whole-heartedly agree with Greg's and Mark's comments. A warning from a health check should indicate to the user that some corrective action should be taken, besides turning the warning off. :-)

I do not have an issue with reporting advisories, but they should be kept separate from true warnings. If we want to notify the user of variances from best practices, I suggest a separate method, e.g. "ceph advise", rather than constantly repeating them in health checks.

> > -Greg
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
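For illustration only, here is a minimal sketch of what two of the proposed advisory checks, (b) the /var partition check and (e) the journal fan-out check, might look like as a standalone script. Everything here is hypothetical: the function names, the `ADVISE:` output prefix, and the way the OSD-to-journal layout is passed in are assumptions, not part of any existing Ceph tool.

```python
import os
from collections import Counter

# Hypothetical sketch of two advisory checks from the proposal;
# not part of any real "ceph advise" command.

JOURNAL_OSD_LIMIT = 6  # docs suggest at most 6 OSD journals per journal disk


def check_var_is_separate_partition(path="/var"):
    """Check (b): advise if `path` shares a device with the root filesystem."""
    try:
        if os.stat(path).st_dev == os.stat("/").st_dev:
            return [f"ADVISE: {path} is on the root filesystem; "
                    f"a full {path} can fill up /"]
    except OSError:
        pass  # path missing or unreadable: nothing to advise on
    return []


def check_journal_fanout(osd_journals):
    """Check (e): advise if any journal disk backs more than six OSDs.

    `osd_journals` maps OSD id -> journal device, e.g. {0: "/dev/sdb", ...}
    (an assumed input format for this sketch).
    """
    advisories = []
    for disk, count in sorted(Counter(osd_journals.values()).items()):
        if count > JOURNAL_OSD_LIMIT:
            advisories.append(f"ADVISE: journal disk {disk} backs {count} "
                              f"OSDs (suggested max {JOURNAL_OSD_LIMIT})")
    return advisories


if __name__ == "__main__":
    layout = {i: "/dev/sdb" for i in range(7)}  # 7 OSDs on one journal disk
    for line in check_var_is_separate_partition() + check_journal_fanout(layout):
        print(line)
```

The point of the sketch is the separation Greg, Mark, and the author argue for: these functions only ever emit advisories about variances from best practice, so their output never has to be mixed into the HEALTH_WARN/HEALTH_ERR severity scale.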