On Tue, Dec 1, 2015 at 5:23 AM, Vimal <vikumar@xxxxxxxxxx> wrote:
> Hello,
>
> This mail is to discuss the feature request at
> http://tracker.ceph.com/issues/13578.
>
> If done, such a tool should help point out several misconfigurations that
> may cause problems in a cluster later.
>
> Some of the suggestions are:
>
> a) A check to understand if the MONs and OSD nodes are on the same machines.
>
> b) If /var is a separate partition or not, to prevent the root filesystem
> from being filled up.
>
> c) If monitors are deployed in different failure domains or not.
>
> d) If the OSDs are deployed in different failure domains.
>
> e) If a journal disk is used for more than six OSDs. Right now, the
> documentation suggests up to 6 OSD journals on a single journal disk.
>
> f) Failure domains depending on the power source.
>
> There can be several more checks, and it could be a useful tool for finding
> problems in an existing cluster or a new installation.
>
> But I'd like to know how the engineering community sees this, whether it
> seems worth pursuing, and what suggestions you have for improving or adding
> to it.

This is a user experience and support tool; I don't think the engineering
community can really judge its value. ;) So sure, sounds good to me. It'll
need to get into the hands of users before we find out if it's a good plan
or not.

I was at the SDI Summit yesterday and was hearing about how some of our
choices (like HEALTH_WARN on pg counts) are *really* scary for users who
think they're in danger of losing data. I suspect the difficulty of a tool
like this will lie more in communicating issues and severity than in what
exactly we choose to check.
-Greg
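
For concreteness, here is a minimal sketch of what a couple of the proposed
checks might look like in Python. It is not part of any existing Ceph tool;
the function names, the warning wording, and the example journal-to-OSD
mapping are all illustrative assumptions. Check (b) is approximated by
reading /proc/mounts, and check (e) assumes the caller already knows which
OSDs share a journal device.

#!/usr/bin/env python
# Hypothetical sketch of checks (b) and (e) from the proposal above.
# Linux-only (reads /proc/mounts); names are illustrative, not from Ceph.

def mount_points():
    """Return the set of mount points listed in /proc/mounts."""
    points = set()
    with open('/proc/mounts') as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2:
                points.add(fields[1])
    return points

def check_var_is_separate():
    """Check (b): warn if /var shares a filesystem with /."""
    if '/var' in mount_points():
        return "/var is a separate filesystem"
    return ("WARNING: /var is on the root filesystem; monitor/OSD logs "
            "and stores may fill /")

def check_journal_ratio(journal_to_osds, max_osds_per_journal=6):
    """Check (e): warn about journal devices serving more than the
    suggested number of OSDs. `journal_to_osds` maps a journal device
    to the list of OSD ids using it."""
    warnings = []
    for dev, osds in journal_to_osds.items():
        if len(osds) > max_osds_per_journal:
            warnings.append("WARNING: %s serves %d OSDs (suggested max %d)"
                            % (dev, len(osds), max_osds_per_journal))
    return warnings

if __name__ == '__main__':
    print(check_var_is_separate())
    # Example input; a real tool would discover this from the cluster.
    example = {'/dev/sdb': ['osd.%d' % i for i in range(8)]}
    for w in check_journal_ratio(example):
        print(w)

A real implementation would gather the journal mapping and host roles from
the cluster itself (and from the deployment tooling) rather than taking
them as arguments, and would report severity in whatever form the
communication question Greg raises is settled on.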