Hi Sage,

Yeah, that is what I mean, and the output makes more sense than what I was thinking of before (only show it if I cannot reach the whole crush level). I will try to do it then. Thanks.

-Xiaoxi

> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Friday, November 20, 2015 7:28 PM
> To: Chen, Xiaoxi
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: Aggregate failure report in ceph -s
>
> On Fri, 20 Nov 2015, Chen, Xiaoxi wrote:
> >
> > Hi Sage,
> >
> >       As we are looking at the failure detection part of
> > ceph (basically around the osd flipping issue), we got a suggestion
> > from a customer to show an aggregated failure report in "ceph -s".
> > The idea is:
> >
> >       When an OSD finds it cannot hear heartbeats from some of its
> > peers, it will try to aggregate by failure domain, say "I cannot
> > reach any of my peers in Rack C, something is wrong", and this kind of
> > log will be shown in ceph -s. So if we look at ceph -s and notice a lot
> > of complaints about not being able to reach Rack C, we can easily
> > diagnose that Rack C has a network issue.
> >
> > Does that make sense?
>
> Yeah, sounds reasonable to me!  It's a bit more awkward to do this at the
> mon level since rack C may talk to the mon, but doing it at the OSD makes
> sense.  There will be a lot of heuristics involved, though.  I expect the
> messages might include
>
>  - cannot reach _% of peers outside of my $crushlevel $foo [on front|back]
>  - cannot reach _% of hosts in $crushlevel $foo [on front|back]
>
> ?
>
> Also note that it would be easiest to log these in the cluster log (ceph -w,
> not ceph -s).. I'm guessing that's what you mean?
>
> Thanks!
> sage
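
For illustration only, here is a minimal sketch of the per-CRUSH-bucket aggregation heuristic discussed above. This is not actual Ceph code: the type and function names (PeerStatus, report_unreachable_buckets), the 50% threshold, and the idea of counting unreachable peers per bucket separately for the front and back networks are all assumptions made for the sketch; a real implementation would live in the OSD heartbeat path and send the message to the cluster log (ceph -w) rather than stdout.

    // Hypothetical sketch: group heartbeat failures by CRUSH bucket (e.g. rack)
    // and report buckets where most peers are unreachable, roughly matching
    // "cannot reach _% of peers in $crushlevel $foo [on front|back]".
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>

    struct PeerStatus {
      std::string crush_bucket;   // e.g. the rack the peer OSD lives in
      bool reachable_front;       // heartbeat seen on the public (front) network
      bool reachable_back;        // heartbeat seen on the cluster (back) network
    };

    void report_unreachable_buckets(const std::map<int, PeerStatus>& peers,
                                    double threshold = 0.5) {
      // bucket name -> {unreachable count, total count}
      std::map<std::string, std::pair<int, int>> front_counts;
      std::map<std::string, std::pair<int, int>> back_counts;

      for (const auto& kv : peers) {
        const PeerStatus& st = kv.second;
        auto& f = front_counts[st.crush_bucket];
        auto& b = back_counts[st.crush_bucket];
        f.second++; b.second++;
        if (!st.reachable_front) f.first++;
        if (!st.reachable_back)  b.first++;
      }

      auto emit = [](const std::map<std::string, std::pair<int, int>>& counts,
                     const char* net, double thresh) {
        for (const auto& entry : counts) {
          double frac = static_cast<double>(entry.second.first) / entry.second.second;
          if (frac >= thresh)
            // A real implementation would send this to the cluster log.
            std::cout << "cannot reach " << static_cast<int>(frac * 100)
                      << "% of peers in " << entry.first
                      << " on " << net << "\n";
        }
      };
      emit(front_counts, "front", threshold);
      emit(back_counts, "back", threshold);
    }

    int main() {
      // Toy input: two peers in rack-C are unreachable on the front network.
      std::map<int, PeerStatus> peers = {
        {1, {"rack-A", true,  true }},
        {2, {"rack-C", false, false}},
        {3, {"rack-C", false, true }},
      };
      report_unreachable_buckets(peers);
      return 0;
    }

With the toy input above, the sketch would report rack-C as mostly unreachable on both front and back, which is the kind of aggregated hint the proposal wants a rack-level network problem to produce.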