Hi Sage,

Yeah, that is what I mean, and the output makes more sense than what I was thinking of before (only show it if I cannot reach the whole crush level). I will try to do it then. Thanks.

-Xiaoxi

> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Friday, November 20, 2015 7:28 PM
> To: Chen, Xiaoxi
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: Aggregate failure report in ceph -s
>
> On Fri, 20 Nov 2015, Chen, Xiaoxi wrote:
> >
> > Hi Sage,
> >
> >       As we are looking at the failure detection part of
> > ceph (basically around the osd flipping issue), we got a suggestion
> > from a customer to show an aggregated failure report in "ceph -s".
> > The idea is:
> >
> >       When an OSD finds it cannot hear heartbeats from some of its
> > peers, it will try to aggregate by failure domain, say "I cannot
> > reach any of my peers in Rack C, something is wrong", and this kind of
> > log will be shown in ceph -s. So if we look at ceph -s and notice a lot
> > of complaints about not being able to reach Rack C, we can easily
> > diagnose that Rack C has a network issue.
> >
> > Does that make sense?
>
> Yeah, sounds reasonable to me!  It's a bit more awkward to do this at the
> mon level since rack C may talk to the mon, but doing it at the OSD makes
> sense.  There will be a lot of heuristics involved, though.  I expect the
> messages might include
>
>  - cannot reach _% of peers outside of my $crushlevel $foo [on front|back]
>  - cannot reach _% of hosts in $crushlevel $foo [on front|back]
>
> ?
>
> Also note that it would be easiest to log these in the cluster log (ceph -w,
> not ceph -s).. I'm guessing that's what you mean?
>
> Thanks!
> sage
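
For illustration only, here is a minimal sketch of the per-CRUSH-bucket aggregation heuristic discussed above. This is not actual Ceph code: the type and function names (PeerStatus, report_unreachable_buckets), the 50% threshold, and the idea of counting unreachable peers per bucket separately for the front and back networks are all assumptions made for the sketch; a real implementation would live in the OSD heartbeat path and send the message to the cluster log (ceph -w) rather than stdout.

    // Hypothetical sketch: group heartbeat failures by CRUSH bucket (e.g. rack)
    // and report buckets where most peers are unreachable, roughly matching
    // "cannot reach _% of peers in $crushlevel $foo [on front|back]".
    #include <iostream>
    #include <map>
    #include <string>
    #include <utility>

    struct PeerStatus {
      std::string crush_bucket;   // e.g. the rack the peer OSD lives in
      bool reachable_front;       // heartbeat seen on the public (front) network
      bool reachable_back;        // heartbeat seen on the cluster (back) network
    };

    void report_unreachable_buckets(const std::map<int, PeerStatus>& peers,
                                    double threshold = 0.5) {
      // bucket name -> {unreachable count, total count}
      std::map<std::string, std::pair<int, int>> front_counts;
      std::map<std::string, std::pair<int, int>> back_counts;

      for (const auto& kv : peers) {
        const PeerStatus& st = kv.second;
        auto& f = front_counts[st.crush_bucket];
        auto& b = back_counts[st.crush_bucket];
        f.second++; b.second++;
        if (!st.reachable_front) f.first++;
        if (!st.reachable_back)  b.first++;
      }

      auto emit = [](const std::map<std::string, std::pair<int, int>>& counts,
                     const char* net, double thresh) {
        for (const auto& entry : counts) {
          double frac = static_cast<double>(entry.second.first) / entry.second.second;
          if (frac >= thresh)
            // A real implementation would send this to the cluster log.
            std::cout << "cannot reach " << static_cast<int>(frac * 100)
                      << "% of peers in " << entry.first
                      << " on " << net << "\n";
        }
      };
      emit(front_counts, "front", threshold);
      emit(back_counts, "back", threshold);
    }

    int main() {
      // Toy input: two peers in rack-C are unreachable on the front network.
      std::map<int, PeerStatus> peers = {
        {1, {"rack-A", true,  true }},
        {2, {"rack-C", false, false}},
        {3, {"rack-C", false, true }},
      };
      report_unreachable_buckets(peers);
      return 0;
    }

With the toy input above, the sketch would report rack-C as mostly unreachable on both front and back, which is the kind of aggregated hint the proposal wants a rack-level network problem to produce.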