On Sat, 14 Apr 2018, Xiaoxi Chen wrote:
> Hi,
>
> we are consistently seeing this issue after upgrading to luminous
> from jewel, the behavior looks like monitor cannot handle
> mon_subscribe from client fast enough, then we see
>
> high cpu (1600% + with simple messenger) for monitor
> cluster pg state changing slowly as OSDs cannot get latest map fast enough.
>
> in some cases like reboot an OSD node (24 OSDs per node) can cause
> even bigger impact, OSDs even cannot update their auth in time and
> after a while we saw massive OSDs been marked down due to heartbeat
> failure, like
>
> 2018-04-11 21:19:24.772558 7f6bbb7f5700 0 cephx server osd.234:
> unexpected key: req.key=690bba2ca98774a2 expected_key=f63feaae2014a837
> 2018-04-11 21:19:26.539295 7f6bbb7f5700 0 cephx server osd.365:
> unexpected key: req.key=a0eb995e1bef1bf4 expected_key=bafe2e4d55a63478
>
> There are a bit more details about the attempts we have made, in
> the ticket http://tracker.ceph.com/issues/23713.
>
> Any suggestion is much appreciated. Thanks.

My guess is that this is the compat reencoding of the OSDMap for the
pre-luminous clients. Are you by chance making use of the crush-compat
balancer? That would additionally require a reencoded crush map.

Can you do a 'perf top -p `pidof ceph-mon`' while this is happening to see
where the time is being spent?

Thanks!
sage