High MON cpu usage when cluster is changing

Xiaoxi Chen <superdebuger@xxxxxxxxx> · Sat, 14 Apr 2018 02:22:56 +0800

Hi,

    We have been seeing this issue consistently since upgrading from
Jewel to Luminous. It looks like the monitor cannot handle
mon_subscribe requests from clients fast enough, and as a result we see:

    High CPU usage (1600%+ with the simple messenger) on the monitor.
    Cluster PG states change slowly, as OSDs cannot get the latest map
fast enough.
    In some cases, such as rebooting an OSD node (24 OSDs per node), the
impact is even bigger: OSDs cannot even update their auth in time, and
after a while we saw a large number of OSDs marked down due to heartbeat
failure, with log messages like:
       2018-04-11 21:19:24.772558 7f6bbb7f5700  0 cephx server osd.234:  unexpected key: req.key=690bba2ca98774a2 expected_key=f63feaae2014a837
       2018-04-11 21:19:26.539295 7f6bbb7f5700  0 cephx server osd.365:  unexpected key: req.key=a0eb995e1bef1bf4 expected_key=bafe2e4d55a63478
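
   To get a rough idea of how many OSDs hit the key mismatch, messages
like the ones above can be tallied per OSD. This is only a minimal
sketch: the regular expression is an assumption based on the two
messages quoted here, and the function name count_unexpected_keys is
made up for illustration, not part of Ceph.

```python
import re
from collections import Counter

# Assumed pattern, derived only from the two log messages quoted above;
# other Ceph releases may format this message differently.
PATTERN = re.compile(
    r"cephx server\s+(osd\.\d+):\s+unexpected key:\s+"
    r"req\.key=([0-9a-f]+)\s+expected_key=([0-9a-f]+)"
)


def count_unexpected_keys(log_text):
    """Return a Counter mapping OSD name to its number of key mismatches."""
    # Collapse all whitespace so messages wrapped across lines (as in
    # mail) still match the pattern.
    flat = " ".join(log_text.split())
    return Counter(m.group(1) for m in PATTERN.finditer(flat))


# The two messages from this mail, wrapped as they were originally sent.
sample = """2018-04-11 21:19:24.772558 7f6bbb7f5700  0 cephx server
osd.234:  unexpected key: req.key=690bba2ca98774a2
expected_key=f63feaae2014a837
2018-04-11 21:19:26.539295 7f6bbb7f5700  0 cephx server osd.365:
unexpected key: req.key=a0eb995e1bef1bf4 expected_key=bafe2e4d55a63478"""

if __name__ == "__main__":
    # Print OSDs ordered by how often they hit the mismatch.
    for osd, n in count_unexpected_keys(sample).most_common():
        print(osd, n)
```

   Run against a full log, the per-OSD counts should show whether the
mismatches are confined to the rebooted node or spread across the
cluster.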

   More details about the attempts we have made so far are in the
ticket: http://tracker.ceph.com/issues/23713

   Any suggestions are much appreciated. Thanks.

Xiaoxi