Re: High MON cpu usage when cluster is changing


On Sat, 14 Apr 2018, Xiaoxi Chen wrote:
> Hi,
> 
>     We have been seeing this issue consistently since upgrading from
> Jewel to Luminous. It looks like the monitor cannot handle
> mon_subscribe requests from clients fast enough, and as a result we see:
> 
>     High CPU usage (1600%+ with the simple messenger) on the monitor.
>     Cluster PG state changing slowly, because OSDs cannot get the latest
> map fast enough.
>     In some cases, such as rebooting an OSD node (24 OSDs per node), an
> even bigger impact: OSDs cannot update their auth in time, and
> after a while we saw a large number of OSDs marked down due to
> heartbeat failure, with messages like
>        2018-04-11 21:19:24.772558 7f6bbb7f5700  0 cephx server osd.234:  unexpected key: req.key=690bba2ca98774a2 expected_key=f63feaae2014a837
>        2018-04-11 21:19:26.539295 7f6bbb7f5700  0 cephx server osd.365:  unexpected key: req.key=a0eb995e1bef1bf4 expected_key=bafe2e4d55a63478
> 
>    There are more details about the attempts we have made so far in
> the ticket: http://tracker.ceph.com/issues/23713
> 
>    Any suggestions are much appreciated. Thanks.

My guess is that this is the compat reencoding of the OSDMap for the 
pre-luminous clients.

Are you by chance making use of the crush-compat balancer? That would 
additionally require a reencoded crush map.

Can you do a 'perf top -p `pidof ceph-mon`' while this is happening to see 
where the time is being spent?
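
If an interactive perf top session is awkward to paste into a reply, the same
information can probably be captured to a file instead. The following is only
a sketch, assuming perf is installed, a single ceph-mon process runs on the
host, and a 30 second sample is representative. The build_perf_cmd helper and
the ceph-mon.perf.data file name are made up for illustration.

```shell
#!/bin/sh
# Sketch: record a call-graph profile of ceph-mon and print the hottest symbols.
# Assumptions (not from the thread): perf is installed, exactly one ceph-mon
# runs on this host, and a 30 second window covers the high-CPU period.

build_perf_cmd() {
    # $1 = pid to profile, $2 = seconds to sample
    echo "perf record -g -p $1 -o ceph-mon.perf.data -- sleep $2"
}

if pid=$(pidof ceph-mon); then
    cmd=$(build_perf_cmd "$pid" 30)
    echo "running: $cmd"
    # Record, then print the top of the text report for pasting into a reply.
    $cmd && perf report -i ceph-mon.perf.data --stdio | head -n 50
else
    echo "ceph-mon is not running on this host" >&2
fi
```

Run it on the monitor host while the cluster is changing, so the sample
overlaps the period of high CPU usage.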

Thanks!
sage

