Thanks Sage, that makes sense. Most of our clients are pre-luminous: jewel
accounts for 83%, and the rest are mostly hammer. But we do not enable the
balancer, nor do we ever run `reweight-by-utilization`.

I just ran `ceph osd down osd.0~9`, and osd.9 took a very long time to
_boot. During that period `crush_hash32_3` looks like the biggest consumer,
reaching 30%+, an unnamed function `0x00000038e9a` also takes 20%, and
sometimes buffer::ptr also takes ~20%:

    21.69%  ceph-mon  [.] ceph::buffer::ptr::ptr(ceph::buffer::ptr const&, unsigned int, unsigned int)
    20.60%  ceph-mon  [.] ceph::buffer::ptr::copy_out(unsigned int, unsigned int, char*) const
    15.77%  ceph-mon  [.] ceph::buffer::ptr::release()

From `top -H -p <pid of mon>`, the thread list is full of pipe_writers.

xiaoxi

2018-04-14 6:05 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
> On Sat, 14 Apr 2018, Xiaoxi Chen wrote:
>> Hi,
>>
>> We are consistently seeing this issue after upgrading to luminous from
>> jewel. The behavior looks like the monitor cannot handle mon_subscribe
>> from clients fast enough, and then we see:
>>
>>   high CPU (1600%+ with simple messenger) on the monitor
>>   cluster PG state changing slowly, as OSDs cannot get the latest map
>>   fast enough
>>   in some cases, rebooting an OSD node (24 OSDs per node) causes an
>>   even bigger impact: OSDs cannot even update their auth in time, and
>>   after a while we saw massive numbers of OSDs marked down due to
>>   heartbeat failures, like
>>
>>   2018-04-11 21:19:24.772558 7f6bbb7f5700  0 cephx server osd.234: unexpected key: req.key=690bba2ca98774a2 expected_key=f63feaae2014a837
>>   2018-04-11 21:19:26.539295 7f6bbb7f5700  0 cephx server osd.365: unexpected key: req.key=a0eb995e1bef1bf4 expected_key=bafe2e4d55a63478
>>
>> There are a few more details about the attempts we have made in the
>> ticket http://tracker.ceph.com/issues/23713.
>>
>> Any suggestion is much appreciated. Thanks.
>
> My guess is that this is the compat reencoding of the OSDMap for the
> pre-luminous clients.
>
> Are you by chance making use of the crush-compat balancer? That would
> additionally require a reencoded crush map.
>
> Can you do a 'perf top -p `pidof ceph-mon`' while this is happening to see
> where the time is being spent?
>
> Thanks!
> sage
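
For anyone following along, a minimal sketch of how to confirm the client
mix and balancer state that Sage is asking about, using standard
luminous-era CLI commands (no cluster-specific names assumed):

    # Summarize connected daemons and clients by release and feature set
    # (shows, e.g., how many client sessions report jewel vs luminous).
    ceph features

    # Version breakdown of the cluster's own daemons (mon/mgr/osd).
    ceph versions

    # Report whether the mgr balancer is active and which mode it uses;
    # this errors out if the balancer module is not loaded at all, which
    # by itself answers the crush-compat question.
    ceph balancer status

    # Show the minimum client release the OSDMap currently requires.
    ceph osd dump | grep min_compat_client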
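
Similarly, to capture where ceph-mon is spending its CPU in a shareable
form (rather than watching `perf top` interactively), standard perf
tooling works; the 30-second window and the output trimming below are
just one reasonable choice:

    # Sample the running ceph-mon with call graphs for 30 seconds.
    perf record -g -p $(pidof ceph-mon) -- sleep 30

    # Print the hottest call paths from the recorded profile.
    perf report --stdio | head -n 60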