Re: Nautilus 14.2.19 mon 100% CPU

On Sat, Apr 10, 2021 at 2:10 AM Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>
> On Fri, Apr 9, 2021 at 4:04 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >
> > Here's what you should look for, with debug_mon=10. It shows clearly
> > that it takes the mon 23 seconds to run through
> > get_removed_snaps_range.
> > So if this is happening every 30s, it explains at least part of why
> > this mon is busy.
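> >
> > To turn that level on cluster-wide and watch for the slow calls,
> > something like this should work (a sketch -- adjust the mon name
> > and log path for your cluster):
> >
> >   ceph config set mon debug_mon 10/10
> >   grep get_removed_snaps_range /var/log/ceph/ceph-mon.sun-storemon01.log
> >   ceph config rm mon debug_mon   # back to the default when done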
> >
> > 2021-04-09 17:07:27.238 7f9fc83e4700 10 mon.sun-storemon01@0(leader) e45 handle_subscribe mon_subscribe({mdsmap=3914079+,monmap=0+,osdmap=1170448})
> > 2021-04-09 17:07:27.238 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 check_osdmap_sub 0x55e2e2133de0 next 1170448 (onetime)
> > 2021-04-09 17:07:27.238 7f9fc83e4700  5 mon.sun-storemon01@0(leader).osd e1987355 send_incremental [1170448..1987355] to client.131831153
> > 2021-04-09 17:07:28.590 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 0 [1~3]
> > 2021-04-09 17:07:29.898 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 5 []
> > 2021-04-09 17:07:31.258 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 6 []
> > 2021-04-09 17:07:32.562 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 20 []
> > 2021-04-09 17:07:33.866 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 21 []
> > 2021-04-09 17:07:35.162 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 22 []
> > 2021-04-09 17:07:36.470 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 23 []
> > 2021-04-09 17:07:37.778 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 24 []
> > 2021-04-09 17:07:39.090 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 25 []
> > 2021-04-09 17:07:40.398 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 26 []
> > 2021-04-09 17:07:41.706 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 27 []
> > 2021-04-09 17:07:43.006 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 28 []
> > 2021-04-09 17:07:44.322 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 29 []
> > 2021-04-09 17:07:45.630 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 30 []
> > 2021-04-09 17:07:46.938 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 31 []
> > 2021-04-09 17:07:48.246 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 32 []
> > 2021-04-09 17:07:49.562 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 34 []
> > 2021-04-09 17:07:50.862 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 35 []
> > 2021-04-09 17:07:50.862 7f9fc83e4700 20 mon.sun-storemon01@0(leader).osd e1987355 send_incremental starting with base full 1986745 664086 bytes
> > 2021-04-09 17:07:50.862 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 build_incremental [1986746..1986785] with features 107b84a842aca
> >
> > So have a look for that client again or other similar traces.
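> >
> > If it turns out to be a CephFS client again, the id in the log
> > (client.131831153 above) maps to an MDS session, so something like
> > this can find and evict it (a sketch -- adjust the mds rank):
> >
> >   ceph tell mds.0 client ls | grep -B2 -A8 131831153
> >   ceph tell mds.0 client evict id=131831153
> >   ceph osd blacklist ls   # eviction blacklists the client by default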
>
> So, even though I blacklisted the client and we remounted the file
> system on it, that wasn't enough to stop it from sending the same bad
> requests. We found another node that had two sessions to the same
> mount point. We rebooted both nodes, and the CPU is now back at a
> reasonable 4-6%, and the cluster is running at full performance
> again. I've added both MONs back in, so all 3 mons are in the system
> and there are no more elections. Thank you for helping us track down
> the bad clients out of over 2,000 clients.
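>
> In case it helps others, duplicate sessions per host can be spotted
> with something like this (a sketch; the jq paths may differ by
> release):
>
>   ceph tell mds.0 client ls | jq -r '.[].client_metadata.hostname' \
>       | sort | uniq -c | sort -rn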
>
> > > Maybe if that code path isn't needed in Nautilus it can be removed in
> > > the next point release?
> >
> > I think there were other major changes in this area that might make
> > such a backport difficult. And we should expect Nautilus to be
> > nearing its end...
>
> But ... we just got to Nautilus... :)

Ouch, we just suffered this or a similar issue on our big prod block
storage cluster running 14.2.19.
In our case, though, it wasn't related to an old client -- rather we
had 100% mon CPU and election storms, along with huge tcmalloc memory
usage, all following the recreation of a couple of OSDs.
We wrote the details here: https://tracker.ceph.com/issues/50587
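
In case anyone wants to check the tcmalloc side on their own mons,
the built-in heap hooks help (assuming "mon.foo" stands in for your
mon id):

  ceph tell mon.foo heap stats
  ceph tell mon.foo heap release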

-- Dan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


