Re: Nautilus 14.2.19 mon 100% CPU

On Fri, Apr 9, 2021 at 11:50 PM Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>
> On Fri, Apr 9, 2021 at 2:04 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >
> > On Fri, Apr 9, 2021 at 9:37 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Apr 9, 2021 at 8:39 PM Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On Fri, Apr 9, 2021 at 11:49 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > Thanks. I didn't see anything ultra obvious to me.
> > > > >
> > > > > But I did notice the nearfull warnings, so I wonder if this cluster
> > > > > is churning through osdmaps. Did you see a large increase in inbound
> > > > > or outbound network traffic on this mon following the upgrade?
> > > > > Totally speculating here, but maybe there is an issue where you have
> > > > > some old clients, which can't decode an incremental osdmap from a
> > > > > nautilus mon, so the single mon is busy serving up these maps to the
> > > > > clients.
> > > > >
> > > > > Does the mon load decrease if you stop the osdmap churn, e.g. by
> > > > > setting norebalance if rebalancing is indeed ongoing?
> > > > >
> > > > > Could you also share debug_ms = 1 for a minute of busy cpu mon?
> > > >
> > > > Here are the new logs with the debug_ms=1 for a bit.
> > > > https://owncloud.leblancnet.us/owncloud/index.php/s/1hvtJo3s2oLPpWn
> > >
> > > Something strange here: there is one hammer client asking for
> > > nearly a million incremental osdmaps, seemingly every 30s:
> > >
> > >     client.131831153 at 172.16.212.55 is asking for incrementals from
> > > 1170448..1987355 (see [1])
> > >
> > > Can you try to evict/kill/block that client and see if your mon load drops?
> > >
> >
> > Before you respond, just noting here for the record that I think
> > there's a possible issue with OSDMonitor::get_removed_snaps_range
> > and clients like this.
> >
> >     https://github.com/ceph/ceph/blob/v14.2.19/src/mon/OSDMonitor.cc#L4193
> >
> > Called by send_incremental:
> >
> >     https://github.com/ceph/ceph/blob/v14.2.19/src/mon/OSDMonitor.cc#L4152
> >
> > When building the incremental it will search the mon's rocksdb for
> > removed snaps across those ~million missing maps.
> >
> > That feature seems removed from octopus onward.
>
> I evicted that client, and the CPU usage hasn't gone down
> significantly. There may be other clients also causing the issue. Was
> it the `osdmap=1170448` part of the line that says how many osdmaps
> it's trying to get? I can look for others in the logs and evict them
> as well.
>

Here's what you should look for, with debug_mon=10. The trace below
shows clearly that it takes the mon 23 seconds to run through
get_removed_snaps_range, so if this is happening every 30s, it
explains at least part of why this mon is busy.

2021-04-09 17:07:27.238 7f9fc83e4700 10 mon.sun-storemon01@0(leader) e45 handle_subscribe mon_subscribe({mdsmap=3914079+,monmap=0+,osdmap=1170448})
2021-04-09 17:07:27.238 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 check_osdmap_sub 0x55e2e2133de0 next 1170448 (onetime)
2021-04-09 17:07:27.238 7f9fc83e4700  5 mon.sun-storemon01@0(leader).osd e1987355 send_incremental [1170448..1987355] to client.131831153
2021-04-09 17:07:28.590 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 0 [1~3]
2021-04-09 17:07:29.898 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 5 []
2021-04-09 17:07:31.258 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 6 []
2021-04-09 17:07:32.562 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 20 []
2021-04-09 17:07:33.866 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 21 []
2021-04-09 17:07:35.162 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 22 []
2021-04-09 17:07:36.470 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 23 []
2021-04-09 17:07:37.778 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 24 []
2021-04-09 17:07:39.090 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 25 []
2021-04-09 17:07:40.398 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 26 []
2021-04-09 17:07:41.706 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 27 []
2021-04-09 17:07:43.006 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 28 []
2021-04-09 17:07:44.322 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 29 []
2021-04-09 17:07:45.630 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 30 []
2021-04-09 17:07:46.938 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 31 []
2021-04-09 17:07:48.246 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 32 []
2021-04-09 17:07:49.562 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 34 []
2021-04-09 17:07:50.862 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 35 []
2021-04-09 17:07:50.862 7f9fc83e4700 20 mon.sun-storemon01@0(leader).osd e1987355 send_incremental starting with base full 1986745 664086 bytes
2021-04-09 17:07:50.862 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 build_incremental [1986746..1986785] with features 107b84a842aca

So have a look for that client again, or for other clients with
similar traces.
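If it helps, here's a rough way to pull those out of a log. This is
only a sketch: the sample line below is copied from the trace above,
and in practice you'd point the sed at your real mon log instead of
the temp file.

```shell
# Sketch: extract clients requesting huge incremental osdmap ranges
# from a mon log. The sample line is copied from the trace above;
# replace /tmp/mon_sample.log with your actual mon log path.
cat > /tmp/mon_sample.log <<'EOF'
2021-04-09 17:07:27.238 7f9fc83e4700  5 mon.sun-storemon01@0(leader).osd e1987355 send_incremental [1170448..1987355] to client.131831153
EOF

# Pull out "client first_epoch last_epoch", then print the gap per client.
sed -n 's/.*send_incremental \[\([0-9]*\)\.\.\([0-9]*\)\] to \(client\.[0-9]*\).*/\3 \1 \2/p' /tmp/mon_sample.log |
while read -r client first last; do
    echo "$client wants $((last - first)) incremental maps"
done
```

For the sample line this prints `client.131831153 wants 816907
incremental maps`; clients with gaps in that range are the ones worth
evicting first.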

> Maybe if that code path isn't needed in Nautilus it can be removed in
> the next point release?

I think there were other major changes in this area that might make
such a backport difficult. And nautilus is nearing the end of its
life anyway...


-- dan


>
> Thank you,
> Robert LeBlanc
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


