On Fri, Apr 9, 2021 at 11:50 PM Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>
> On Fri, Apr 9, 2021 at 2:04 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >
> > On Fri, Apr 9, 2021 at 9:37 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Apr 9, 2021 at 8:39 PM Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On Fri, Apr 9, 2021 at 11:49 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > Thanks. I didn't see anything ultra obvious to me.
> > > > >
> > > > > But I did notice the nearfull warnings so I wonder if this cluster is
> > > > > churning through osdmaps? Did you see a large increase in inbound or
> > > > > outbound network traffic on this mon following the upgrade?
> > > > > Totally speculating here, but maybe there is an issue where you have
> > > > > some old clients, which can't decode an incremental osdmap from a
> > > > > nautilus mon, so the single mon is busy serving up these maps to the
> > > > > clients.
> > > > >
> > > > > Does the mon load decrease if you stop the osdmap churn?, e.g. by
> > > > > setting norebalance if that is indeed ongoing.
> > > > >
> > > > > Could you also share debug_ms = 1 for a minute of busy cpu mon?
> > > >
> > > > Here are the new logs with the debug_ms=1 for a bit.
> > > > https://owncloud.leblancnet.us/owncloud/index.php/s/1hvtJo3s2oLPpWn
> > >
> > > Something strange in this is there is one hammer client that is asking
> > > for nearly a million incremental osdmaps, seemingly every 30s:
> > >
> > > client.131831153 at 172.16.212.55 is asking for incrementals from
> > > 1170448..1987355 (see [1])
> > >
> > > Can you try to evict/kill/block that client and see if your mon load drops?
> >
> > Before you respond, just noting here ftr that i think there's a
> > possible issue with OSDMonitor::get_removed_snaps_range and clients
> > like this.
> >
> > https://github.com/ceph/ceph/blob/v14.2.19/src/mon/OSDMonitor.cc#L4193
> >
> > Called by send_incremental:
> >
> > https://github.com/ceph/ceph/blob/v14.2.19/src/mon/OSDMonitor.cc#L4152
> >
> > When building the incremental it will search the mon's rocksdb for
> > removed snaps across those ~million missing maps.
> >
> > That feature seems removed from octopus onward.
>
> I evicted that client and CPU hasn't gone down significantly. There
> may be other clients also causing the issue. Was it the
> `osdmap=1170448` part of the line that says how many OSDmaps it's
> trying to get? I can look for others in the logs and evict them as
> well.

Here's what you should look for, with debug_mon=10. It shows clearly
that it takes the mon 23 seconds to run through get_removed_snaps_range.
So if this is happening every 30s, it explains at least part of why
this mon is busy.
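In case it's useful, here's a rough, untested sketch of how I'd capture
that and then hunt for other clients subscribing from very old osdmap
epochs. It assumes the mon id sun-storemon01 from your log and the
default log file location; adjust both for your setup.

  # temporarily raise the mon debug level, let it run ~1 minute, then lower it again
  ceph tell mon.sun-storemon01 injectargs '--debug_mon 10'
  sleep 60
  ceph tell mon.sun-storemon01 injectargs '--debug_mon 1'

  # subscriptions whose osdmap epoch is far behind the current epoch
  # (e1987355 in the trace below) are the suspects
  grep handle_subscribe /var/log/ceph/ceph-mon.sun-storemon01.log | grep 'osdmap='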
2021-04-09 17:07:27.238 7f9fc83e4700 10 mon.sun-storemon01@0(leader) e45 handle_subscribe mon_subscribe({mdsmap=3914079+,monmap=0+,osdmap=1170448})
2021-04-09 17:07:27.238 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 check_osdmap_sub 0x55e2e2133de0 next 1170448 (onetime)
2021-04-09 17:07:27.238 7f9fc83e4700 5 mon.sun-storemon01@0(leader).osd e1987355 send_incremental [1170448..1987355] to client.131831153
2021-04-09 17:07:28.590 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 0 [1~3]
2021-04-09 17:07:29.898 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 5 []
2021-04-09 17:07:31.258 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 6 []
2021-04-09 17:07:32.562 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 20 []
2021-04-09 17:07:33.866 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 21 []
2021-04-09 17:07:35.162 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 22 []
2021-04-09 17:07:36.470 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 23 []
2021-04-09 17:07:37.778 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 24 []
2021-04-09 17:07:39.090 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 25 []
2021-04-09 17:07:40.398 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 26 []
2021-04-09 17:07:41.706 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 27 []
2021-04-09 17:07:43.006 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 28 []
2021-04-09 17:07:44.322 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 29 []
2021-04-09 17:07:45.630 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 30 []
2021-04-09 17:07:46.938 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 31 []
2021-04-09 17:07:48.246 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 32 []
2021-04-09 17:07:49.562 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 34 []
2021-04-09 17:07:50.862 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 get_removed_snaps_range 35 []
2021-04-09 17:07:50.862 7f9fc83e4700 20 mon.sun-storemon01@0(leader).osd e1987355 send_incremental starting with base full 1986745 664086 bytes
2021-04-09 17:07:50.862 7f9fc83e4700 10 mon.sun-storemon01@0(leader).osd e1987355 build_incremental [1986746..1986785] with features 107b84a842aca

So have a look for that client again or other similar traces.

> Maybe if that code path isn't needed in Nautilus it can be removed in
> the next point release?

I think there were other major changes in this area that might make
such a backport difficult. And we should expect nautilus to be nearing
its end...

-- dan

>
> Thank you,
> Robert LeBlanc