I don't have a solution to offer, but I've seen this for years without one. Any time an MGR bounces, be it for upgrades, a new daemon coming online, etc., I'll see a scale spike like the one reported below.

Just out of curiosity, which MGR plugins are you using? I have historically used the influx plugin for stats exports, and the spike shows up in those exported values as well, throwing everything off. I don't see it in my Zabbix stats, though those are scraped at a longer interval that may simply not catch it. Just looking for any common threads.
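In case it helps to compare notes, below is roughly what I check when this happens. It is only a sketch using the standard ceph CLI; the mgr name is simply the active one taken from your output further down, and the exact command output can vary a bit by release:

--------------------------------------------------------------------------------
# List which mgr modules are enabled (influx, prometheus, zabbix, ...).
ceph mgr module ls

# Show the current active mgr and the standbys.
ceph mgr dump

# Force a failover to a standby mgr; in my experience any mgr bounce like
# this is enough to make the bogus I/O numbers show up for a short while,
# so only do it when you actually want to observe the effect.
ceph mgr fail server2
--------------------------------------------------------------------------------

Reed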
> On May 4, 2021, at 3:46 AM, Nico Schottelius <nico.schottelius@xxxxxxxxxxx> wrote:
>
> Hello,
>
> we have a recurring, funky problem with managers on Nautilus (and
> probably also earlier versions): the manager displays incorrect
> information.
>
> This is a recurring pattern and it also breaks the Prometheus graphs,
> because the I/O is reported wildly incorrectly: "recovery: 43 TiB/s,
> 3.62k keys/s, 11.40M objects/s" - which basically blows up the scale
> of any related graph and makes it unusable.
>
> The latest example from today shows slow ops for an OSD
> that has been down for 17h:
>
> --------------------------------------------------------------------------------
> [09:50:31] black2.place6:~# ceph -s
>   cluster:
>     id:     1ccd84f6-e362-4c50-9ffe-59436745e445
>     health: HEALTH_WARN
>             18 slow ops, oldest one blocked for 975 sec, osd.53 has slow ops
>
>   services:
>     mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
>     mgr: server2(active, since 2w), standbys: server8, server4, server9, server6, ciara3
>     osd: 108 osds: 107 up (since 17h), 107 in (since 17h)
>
>   data:
>     pools:   4 pools, 2624 pgs
>     objects: 42.52M objects, 162 TiB
>     usage:   486 TiB used, 298 TiB / 784 TiB avail
>     pgs:     2616 active+clean
>              8    active+clean+scrubbing+deep
>
>   io:
>     client:   522 MiB/s rd, 22 MiB/s wr, 8.18k op/s rd, 689 op/s wr
> --------------------------------------------------------------------------------
>
> Killing the manager on server2 changes the status to another temporarily
> incorrect one: the rebalance shown below actually finished hours ago, and
> it is paired with the bogus rebalance speed that we see from time to time:
>
> --------------------------------------------------------------------------------
> [09:51:59] black2.place6:~# ceph -s
>   cluster:
>     id:     1ccd84f6-e362-4c50-9ffe-59436745e445
>     health: HEALTH_OK
>
>   services:
>     mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
>     mgr: server8(active, since 11s), standbys: server4, server9, server6, ciara3
>     osd: 108 osds: 107 up (since 17h), 107 in (since 17h)
>
>   data:
>     pools:   4 pools, 2624 pgs
>     objects: 42.52M objects, 162 TiB
>     usage:   486 TiB used, 298 TiB / 784 TiB avail
>     pgs:     2616 active+clean
>              8    active+clean+scrubbing+deep
>
>   io:
>     client:   214 TiB/s rd, 54 TiB/s wr, 4.86G op/s rd, 1.06G op/s wr
>     recovery: 43 TiB/s, 3.62k keys/s, 11.40M objects/s
>
>   progress:
>     Rebalancing after osd.53 marked out
>       [========================......]
> --------------------------------------------------------------------------------
>
> Then a bit later, the status on the newly started manager is correct:
>
> --------------------------------------------------------------------------------
> [09:52:18] black2.place6:~# ceph -s
>   cluster:
>     id:     1ccd84f6-e362-4c50-9ffe-59436745e445
>     health: HEALTH_OK
>
>   services:
>     mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
>     mgr: server8(active, since 47s), standbys: server4, server9, server6, server2, ciara3
>     osd: 108 osds: 107 up (since 17h), 107 in (since 17h)
>
>   data:
>     pools:   4 pools, 2624 pgs
>     objects: 42.52M objects, 162 TiB
>     usage:   486 TiB used, 298 TiB / 784 TiB avail
>     pgs:     2616 active+clean
>              8    active+clean+scrubbing+deep
>
>   io:
>     client:   422 MiB/s rd, 39 MiB/s wr, 7.91k op/s rd, 752 op/s wr
> --------------------------------------------------------------------------------
>
> Question: is this a known bug, is anyone else seeing it, or are we doing
> something wrong?
>
> Best regards,
>
> Nico
>
> --
> Sustainable and modern Infrastructures by ungleich.ch
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx