Re: Manager carries wrong information until killing it

涂振南 <zn.tu@xxxxxxxxxxxxxxxxxx> · Tue, 14 Dec 2021 17:24:39 +0000

Hi there,
I ask you to look for additional information and write me the end result. Down below I send the legal request.

crhconsultores.co.mz/minimanumquam/autnumquamqui

Hello,

we have a recurring, funky problem with managers on Nautilus (and probably also earlier versions): the manager displays incorrect information.

This is a recurring pattern and it also breaks the prometheus graphs, as the I/O is reported insanely incorrectly: "recovery: 43 TiB/s, 3.62k keys/s, 11.40M objects/s", which basically changes the scale of any related graph to unusable.

The latest example from today shows slow ops for an OSD that has been down for 17h:

--------------------------------------------------------------------------------
[09:50:31] black2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_WARN
            18 slow ops, oldest one blocked for 975 sec, osd.53 has slow ops

  services:
    mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
    mgr: server2(active, since 2w), standbys: server8, server4, server9, server6, ciara3
    osd: 108 osds: 107 up (since 17h), 107 in (since 17h)

  data:
    pools:   4 pools, 2624 pgs
    objects: 42.52M objects, 162 TiB
    usage:   486 TiB used, 298 TiB / 784 TiB avail
    pgs:     2616 active+clean
             8    active+clean+scrubbing+deep

  io:
    client:   522 MiB/s rd, 22 MiB/s wr, 8.18k op/s rd, 689 op/s wr
--------------------------------------------------------------------------------

Killing the manager on server2 changes the status to another, temporarily incorrect status: the rebalance finished hours ago, yet it is shown as in progress, paired with the incorrect rebalance speed that we see from time to time:

--------------------------------------------------------------------------------
[09:51:59] black2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
    mgr: server8(active, since 11s), standbys: server4, server9, server6, ciara3
    osd: 108 osds: 107 up (since 17h), 107 in (since 17h)

  data:
    pools:   4 pools, 2624 pgs
    objects: 42.52M objects, 162 TiB
    usage:   486 TiB used, 298 TiB / 784 TiB avail
    pgs:     2616 active+clean
             8    active+clean+scrubbing+deep

  io:
    client:   214 TiB/s rd, 54 TiB/s wr, 4.86G op/s rd, 1.06G op/s wr
    recovery: 43 TiB/s, 3.62k keys/s, 11.40M objects/s

  progress:
    Rebalancing after osd.53 marked out
      [========================......]
--------------------------------------------------------------------------------

Then, a bit later, the status on the newly started manager is correct:

--------------------------------------------------------------------------------
[09:52:18] black2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum server9,server2,server8,server6,server4 (age 2w)
    mgr: server8(active, since 47s), standbys: server4, server9, server6, server2, ciara3
    osd: 108 osds: 107 up (since 17h), 107 in (since 17h)

  data:
    pools:   4 pools, 2624 pgs
    objects: 42.52M objects, 162 TiB
    usage:   486 TiB used, 298 TiB / 784 TiB avail
    pgs:     2616 active+clean
             8    active+clean+scrubbing+deep

  io:
    client:   422 MiB/s rd, 39 MiB/s wr, 7.91k op/s rd, 752 op/s wr
--------------------------------------------------------------------------------

Question: is this a known bug, is anyone else seeing it, or are we doing something wrong?

Best regards,

Nico

--
Sustainable and modern Infrastructures by ungleich.ch
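
For anyone who wants to spot such bogus samples before they wreck a graph's scale, here is a minimal Python sketch that scans an `io:` line from the `ceph -s` output quoted above and returns the rates that exceed a plausibility ceiling. It is not part of Ceph or of the original report. The function name `implausible_rates` and both ceilings are hypothetical, and the sketch assumes that byte rates use binary prefixes (MiB, TiB) while the k/M/G count suffixes are powers of 1000.

```python
import re

# Multipliers for the units seen in the quoted `ceph -s` output.
# Assumption: byte rates are binary (MiB, TiB), count suffixes are decimal.
BYTE_UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20,
              "GiB": 2**30, "TiB": 2**40, "PiB": 2**50}
COUNT_SUFFIXES = {"": 1, "k": 10**3, "M": 10**6, "G": 10**9}

BYTE_RATE = re.compile(r"(\d+(?:\.\d+)?) (B|KiB|MiB|GiB|TiB|PiB)/s")
COUNT_RATE = re.compile(r"(\d+(?:\.\d+)?)([kMG]?) (?:op|keys|objects)/s")

# Hypothetical ceilings; pick values that match your own hardware.
MAX_BYTES_PER_SEC = 100 * 2**30   # 100 GiB/s
MAX_COUNT_PER_SEC = 10**7         # 10 million ops/keys/objects per second


def implausible_rates(line):
    """Return the rate strings in `line` that exceed the ceilings.

    Byte rates are listed first, then op/key/object rates.
    """
    found = []
    for m in BYTE_RATE.finditer(line):
        if float(m.group(1)) * BYTE_UNITS[m.group(2)] > MAX_BYTES_PER_SEC:
            found.append(m.group(0))
    for m in COUNT_RATE.finditer(line):
        if float(m.group(1)) * COUNT_SUFFIXES[m.group(2)] > MAX_COUNT_PER_SEC:
            found.append(m.group(0))
    return found


if __name__ == "__main__":
    print(implausible_rates(
        "recovery: 43 TiB/s, 3.62k keys/s, 11.40M objects/s"))
    print(implausible_rates(
        "client: 522 MiB/s rd, 22 MiB/s wr, 8.18k op/s rd, 689 op/s wr"))
```

A check like this only hides the symptom on the monitoring side; it does not explain why the manager reports these numbers in the first place.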
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx