This sounds similar to an inquiry I submitted a couple of years ago [1], in
which I discovered that the choose_acting function does not consider
primary affinity when choosing the primary OSD. I had assumed it would when
developing my procedure for replacing failing disks. After that discovery I
changed my process to stop the OSD daemon of the failing disk (accepting
degraded PGs) to ensure it is no longer participating in the PG. I am not
sure whether any of the relevant code has changed since that submission,
but what you describe here seems similar.

[1] https://tracker.ceph.com/issues/44400

Respectfully,

*Wes Dillingham*
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>


On Fri, May 20, 2022 at 7:53 AM Denis Polom <denispolom@xxxxxxxxx> wrote:

> Hi
>
> I have been observing high latencies and hanging mount points while
> draining an OSD since the Octopus release, and the problem is still
> present on the latest Pacific.
>
> Cluster setup:
>
> Ceph Pacific 16.2.7
>
> CephFS with an EC data pool
>
> EC profile setup:
>
> crush-device-class=
> crush-failure-domain=host
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=10
> m=2
> plugin=jerasure
> technique=reed_sol_van
> w=8
>
> Description:
>
> When we have a broken drive, we remove it from the Ceph cluster by
> draining it first, i.e. by setting its CRUSH weight to 0:
>
> ceph osd crush reweight osd.1 0
>
> On Nautilus this did not affect clients. But since the upgrade to
> Octopus (and up to the current Pacific release) I observe very high I/O
> latencies on clients while an OSD is being drained (10 s and higher).
>
> While debugging I found that the drained OSD is still listed as the
> ACTING_PRIMARY, and that this happens only on EC pools and only since
> Octopus. To be sure, I tested it again on Nautilus, where the behavior
> is correct and the drained OSD is no longer listed in the UP and ACTING
> sets of the PGs.
>
> Even setting the primary-affinity of the given OSD to 0 has no effect on
> the EC pool.
>
> Below are my debug outputs.
>
> Buggy behavior on Octopus and Pacific:
>
> Before draining osd.70:
>
> PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  DISK_LOG  STATE  STATE_STAMP  VERSION  REPORTED  UP  UP_PRIMARY  ACTING  ACTING_PRIMARY  LAST_SCRUB  SCRUB_STAMP  LAST_DEEP_SCRUB  DEEP_SCRUB_STAMP  SNAPTRIMQ_LEN
> 16.1fff  2269  0  0  0  0  8955297727  0  0  2449  2449  active+clean  2022-05-19T08:41:55.241734+0200  19403690'275685  19407588:19607199  [70,206,216,375,307,57]  70  [70,206,216,375,307,57]  70  19384365'275621  2022-05-19T08:41:55.241493+0200  19384365'275621  2022-05-19T08:41:55.241493+0200  0
> dumped pgs
>
> After setting the CRUSH weight of osd.70 to 0 (osd.70 is still the
> acting primary):
>
> PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  DISK_LOG  STATE  STATE_STAMP  VERSION  REPORTED  UP  UP_PRIMARY  ACTING  ACTING_PRIMARY  LAST_SCRUB  SCRUB_STAMP  LAST_DEEP_SCRUB  DEEP_SCRUB_STAMP  SNAPTRIMQ_LEN
> 16.1fff  2269  0  0  2269  0  8955297727  0  0  2449  2449  active+remapped+backfill_wait  2022-05-20T08:51:54.249071+0200  19403690'275685  19407668:19607289  [71,206,216,375,307,57]  71  [70,206,216,375,307,57]  70  19384365'275621  2022-05-19T08:41:55.241493+0200  19384365'275621  2022-05-19T08:41:55.241493+0200  0
> dumped pgs
>
> Correct behavior on Nautilus:
>
> Before draining osd.10:
>
> PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  DISK_LOG  STATE  STATE_STAMP  VERSION  REPORTED  UP  UP_PRIMARY  ACTING  ACTING_PRIMARY  LAST_SCRUB  SCRUB_STAMP  LAST_DEEP_SCRUB  DEEP_SCRUB_STAMP  SNAPTRIMQ_LEN
> 2.4e  2  0  0  0  0  8388608  0  0  2  2  active+clean  2022-05-20 02:13:47.432104  61'2  75:40  [10,0,7]  10  [10,0,7]  10  0'0  2022-05-20 01:44:36.217286  0'0  2022-05-20 01:44:36.217286  0
>
> After setting the CRUSH weight of osd.10 to 0 (behavior is correct,
> osd.10 is no longer listed or used):
>
> root@nautilus1:~# ceph pg dump pgs | head -2
> PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  DISK_LOG  STATE  STATE_STAMP  VERSION  REPORTED  UP  UP_PRIMARY  ACTING  ACTING_PRIMARY  LAST_SCRUB  SCRUB_STAMP  LAST_DEEP_SCRUB  DEEP_SCRUB_STAMP  SNAPTRIMQ_LEN
> 2.4e  14  0  0  0  0  58720256  0  0  18  18  active+clean  2022-05-20 02:18:59.414812  75'18  80:43  [22,0,7]  22  [22,0,7]  22  0'0  2022-05-20 01:44:36.217286  0'0  2022-05-20 01:44:36.217286  0
>
> Now the question is: is this intended behavior?
>
> Or is it a bug?
>
> Thank you!
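For anyone following this thread, here is a minimal sketch of the stop-first
workaround Wes describes, using osd.70 and PG 16.1fff from the output above
as example IDs. It assumes a non-containerized, systemd-managed OSD; adjust
the IDs and unit names to your deployment, and treat the exact command
sequence as illustrative rather than a confirmed procedure.

    # confirm which OSD is currently the acting primary for the affected PG
    ceph pg map 16.1fff

    # optionally check that stopping the OSD will not reduce availability
    ceph osd ok-to-stop 70

    # stop the failing OSD's daemon first, so it can no longer serve as
    # acting primary (this accepts temporarily degraded PGs instead of a
    # remapped backfill with the drained OSD still handling client I/O)
    systemctl stop ceph-osd@70

    # then mark it out and drain it as before
    ceph osd out osd.70
    ceph osd crush reweight osd.70 0

Whether temporarily degraded PGs are acceptable in place of a clean drain
depends on the pool's redundancy (k=10, m=2 here) and your risk tolerance.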