Hi Eugen, thanks for your answer. I gave a search another try and did indeed find something:

https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/TN6WJVCHTVJ4YIA4JH2D2WYYZFZRMSXI/

Quote: "... And I've also observed that the repair req isn't queued up -- if the OSDs are busy with other scrubs, the repair req is forgotten. ..."

I'm biting my tongue really really hard right now.

@Dan (if you read this), thanks for the script:
https://github.com/cernceph/ceph-scripts/blob/master/tools/scrubbing/autorepair.sh
(a rough sketch of the same re-issue idea is appended at the bottom of this mail)

New status:

# ceph status
  cluster:
    id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 8d)
    mgr: ceph-25(active, since 8d), standbys: ceph-26, ceph-03, ceph-02, ceph-01
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1086 osds: 1071 up (since 14h), 1070 in (since 4d); 542 remapped pgs

  task status:

  data:
    pools:   14 pools, 17185 pgs
    objects: 1.39G objects, 2.5 PiB
    usage:   3.1 PiB used, 8.4 PiB / 11 PiB avail
    pgs:     301878494/11947144857 objects misplaced (2.527%)
             16634 active+clean
             513   active+remapped+backfill_wait
             19    active+remapped+backfilling
             10    active+remapped+backfill_wait+forced_backfill
             6     active+clean+scrubbing+deep
             2     active+clean+scrubbing
             1     active+clean+scrubbing+deep+inconsistent+repair

  io:
    client:   444 MiB/s rd, 446 MiB/s wr, 2.19k op/s rd, 2.34k op/s wr
    recovery: 0 B/s, 223 objects/s

Yay!

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: 13 October 2022 23:23:10
To: ceph-users@xxxxxxx
Subject: Re: pg repair doesn't start

Hi,

I'm not sure if I remember correctly, but I believe the backfill is
preventing the repair from happening. I think it has been discussed a
couple of times on this list, but I don't know right now whether you can
tweak anything to prioritize the repair; I believe there is something,
but I'm not sure. It looks like your backfill could take quite some time…

Zitat von Frank Schilder <frans@xxxxxx>:

> Hi all,
>
> we have an inconsistent PG for a couple of days now (octopus latest):
>
> # ceph status
>   cluster:
>     id:
>     health: HEALTH_ERR
>             1 scrub errors
>             Possible data damage: 1 pg inconsistent
>
>   services:
>     mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 8d)
>     mgr: ceph-25(active, since 8d), standbys: ceph-26, ceph-03,
>     ceph-02, ceph-01
>     mds: con-fs2:8 4 up:standby 8 up:active
>     osd: 1086 osds: 1071 up (since 13h), 1070 in (since 4d); 547 remapped pgs
>
>   task status:
>
>   data:
>     pools:   14 pools, 17185 pgs
>     objects: 1.39G objects, 2.5 PiB
>     usage:   3.1 PiB used, 8.4 PiB / 11 PiB avail
>     pgs:     305530535/11943726075 objects misplaced (2.558%)
>              16614 active+clean
>              516   active+remapped+backfill_wait
>              23    active+clean+scrubbing+deep
>              21    active+remapped+backfilling
>              10    active+remapped+backfill_wait+forced_backfill
>              1     active+clean+inconsistent
>
>   io:
>     client:   143 MiB/s rd, 135 MiB/s wr, 2.21k op/s rd, 2.33k op/s wr
>     recovery: 0 B/s, 224 objects/s
>
> I issued "ceph pg repair 11.1ba" more than 36 hours ago, but it
> never got executed (checked the logs for repair state). The usual
> wait time we had on our cluster so far was 2-6 hours. 36 hours is
> unusually long. The pool in question is moderately busy and has no
> misplaced objects. Its only unhealthy PG is the inconsistent one.
>
> Are there situations in which ceph cancels/ignores a pg repair?
> Is there any way to check if it is actually still scheduled to happen?
> Is there a way to force it a bit more urgently?
>
> The error was caused by a read error, the drive is healthy:
>
> 2022-10-11T19:19:13.621470+0200 osd.231 (osd.231) 40 : cluster [ERR]
> 11.1ba shard 294(6) soid
> 11:5df75341:::rbd_data.1.b688997dc79def.000000000005d530:head :
> candidate had a read error
> 2022-10-11T19:26:22.344862+0200 osd.231 (osd.231) 41 : cluster [ERR]
> 11.1bas0 deep-scrub 0 missing, 1 inconsistent objects
> 2022-10-11T19:26:22.344866+0200 osd.231 (osd.231) 42 : cluster [ERR]
> 11.1ba deep-scrub 1 errors
> 2022-10-11T19:26:23.356402+0200 mgr.ceph-25 (mgr.144330518) 378551 :
> cluster [DBG] pgmap v301249: 17334 pgs: 1 active+clean+inconsistent,
> 2 active+clean+scrubbing, 26 active+remapped+backfill_wait, 13
> active+remapped+backfilling, 19 active+clean+scrubbing+deep, 17273
> active+clean; 2.5 PiB data, 3.1 PiB used, 8.4 PiB / 11 PiB avail;
> 193 MiB/s rd, 181 MiB/s wr, 4.95k op/s; 16126995/11848511097 objects
> misplaced (0.136%); 0 B/s, 513 objects/s recovering
> 2022-10-11T19:26:24.246194+0200 mon.ceph-01 (mon.0) 633486 : cluster
> [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
> 2022-10-11T19:26:24.246215+0200 mon.ceph-01 (mon.0) 633487 : cluster
> [ERR] Health check failed: Possible data damage: 1 pg inconsistent
> (PG_DAMAGED)
>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
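P.S. For the archives, here is a rough sketch of the "re-issue the repair
until it actually starts" idea referenced above. It is NOT Dan's
autorepair.sh (see the GitHub link for the real script); it just loops over
the PGs reported by "ceph pg ls inconsistent" and calls "ceph pg repair"
again whenever the PG state does not yet contain "repair". The JSON field
names (pg_stats, pgid, state), the jq dependency and the 10-minute retry
interval are assumptions to verify against your own Ceph release.

#!/bin/bash
# Rough sketch only -- NOT cernceph autorepair.sh. Re-issue 'ceph pg repair'
# for every inconsistent PG until its state actually contains "repair",
# since a repair request that cannot get a scrub slot is silently dropped.
# Assumes jq is installed and that 'ceph pg ls <state> -f json' returns a
# "pg_stats" array with "pgid" and "state" fields (check on your release).
set -euo pipefail

while true; do
    json=$(ceph pg ls inconsistent -f json)
    pgids=$(echo "$json" | jq -r '.pg_stats[]?.pgid')

    if [ -z "$pgids" ]; then
        echo "no inconsistent PGs left, done"
        break
    fi

    for pg in $pgids; do
        state=$(echo "$json" | jq -r --arg pg "$pg" \
            '.pg_stats[] | select(.pgid == $pg) | .state')
        if [[ "$state" != *repair* ]]; then
            echo "$(date -Is) re-issuing repair for $pg (state: $state)"
            ceph pg repair "$pg"
        fi
    done

    sleep 600   # let the scrub scheduler catch up before retrying
done

Before letting something like this loose, "rados list-inconsistent-obj
<pgid> --format=json-pretty" is useful to confirm the inconsistency really
is just a read error like the one in the log above, and "ceph health detail"
shows when the scrub error has cleared.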