Hi all,

we have had an inconsistent PG for a couple of days now (latest Octopus):

# ceph status
  cluster:
    id:
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 8d)
    mgr: ceph-25(active, since 8d), standbys: ceph-26, ceph-03, ceph-02, ceph-01
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1086 osds: 1071 up (since 13h), 1070 in (since 4d); 547 remapped pgs

  task status:

  data:
    pools:   14 pools, 17185 pgs
    objects: 1.39G objects, 2.5 PiB
    usage:   3.1 PiB used, 8.4 PiB / 11 PiB avail
    pgs:     305530535/11943726075 objects misplaced (2.558%)
             16614 active+clean
             516   active+remapped+backfill_wait
             23    active+clean+scrubbing+deep
             21    active+remapped+backfilling
             10    active+remapped+backfill_wait+forced_backfill
             1     active+clean+inconsistent

  io:
    client:   143 MiB/s rd, 135 MiB/s wr, 2.21k op/s rd, 2.33k op/s wr
    recovery: 0 B/s, 224 objects/s

I issued "ceph pg repair 11.1ba" more than 36 hours ago, but it has never been executed (I checked the logs for the repair state). The usual wait time on our cluster has so far been 2-6 hours; 36 hours is unusually long. The pool in question is moderately busy and has no misplaced objects. Its only unhealthy PG is the inconsistent one.

Are there situations in which ceph cancels/ignores a pg repair? Is there any way to check whether it is actually still scheduled to happen? Is there a way to force it to run with a bit more urgency?

The inconsistency was caused by a read error; the drive itself is healthy:

2022-10-11T19:19:13.621470+0200 osd.231 (osd.231) 40 : cluster [ERR] 11.1ba shard 294(6) soid 11:5df75341:::rbd_data.1.b688997dc79def.000000000005d530:head : candidate had a read error
2022-10-11T19:26:22.344862+0200 osd.231 (osd.231) 41 : cluster [ERR] 11.1bas0 deep-scrub 0 missing, 1 inconsistent objects
2022-10-11T19:26:22.344866+0200 osd.231 (osd.231) 42 : cluster [ERR] 11.1ba deep-scrub 1 errors
2022-10-11T19:26:23.356402+0200 mgr.ceph-25 (mgr.144330518) 378551 : cluster [DBG] pgmap v301249: 17334 pgs: 1 active+clean+inconsistent, 2 active+clean+scrubbing, 26 active+remapped+backfill_wait, 13 active+remapped+backfilling, 19 active+clean+scrubbing+deep, 17273 active+clean; 2.5 PiB data, 3.1 PiB used, 8.4 PiB / 11 PiB avail; 193 MiB/s rd, 181 MiB/s wr, 4.95k op/s; 16126995/11848511097 objects misplaced (0.136%); 0 B/s, 513 objects/s recovering
2022-10-11T19:26:24.246194+0200 mon.ceph-01 (mon.0) 633486 : cluster [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2022-10-11T19:26:24.246215+0200 mon.ceph-01 (mon.0) 633487 : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
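P.S. For completeness, these are the commands I know of for checking on this; the grep pattern for the cluster log and the config option names are my best guess for Octopus, so please correct me if I should be looking somewhere else.

The cluster log on the mons, for a repair line that never shows up:

# grep -E '11\.1ba.*repair' /var/log/ceph/ceph.log

The PG itself and the details of the inconsistency:

# ceph health detail
# ceph pg 11.1ba query
# rados list-inconsistent-obj 11.1ba --format=json-pretty

The scrub-related settings on the primary (osd.231), since there is backfill running elsewhere in the cluster:

# ceph config show osd.231 | grep -E 'osd_max_scrubs|osd_scrub_during_recovery'

And, short of re-issuing the repair or a manual deep-scrub, I am not aware of any other knob:

# ceph pg repair 11.1ba
# ceph pg deep-scrub 11.1ba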