Hi all,

we have had an inconsistent PG for a couple of days now (latest Octopus):

# ceph status
  cluster:
    id:
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 8d)
    mgr: ceph-25(active, since 8d), standbys: ceph-26, ceph-03, ceph-02, ceph-01
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1086 osds: 1071 up (since 13h), 1070 in (since 4d); 547 remapped pgs

  task status:

  data:
    pools:   14 pools, 17185 pgs
    objects: 1.39G objects, 2.5 PiB
    usage:   3.1 PiB used, 8.4 PiB / 11 PiB avail
    pgs:     305530535/11943726075 objects misplaced (2.558%)
             16614 active+clean
             516   active+remapped+backfill_wait
             23    active+clean+scrubbing+deep
             21    active+remapped+backfilling
             10    active+remapped+backfill_wait+forced_backfill
             1     active+clean+inconsistent

  io:
    client:   143 MiB/s rd, 135 MiB/s wr, 2.21k op/s rd, 2.33k op/s wr
    recovery: 0 B/s, 224 objects/s

I issued "ceph pg repair 11.1ba" more than 36 hours ago, but it has never been executed (I checked the logs for the repair state). The usual wait time on our cluster has so far been 2-6 hours; 36 hours is unusually long. The pool in question is moderately busy and has no misplaced objects. Its only unhealthy PG is the inconsistent one.

Are there situations in which ceph cancels/ignores a pg repair? Is there any way to check whether it is actually still scheduled to happen? Is there a way to force it to run with a bit more urgency?

The inconsistency was caused by a read error; the drive itself is healthy:

2022-10-11T19:19:13.621470+0200 osd.231 (osd.231) 40 : cluster [ERR] 11.1ba shard 294(6) soid 11:5df75341:::rbd_data.1.b688997dc79def.000000000005d530:head : candidate had a read error
2022-10-11T19:26:22.344862+0200 osd.231 (osd.231) 41 : cluster [ERR] 11.1bas0 deep-scrub 0 missing, 1 inconsistent objects
2022-10-11T19:26:22.344866+0200 osd.231 (osd.231) 42 : cluster [ERR] 11.1ba deep-scrub 1 errors
2022-10-11T19:26:23.356402+0200 mgr.ceph-25 (mgr.144330518) 378551 : cluster [DBG] pgmap v301249: 17334 pgs: 1 active+clean+inconsistent, 2 active+clean+scrubbing, 26 active+remapped+backfill_wait, 13 active+remapped+backfilling, 19 active+clean+scrubbing+deep, 17273 active+clean; 2.5 PiB data, 3.1 PiB used, 8.4 PiB / 11 PiB avail; 193 MiB/s rd, 181 MiB/s wr, 4.95k op/s; 16126995/11848511097 objects misplaced (0.136%); 0 B/s, 513 objects/s recovering
2022-10-11T19:26:24.246194+0200 mon.ceph-01 (mon.0) 633486 : cluster [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2022-10-11T19:26:24.246215+0200 mon.ceph-01 (mon.0) 633487 : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
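P.S. For completeness, these are the commands I know of for checking on this; the grep pattern for the cluster log and the config option names are my best guess for Octopus, so please correct me if I should be looking somewhere else.

The cluster log on the mons, for a repair line that never shows up:

# grep -E '11\.1ba.*repair' /var/log/ceph/ceph.log

The PG itself and the details of the inconsistency:

# ceph health detail
# ceph pg 11.1ba query
# rados list-inconsistent-obj 11.1ba --format=json-pretty

The scrub-related settings on the primary (osd.231), since there is backfill running elsewhere in the cluster:

# ceph config show osd.231 | grep -E 'osd_max_scrubs|osd_scrub_during_recovery'

And, short of re-issuing the repair or a manual deep-scrub, I am not aware of any other knob:

# ceph pg repair 11.1ba
# ceph pg deep-scrub 11.1ba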