Re: pg repair doesn't start

Hi,

I'm not sure I remember correctly, but I believe the ongoing backfill is preventing the repair from happening. I think this has been discussed a couple of times on this list. I don't know offhand whether there is anything you can tweak to prioritize the repair (I believe there is, but I'm not sure). It looks like your backfill could take quite some time…
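From memory, the knobs involved should be osd_scrub_during_recovery and osd_repair_during_recovery; if I remember correctly, the latter lets an operator-requested repair run even while the OSDs are busy backfilling. A rough sketch of what I would try, but please double-check the exact option names against the Octopus docs before changing anything, I'm writing this from memory and haven't tested it on your release:

  # check what is currently allowed while recovery/backfill is active
  ceph config get osd osd_scrub_during_recovery
  ceph config get osd osd_repair_during_recovery

  # allow the requested repair to run despite the backfill, then re-issue it
  ceph config set osd osd_repair_during_recovery true
  ceph pg repair 11.1ba
  ceph pg ls inconsistent

  # revert afterwards
  ceph config rm osd osd_repair_during_recovery

That at least avoids waiting for the whole backfill to drain before the repair gets scheduled.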

Quoting Frank Schilder <frans@xxxxxx>:

Hi all,

we have had an inconsistent PG for a couple of days now (latest Octopus):

# ceph status
  cluster:
    id:
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 8d)
    mgr: ceph-25(active, since 8d), standbys: ceph-26, ceph-03, ceph-02, ceph-01
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1086 osds: 1071 up (since 13h), 1070 in (since 4d); 547 remapped pgs

  task status:

  data:
    pools:   14 pools, 17185 pgs
    objects: 1.39G objects, 2.5 PiB
    usage:   3.1 PiB used, 8.4 PiB / 11 PiB avail
    pgs:     305530535/11943726075 objects misplaced (2.558%)
             16614 active+clean
             516   active+remapped+backfill_wait
             23    active+clean+scrubbing+deep
             21    active+remapped+backfilling
             10    active+remapped+backfill_wait+forced_backfill
             1     active+clean+inconsistent

  io:
    client:   143 MiB/s rd, 135 MiB/s wr, 2.21k op/s rd, 2.33k op/s wr
    recovery: 0 B/s, 224 objects/s

I issued "ceph pg repair 11.1ba" more than 36 hours ago, but it never got executed (checked the logs for repair state). The usual wait time we had on our cluster so far was 2-6 hours. 36 hours is unusually long. The pool in question is moderately busy and has no misplaced ojects. Its only unhealthy PG is the inconsistent one.

Are there situations in which ceph cancels/ignores a pg repair?
Is there any way to check if it is actually still scheduled to happen?
Is there a way to force it a bit more urgently?

The error was caused by a read error; the drive itself is healthy:

2022-10-11T19:19:13.621470+0200 osd.231 (osd.231) 40 : cluster [ERR] 11.1ba shard 294(6) soid 11:5df75341:::rbd_data.1.b688997dc79def.000000000005d530:head : candidate had a read error
2022-10-11T19:26:22.344862+0200 osd.231 (osd.231) 41 : cluster [ERR] 11.1bas0 deep-scrub 0 missing, 1 inconsistent objects
2022-10-11T19:26:22.344866+0200 osd.231 (osd.231) 42 : cluster [ERR] 11.1ba deep-scrub 1 errors
2022-10-11T19:26:23.356402+0200 mgr.ceph-25 (mgr.144330518) 378551 : cluster [DBG] pgmap v301249: 17334 pgs: 1 active+clean+inconsistent, 2 active+clean+scrubbing, 26 active+remapped+backfill_wait, 13 active+remapped+backfilling, 19 active+clean+scrubbing+deep, 17273 active+clean; 2.5 PiB data, 3.1 PiB used, 8.4 PiB / 11 PiB avail; 193 MiB/s rd, 181 MiB/s wr, 4.95k op/s; 16126995/11848511097 objects misplaced (0.136%); 0 B/s, 513 objects/s recovering
2022-10-11T19:26:24.246194+0200 mon.ceph-01 (mon.0) 633486 : cluster [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
2022-10-11T19:26:24.246215+0200 mon.ceph-01 (mon.0) 633487 : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent (PG_DAMAGED)

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx