Hello all,

We have an EC (4+2) pool for RGW data, with HDDs plus SSDs for WAL/DB. This pool spans 9 servers, each with 12 disks of 16 TB. About 10 days ago we lost a server and we've removed its OSDs from the cluster. Ceph started to remap and backfill as expected, but the process has been getting slower and slower. Today the recovery rate is around 12 MiB/s and 10 objects/s. All the remaining unclean PGs are backfilling:

  data:
    volumes: 1/1 healthy
    pools:   14 pools, 14497 pgs
    objects: 192.38M objects, 380 TiB
    usage:   764 TiB used, 1.3 PiB / 2.1 PiB avail
    pgs:     771559/1065561630 objects degraded (0.072%)
             1215899/1065561630 objects misplaced (0.114%)
             14428 active+clean
             50    active+undersized+degraded+remapped+backfilling
             18    active+remapped+backfilling
             1     active+clean+scrubbing+deep

We've checked the health of the remaining servers, and everything looks fine (CPU/RAM/network/disks).

Any hints on what could be happening?

Thank you,
Gauvain
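
P.S. In case it helps with the diagnosis, below is a rough sketch of how we are dumping the recovery/backfill throttles. This assumes a release with the centralized config database (ceph config) and, for the mclock line, that the mclock scheduler is in use (the default from Quincy on); osd.0 is just a placeholder daemon id.

  # Cluster-wide defaults for the backfill/recovery throttles
  ceph config get osd osd_max_backfills
  ceph config get osd osd_recovery_max_active
  ceph config get osd osd_recovery_sleep_hdd

  # With the mclock scheduler, the active profile caps recovery bandwidth
  ceph config get osd osd_mclock_profile

  # Running value on one OSD (osd.0 is only an example)
  ceph config show osd.0 osd_max_backfills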