Hi Laimis,

Thank you for the suggestion. I issued ceph pg repair to all inconsistent PGs, but so far nothing has changed. Are deep scrubs even going to start with this much recovery in progress?

We are currently using balanced as the osd_mclock_profile, but I'm considering changing it to high_recovery_ops to increase recovery speed; lower client IO is not a problem for us at the moment. Would that have any adverse effects?

Kind regards,
Gustavo

________________________________
From: Laimis Juzeliūnas <laimis.juzeliunas@xxxxxxxxxx>
Sent: Friday, February 28, 2025 10:12 AM
To: Gustavo Garcia Rondina <grondina@xxxxxxxxxxxx>
Cc: ceph-users <ceph-users@xxxxxxx>
Subject: Re: Replace OSD while cluster is recovering?

Hi Gustavo,

Focus on fixing the inconsistent PGs first, either via deep scrub or by explicitly telling the cluster to repair them. Once that is done, you are good to go with riskier operations.

However, if the OSD is already out of the cluster, recovery of its data is already underway, and swapping the OSD for a working one should not cause damage. Just make sure the old one is purged and the replacement comes up with a new ID.

Best,
Laimis J.

On Fri, Feb 28, 2025, 17:58 Gustavo Garcia Rondina <grondina@xxxxxxxxxxxx> wrote:

Hello list,

We have a Ceph cluster (17.2.6 Quincy) with 2 admin nodes and 6 storage nodes, each storage node connected to a JBOD enclosure. Each enclosure houses 28 HDDs of 18 TB each, totaling 168 OSDs. The pool that holds the majority of the data is erasure-coded (4+2).

We recently had one disk failure, which brought one OSD down:

# ceph osd tree | grep down
  2    hdd   16.49579          osd.2    down         0  1.00000

This OSD is out of the cluster, but we haven't replaced it physically yet. The problem we are facing is that the cluster was not in the best shape when this OSD failed. Currently we have the following:

################################################
  cluster:
    id:     <redacted>
    health: HEALTH_ERR
            1026 scrub errors
            Possible data damage: 18 pgs inconsistent
            2137 pgs not deep-scrubbed in time
            2137 pgs not scrubbed in time

  services:
    mon: 5 daemons, quorum xyz-admin1,xyz-admin2,xyz-osd1,xyz-osd2,xyz-osd3 (age 17M)
    mgr: xyz-admin2.sipadf (active, since 17M), standbys: xyz-admin1.nwaovh
    mds: 2/2 daemons up, 2 standby
    osd: 168 osds: 167 up (since 44h), 167 in (since 6w); 220 remapped pgs

  data:
    volumes: 2/2 healthy
    pools:   9 pools, 2137 pgs
    objects: 448.54M objects, 1.0 PiB
    usage:   1.6 PiB used, 1.1 PiB / 2.7 PiB avail
    pgs:     134404830/2676514497 objects misplaced (5.022%)
             1902 active+clean
             191  active+remapped+backfilling
             26   active+remapped+backfill_wait
             15   active+clean+inconsistent
             2    active+remapped+inconsistent+backfilling
             1    active+remapped+inconsistent+backfill_wait

  io:
    recovery: 597 MiB/s, 252 objects/s

  progress:
    Global Recovery Event (6w)
      [=========================...] (remaining: 5d)
################################################

I have noticed the number of active+clean PGs increasing (it was ~1750 two days ago) and the count of misplaced objects very slowly decreasing.

My question is: should I wait until recovery is complete, then repair the 18 damaged PGs, and only then replace the disk? My thinking is that replacing the disk will trigger more backfilling, which will slow the recovery down even more.

Another question: should I disable scrubbing while the recovery is not yet finalized?

Thank you for any insights you may be able to provide!

- Gustavo
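
For reference, here is a rough sketch of the commands discussed in this thread, assuming a Quincy (17.2.x) cluster; <pool-name>, <pg-id>, <host> and <device-path> are placeholders, and the last command only applies if the cluster is managed by cephadm:

    # list the PGs flagged inconsistent by the scrub errors
    ceph health detail | grep inconsistent
    rados list-inconsistent-pg <pool-name>

    # queue a repair for one inconsistent PG (repeat per PG)
    ceph pg repair <pg-id>

    # prioritise recovery/backfill over client IO, then revert when recovery is done
    ceph config set osd osd_mclock_profile high_recovery_ops
    ceph config set osd osd_mclock_profile balanced

    # optionally pause scrubbing until backfill settles (unset both flags afterwards)
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # purge the failed OSD so the replacement drive comes up with a fresh ID
    ceph osd purge 2 --yes-i-really-mean-it
    # with cephadm, the orchestrator can then create the new OSD on the replacement drive
    ceph orch daemon add osd <host>:<device-path>

Note that by default an OSD will not start scrubs while it has recovery in progress (osd_scrub_during_recovery is false), so repair and deep-scrub requests may only be acted on once backfill quiets down.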
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx