I don't think your plan will work as expected: as written, step 3 will introduce additional data movement. I suggest the following instead:

1.) With the "norebalance" flag set, set the CRUSH weight to 0 ("ceph osd crush reweight") for every OSD you intend to replace.

2.) Optionally, run a tool like upmap-remapped.py to reduce your misplaced percentage (I won't elaborate on this too much since it's optional).

3.) Unset "norebalance" and let the data drain off the old OSDs. Once the misplaced PG count reaches 0, "destroy" each OSD with "ceph osd destroy". IMO "destroy" would have been better named "pending-replacement".

4.) Set the norebalance flag again and recreate the OSDs, specifying re-use of the old OSD IDs and, for multi-device OSDs, the block.db and block.wal LVs that were previously in use.

5.) Once all OSDs are re-added, set the CRUSH weight of the replaced OSDs to their new value based on the new disk size, if it didn't get updated in the previous step.

6.) Optionally, run the upmap-remapped.py script again.

7.) Unset the norebalance flag and wait for all PGs to become active+clean again.
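
In commands, roughly -- and this is only a sketch: the OSD IDs, the device path, the DB/WAL LV names, and the new CRUSH weight below are placeholders, and I'm assuming a plain ceph-volume (non-cephadm) deployment:

  # freeze rebalancing, then drain the OSDs being replaced
  ceph osd set norebalance
  ceph osd crush reweight osd.12 0
  ceph osd crush reweight osd.37 0     # ...repeat for each OSD to be replaced

  # optional: ./upmap-remapped.py | sh   (maps remapped PGs back to cut the misplaced count)
  ceph osd unset norebalance
  # wait until "ceph status" shows no misplaced PGs

  # mark each drained OSD as pending replacement; this keeps its ID
  ceph osd destroy 12 --yes-i-really-mean-it

  # swap the physical disks, then recreate each OSD with its old ID
  ceph osd set norebalance
  ceph-volume lvm create --osd-id 12 --data /dev/sdX \
      --block.db ceph-db-vg/db-lv-12 --block.wal ceph-wal-vg/wal-lv-12

  # adjust the CRUSH weight to the new disk size (TiB) if it isn't already correct
  ceph osd crush reweight osd.12 18.19

  # optional: ./upmap-remapped.py | sh   again
  ceph osd unset norebalance
  # wait for all PGs to become active+clean

Repeat the destroy/recreate part for each OSD. On a cephadm-managed cluster you would go through the orchestrator (ceph orch) rather than calling ceph-volume directly.
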

Respectfully,

*Wes Dillingham*
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
wes@xxxxxxxxxxxxxxxxx


On Thu, Oct 10, 2024 at 9:08 AM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:

> > We need to replace about 40 disks distributed over all 12 hosts backing
> > a large pool with EC 8+3. We can't do it host by host as it would take
> > way too long (replace disks per host and let recovery rebuild the data)
>
> <soapbox>This is one of the false economies of HDDs ;) </soapbox>
>
> > Therefore, we would like to evacuate all data from these disks
> > simultaneously and with as little data movement as possible. This is
> > the procedure that seems to do the trick:
> >
> > 1.) For all OSDs: ceph osd reweight ID 0   # Note: not "osd crush reweight"
>
> Note that this will run afoul of the balancer module. I *think* also that
> it will result in the data moving to OSDs on the same host.
>
> > 2.) Wait for rebalance to finish
> > 3.) Replace disks and deploy OSDs with the same IDs as before per host
> > 4.) Start OSDs and let rebalance back
> >
> > I tested step 1 on Octopus with 1 disk and it seems to work. The reason
> > I ask is that step 1 actually marks the OSDs as OUT. However, they are
> > still UP and I see only misplaced objects, not degraded objects. It is
> > a bit counter-intuitive, but it seems that UP+OUT OSDs still
> > participate in IO.
> >
> > Because it is counter-intuitive, I would like to have a second opinion.
> > I have read before that others reweight to something like 0.001 and
> > hope that this flushes all PGs. I would prefer not to rely on hope and
> > a reweight to 0 apparently is a valid choice here, leading to a
> > somewhat weird state with UP+OUT OSDs.
> >
> > Problems that could arise are timeouts I'm overlooking that will make
> > data chunks on UP+OUT OSDs unavailable after some time. I'm also
> > wondering if UP+OUT OSDs participate in peering in case there is an
> > OSD restart somewhere in the pool.
> >
> > Thanks for your input and best regards!
> >
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx