Yeah, I should have mentioned the swap-bucket option. We couldn't use
that because we didn't actually swap anything; we moved the old hosts
to a different root and keep them for erasure-coding pools.
Quoting Anthony D'Atri <aad@xxxxxxxxxxxxxx>:
The strategy that Nghia described is inefficient in that it moves
data more than once, but it is safe since there are always N copies,
versus a strategy of setting noout, destroying the OSDs, and
recreating them on the new server. That would be more efficient,
albeit with a period of reduced redundancy.
I’ve done what Eugen describes, slightly differently:
- Create a staging root
- Create a host bucket there with the new nodename
- Create new OSDs, CRUSH weight them to 0
- Move into the production root
- Weight up using your method of choice
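Roughly, and with placeholder names (the "staging" root, "newnode1",
osd.12 and the target weight are just examples), that sequence looks
like:

  # create a staging root and a host bucket for the new node
  ceph osd crush add-bucket staging root
  ceph osd crush add-bucket newnode1 host
  ceph osd crush move newnode1 root=staging
  # provision the OSDs on newnode1, then zero their CRUSH weight
  ceph osd crush reweight osd.12 0
  # move the host into the production root and weight up
  ceph osd crush move newnode1 root=default
  ceph osd crush reweight osd.12 3.64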
Another option would be, if the hardware is compatible, to set
noout, take down one node, destroy the OSDs, swap in the new
drives/node, provision the OSDs with the same IDs, and wait for
balancing. But you have a period of reduced redundancy, and the
wrong drive failing can cause grief.
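If you go that route, the flow is roughly the following (osd.7 and
/dev/sdb are placeholders for your IDs and devices):

  ceph osd set noout
  ceph osd destroy osd.7 --yes-i-really-mean-it
  # physically swap the drives/node, then re-create the OSD
  # with the same ID on the new hardware
  ceph-volume lvm create --osd-id 7 --data /dev/sdb
  ceph osd unset noout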
I think, though, that this sort of scenario may be what swap-bucket
was designed for.
https://docs.ceph.com/docs/mimic/rados/operations/bluestore-migration/
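Once the new host's bucket and OSDs exist outside the production
root, it comes down to something like (host names are examples):

  ceph osd crush swap-bucket oldnode1 newnode1

which is meant to swap the two host buckets' places in the CRUSH
tree in one step; see the doc above for the exact procedure and
caveats.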
On Apr 1, 2020, at 5:43 AM, Eugen Block <eblock@xxxxxx> wrote:
Hi,
I have a different approach in mind for a replacement; we
successfully accomplished that last year in our production
environment, where we replaced all nodes of the cluster with newer
hardware. Of course we wanted to avoid rebalancing the data
multiple times.
What we did was to create a new "root" bucket in our crush tree
parallel to root=default, then we moved the old nodes to the new
root. This couldn't trigger any rebalancing because there were no
hosts left in the default root, but the data was still available to
the clients as if nothing had changed.
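In terms of commands that was essentially (the root name "retired"
and the host names are just examples):

  ceph osd crush add-bucket retired root
  ceph osd crush move oldnode1 root=retired
  ceph osd crush move oldnode2 root=retired
  # ...repeat the move for each old host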
Then we added the new nodes to the default root with an initial osd
crush weight of 0. After all new nodes and OSDs were there, we
increased the weights to start the data movement. This way all PGs
were recreated only once, on the new nodes, slowly draining the old
servers.
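We set the initial weight via the config option and raised the
weights afterwards, roughly like this (3.64 is just an example, use
the disk size in TiB):

  # in ceph.conf on the new OSD nodes, before creating the OSDs
  osd crush initial weight = 0

  # once all new OSDs are in place, weight them up
  ceph osd crush reweight osd.0 3.64
  ceph osd crush reweight osd.1 3.64
  # ...and so on for every new OSD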
This should be a valid approach for a single server, too. Create a
(temporary) new root or bucket within your crush tree and move the
old host to that bucket. Then add your new server to the correct
root with initial osd crush weight = 0. When all OSDs are there,
increase the weight for all OSDs at once to start the data movement.
This was all in a Luminous cluster.
Regards,
Eugen
Quoting Nghia Viet Tran <Nghia.Viet.Tran@xxxxxxxxxx>:
Hi everyone,
I'm working on replacing an OSD node with a newer one. The new
host has a new hostname and new disks (faster ones, but the same
size as the old disks). My plan is:
- Reweight the OSDs to zero to spread the existing data across the
remaining nodes and keep it available
- Set the noout, norebalance, norecover and nobackfill flags,
destroy the OSDs and add the new OSDs with the same IDs as the old ones.
With the above approach, the cluster remaps PGs across all nodes,
and each piece of data is moved twice before it reaches the new OSD
(once for the reweight and again when the new node joins with the
same IDs).
I also tried the other way, only setting the flags and destroying
the OSDs, but the result was the same (degraded objects from the
destroyed OSDs and misplaced objects after the new OSDs joined).
Is there any way to replace the OSD node directly without
remapping PGs across the whole cluster?
Many thanks!
Nghia.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx