Re: replace osd with Octopus

>>> When replacing an OSD, there will be no PG remapping, and backfill
>>> will restore the data on the new disk, right?
>> 
>> That depends on how you decide to go through the replacement process.
>> Usually without your intervention (e.g. setting the appropriate OSD
>> flags) the remapping will happen after an OSD goes down and out.
> 
> This has been unclear to me. Is the OSD going to be marked out and the
> PGs remapped during replacement? Or does it depend on the process?
> 
> When an OSD is marked out, remapping will happen and the data migration
> will take some time. Is the cluster in a degraded state for that duration?

If you set the `noout` flag on the affected OSDs or the entire cluster, there won’t be remapping.
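
For reference, a minimal sketch of that (cluster-wide; recent releases can also scope the flag to individual OSDs, e.g. with `ceph osd add-noout`, but check your version):

    # prevent down OSDs from being marked out while you work
    ceph osd set noout

    # ... swap the drive / do the maintenance ...

    # restore the normal down -> out behaviour afterwards
    ceph osd unset noout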

If the OSD fails and is marked `out`, there will be remapping and balancing.

> My understanding is that remapping only happens when the OSD is marked
> out.

CRUSH topology and rule changes can result in misplaced objects too, but that’s a tangent.

> The replacement process will keep the OSD in the whole time, assuming it
> is replaced with the same disk model.

`ceph osd destroy` is your friend.
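
Roughly, the destroy-based replacement flow looks like the sketch below; osd.12 and /dev/sdX are placeholders, and the ceph-volume details vary with your deployment tooling:

    # mark the OSD destroyed, keeping its id and CRUSH weight/position
    ceph osd destroy 12 --yes-i-really-mean-it

    # wipe the replacement drive and recreate the OSD with the same id
    ceph-volume lvm zap /dev/sdX --destroy
    ceph-volume lvm create --osd-id 12 --data /dev/sdX

Because the id and CRUSH weight are preserved, the PG mappings stay put and the new OSD simply backfills into place.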

> Replacing with a different size could be more complicated, because the
> weight has to be adjusted for the size change and PGs may be rebalanced.

If you replace an OSD with a drive of a different size, and the CRUSH weight is changed to match, then yes, almost certainly some PG acting sets will change.
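
The CRUSH weight is under your control either way; a hedged example, assuming osd.12 and a replacement drive of roughly 10 TB:

    # set the CRUSH weight explicitly (by convention, weight ~= capacity in TiB)
    ceph osd crush reweight osd.12 9.09

Keeping the old weight instead avoids the data movement, at the cost of leaving the extra capacity unused.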


>>> The key question here is how much time backfilling and rebalancing will take.
>>> The intention is to not keep the cluster in a degraded state for too long.
>>> I assume they are similar, because either way the same amount of data is
>>> copied? If that's true, then option #2 is pointless.
>>> Could anyone share experiences, like how long it takes to recover how much
>>> data on what kind of networking/computing environment?
>> 
>> No, option 2 is not pointless; it helps you prevent a degraded state.
>> With a small cluster, or CRUSH rules that only tolerate a few failed OSDs,
>> it could be dangerous to take out an entire node, risking another failure
>> and potential data loss. It highly depends on your specific setup and
>> whether you're willing to take the risk during the rebuild of a node.

Agreed.  Overlapping failures can and do happen.  The flip side is that if one lets recovery complete, there has to be enough unused capacity in the right places to accommodate the new data replicas.
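
A quick way to sanity-check that headroom before letting recovery run:

    # per-OSD and per-failure-domain utilisation
    ceph osd df tree

    # pool-level usage and available capacity
    ceph df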

If, say, the affected cluster is on a different continent and you don’t have trustworthy 24x7 remote hands, then it could take some time to replace a failed drive or node.  In this case, it likely is advantageous to let the cluster recover.

If, however, you can get the affected drive / node back faster than recovery would take, it can be advantageous to prevent recovery until the OSDs are back up.  Either way, Ceph has to create data replicas from survivors.  *If* you can replace a failed drive immediately, then there’s no extra risk and you can cut data movement very roughly in half.
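
If you go that route, the usual approach (in addition to `noout`) is to pause recovery and backfill explicitly while the drive or node is swapped, for example:

    # hold off data movement while the OSDs are down
    ceph osd set norebalance
    ceph osd set nobackfill
    ceph osd set norecover

    # clear the flags once the OSDs are back up and in
    ceph osd unset norebalance
    ceph osd unset nobackfill
    ceph osd unset norecover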

This ties into the `mon_osd_down_out_subtree_limit` setting.  Depending on one’s topology, it can prevent a thundering herd of recovery, with the idea that it’s often faster to get a node back up than it would be to recover all that data.  This also avoids surviving OSDs potentially becoming full, but one has to have good monitoring so that this state does not continue indefinitely.
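
For example, to keep the mons from automatically marking out anything as large as a whole host (the bucket type `host` here is an assumption about your CRUSH topology):

    ceph config set mon mon_osd_down_out_subtree_limit host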

Basically, any time PGs are undersized, there’s risk of an overlapping failure.  The best course is often a question of which strategy will get them back to full size.  Remapped PGs aren’t so big a deal, because at all times you have the desired number of replicas.
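
You can watch exactly that distinction while you work, e.g.:

    # PGs that currently have fewer than the desired number of replicas
    ceph pg ls undersized

    # overall counts of degraded vs. misplaced objects
    ceph -s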

>> The recovery/backfill speed also depends on the size of the OSDs, the
>> object sizes, the amount of data, etc.

It’s also a function of HDD vs SSD, replication vs EC, whether omaps are significantly involved, throttle settings, cluster size and topology, etc.
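
The throttles in particular are worth knowing about; a hedged example of loosening them temporarily to speed up backfill (the values are illustrative, not recommendations, and defaults differ across releases):

    ceph config set osd osd_max_backfills 4
    ceph config set osd osd_recovery_max_active 8

Remember to set them back afterwards, since higher values compete with client I/O.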


>> You would probably need to search the mailing list for examples from
>> someone sharing their experience; I haven't captured such statistics.
> 
> My conclusion was based on two assumptions; correct me if they are wrong.
> 1) the cluster is degraded during remapping.

Be careful what you consider “degraded”.  Remapped / misplaced PGs still have the full number of replicas; “degraded” means some replicas are missing.

> 2) no remapping when recovering an OSD.
> 
> For option #1, there is no remapping, just a degraded state during recovery.
> For option #2, there is remapping twice: once to remap PGs from the old OSD
> to others, and again when the new OSD is in place.
> It seems the degraded state lasts twice as long with option #2 as with #1.
> Is that right?

It can be, depending on your topology.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



