Re: replace osd with Octopus

> >>> When replacing an osd, there will be no PG remapping, and backfill
> >>> will restore the data on the new disk, right?
> >>
> >> That depends on how you decide to go through the replacement process.
> >> Usually without your intervention (e.g. setting the appropriate OSD
> >> flags) the remapping will happen after an OSD goes down and out.
> >
> > This has been unclear to me. Is the OSD going to be marked out and PGs
> > remapped during replacement? Or does it depend on the process?
> >
> > When an OSD is marked out, remapping will happen and data migration
> > will take some time. Is the cluster in a degraded state during that time?
> 
> If you set the `noout` flag on the affected OSDs or the entire cluster,
> there won’t be remapping.
> 
> If the OSD fails and is marked `out`, there will be remapping and
> balancing.
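
(For reference, this is how I understand those flags are set; the OSD id
below is just a placeholder, and I'm not sure every release has the
per-OSD form:

    ceph osd set noout              # cluster-wide
    ceph osd add-noout osd.12       # per-OSD, if supported
    ceph osd unset noout            # or: ceph osd rm-noout osd.12
)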

Here is the context.
https://docs.ceph.com/en/latest/mgr/orchestrator/#replace-an-osd

When a disk is broken:
1) orch osd rm <svc_id(s)> --replace [--force]
2) Replace disk.
3) ceph orch apply osd -i <osd_spec_file>
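
For step #3, this is roughly the osd spec I would try; the service_id,
host name and device filter are placeholders of mine, not from the docs:

    service_type: osd
    service_id: replaced_osd
    placement:
      hosts:
        - ceph-node-1
    data_devices:
      rotational: 1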

Step #1 marks the OSD "destroyed". I assume it has the same effect as
"ceph osd destroy", which keeps the OSD "in", so there is no PG remapping
and the cluster is in a "degraded" state.
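
To double-check that assumption, this is what I would expect to see
(osd.12 is a placeholder id):

    ceph osd destroy 12 --yes-i-really-mean-it   # what I assume --replace does
    ceph osd tree | grep -w osd.12               # should show "destroyed", still weighted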

After step #3, the OSD will be "up" and "in", and data will be recovered
back to the new disk. Is that right?
Is the cluster "degraded" or "healthy" during such recovery?
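
During that backfill, I assume something like the following would show
degraded/undersized PGs until it finishes (just how I would check it):

    ceph -s
    ceph health detail
    ceph pg stat      # watch the degraded/misplaced counts drop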

The other option differs only in that there is no "--replace" in step #1:
1) orch osd rm <svc_id(s)> [--force]
2) Replace disk.
3) ceph orch apply osd -i <osd_spec_file>

Step #1 evacuates PGs from the OSD and removes it from the cluster.
If the disk is broken or the OSD daemon is down, is this evacuation still
going to work?
Is it going to take a while if there is a lot of data on this disk?
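
If it does work, I assume the drain progress can be watched with
something like this (correct me if I have the commands wrong):

    ceph orch osd rm status   # draining OSDs and PGs left to move
    ceph -s                   # misplaced objects while PGs move off the OSD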

After step #3, PGs will be rebalanced/remapped again when the new OSD
joins the cluster.

I think option #1 is preferred when replacing with the same disk model,
while replacing with a different disk model needs option #2.
Am I right? Any comments are welcome.

> > My understanding is that remapping only happens when the OSD is
> > marked out.
> 
> CRUSH topology and rule changes can result in misplaced objects too, but
> that’s a tangent.
> 
> > Replacement process will keep OSD always in, assuming replacing with
> > the same disk model.
> 
> `ceph osd destroy` is your friend.
> 
> > In case of replacing with a different size, it could be more complicated,
> > because the weight has to be adjusted for the size change and PGs may be
> > rebalanced.
> 
> If you replace an OSD with a drive of a different size, and you do so in
> a way such that the CRUSH weight is changed to match, then yes almost
> certainly some PG acting sets will change.
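
(So for a different-size drive, I guess the manual equivalent would be
something like the following, with osd.12 and the weight in TiB being
placeholders:

    ceph osd crush reweight osd.12 3.63869

though I assume cephadm/ceph-volume sets the weight automatically when
the new OSD is created.)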
> 
> 
> >>> The key here is how much time backfilling and rebalancing will take.
> >>> The intention is to not keep the cluster in a degraded state for too long.
> >>> I assume they are similar, because either of them copies the
> >>> same amount of data?
> >>> If that's true, then option #2 is pointless.
> >>> Could anyone share such experiences, like how long it takes to
> >>> recover how much data on what kind of networking/computing env?
> >>
> >> No, option 2 is not pointless; it helps you prevent a degraded state.
> >> With a small cluster or CRUSH rules that only allow a few failed OSDs,
> >> it could be dangerous to take out an entire node, risking another
> >> failure and potential data loss. It highly depends on your specific
> >> setup and whether you're willing to take the risk during a node rebuild.
> 
> Agreed.  Overlapping failures can and do happen.  The flipside is that
> if one lets recovery complete, there has to be enough unused capacity in
> the right places to accommodate new data replicas.
> 
> If, say, the affected cluster is on a different continent and you don’t
> have trustworthy 24x7 remote hands, then it could take some time to
> replace a failed drive or node.  In this case, it likely is advantageous
> to let the cluster recover.
> 
> If however you can get the affected drive / node back faster than
> recovery would take, it can be advantageous to prevent recovery until
> the OSDs are back up.  Either way, Ceph has to create data replicas from
> survivors.  *If* you can replace a drive immediately, then there’s no
> extra risk and you can cut data movement very roughly in half.
> 
> This ties into the `mon_osd_down_out_subtree_limit` setting.  Depending
> on one’s topology, it can prevent a thundering herd of recovery, with
> the idea that it’s often faster to get a node back up than it would be
> to recover all that data.  This also avoids surviving OSDs potentially
> becoming full, but one has to have good monitoring so that this state
> does not continue indefinitely.
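
(If I follow, that would be something like the following, assuming one
wants to protect the host level:

    ceph config set mon mon_osd_down_out_subtree_limit host

so that a whole host going down is not automatically marked out, while
individual OSDs still are.)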
> 
> Basically, any time PGs are undersized, there’s risk of an overlapping
> failure.  The best course is often a question of which strategy will get
> them back to full size.  Remapped PGs aren’t so big a deal, because at
> all times you have the desired number of replicas.
> 
> >> The recovery/backfill speed also depends on the size of the OSDs,
> >> the object sizes, the amount of data, etc.
> 
> It’s also a function of HDD vs SSD, replication vs EC, whether omaps are
> significantly involved, throttle settings, cluster size and topology,
> etc.
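
(For the throttle settings part, I assume these are the usual knobs;
the values are only for illustration:

    ceph config set osd osd_max_backfills 2
    ceph config set osd osd_recovery_max_active 4
    ceph config set osd osd_recovery_sleep_hdd 0.1
)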
> 
> 
> >> You would probably need to search the mailing list for examples from
> >> someone sharing their experience; I haven't captured such
> >> statistics.
> >
> > My conclusion was based on two assumptions, correct me if they are
> > wrong.
> > 1) cluster is degraded during remapping.
> 
> Be careful what you consider “degraded”.
> 
> > 2) no remapping when recovering an OSD.
> >
> > For option #1, no remapping, just a degraded state during recovery.
> > For option #2, remapping happens twice: once to remap PGs from the old
> > OSD to others, and again when the new OSD is in place.
> > It seems the degraded state lasts twice as long with option #2 as with #1.
> > Is that right?
> 
> It can be, depending on your topology.
> 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



