Re: replace osd with Octopus

> Here is the context.
> https://docs.ceph.com/en/latest/mgr/orchestrator/#replace-an-osd
> 
> When a disk is broken:
> 1) orch osd rm <svc_id(s)> --replace [--force]
> 2) Replace disk.
> 3) ceph orch apply osd -i <osd_spec_file>
> 
> Step #1 marks the OSD "destroyed". I assume it has the same effect as
> "ceph osd destroy": the OSD stays "in", there is no PG remapping, and
> the cluster is in a "degraded" state.
> 
> After step #3, the OSD will be "up" and "in", and data will be recovered
> back to the new disk. Is that right?

Yes.

> Is the cluster "degraded" or "healthy" during such recovery?

It will be degraded, because there are fewer copies of some data available than during normal operation.  Clients will continue to access all data.
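
For reference, with cephadm the replace flow looks roughly like this.  The OSD id (12) and the spec contents are placeholders, so adjust to taste:

    # 1) Mark the failed OSD "destroyed" but keep its id / CRUSH entry for reuse
    ceph orch osd rm 12 --replace

    # 2) Swap the physical drive, then confirm the OSD shows as destroyed
    ceph osd tree

    # 3) Re-apply the OSD service spec so the orchestrator recreates the OSD
    ceph orch apply osd -i osd_spec.yml

where osd_spec.yml might be as simple as:

    service_type: osd
    service_id: default_drive_group
    placement:
      host_pattern: '*'
    data_devices:
      all: true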

> For the other option, the only difference is no "--replace" in step #1.
> 1) orch osd rm <svc_id(s)> [--force]
> 2) Replace disk.
> 3) ceph orch apply osd -i <osd_spec_file>
> 
> Step #1 evacuates PGs from the OSD and removes it from the cluster.
> If the disk is broken or the OSD daemon is down, is this evacuation
> still going to work?

Yes, of course — broken drives are the typical reason for removing OSDs.
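
The flow without --replace is similar, again with a placeholder OSD id:

    # Drain OSD 12 (its PGs are backfilled onto other OSDs), then remove it entirely
    ceph orch osd rm 12

    # Watch the drain / removal progress
    ceph orch osd rm status

    # After swapping the drive, re-apply the spec (or let the existing spec pick it up)
    ceph orch apply osd -i osd_spec.yml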

> Is it going to take a while if there is lots of data on this disk?

Yes, depending on what “a while” means to you, the size of the cluster, whether the pool is replicated or EC, and whether these are HDDs or SSDs.
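
If you want a feel for progress while it runs, the usual views are enough:

    # Recovery / backfill throughput and the count of degraded or misplaced objects
    ceph -s

    # Per-PG summary: how many PGs are degraded, backfilling, etc.
    ceph pg stat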

> After step #3, PGs will be rebalanced/remapped again when the new OSD
> joins the cluster.
> 
> I think that to replace with the same disk model, option #1 is preferred;
> to replace with a different disk model, it needs to be option #2.

I haven’t tried it under Octopus, but I don’t think this is strictly true.  If you replace it with a different model that is approximately the same size, everything will be fine.  Through Luminous and I think Nautilus at least, if you `destroy` and replace with a larger drive, the CRUSH weight of the OSD will still reflect that of the old drive.  You could then run `ceph osd crush reweight` after deploying to adjust the size.  You could record the CRUSH weights of all your drive models for initial OSD deploys, or you could `ceph osd tree` and look for another OSD of the same model, and set the CRUSH weight accordingly.
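
As a sketch of that last point, with a made-up OSD id and weight:

    # See what CRUSH weight an existing OSD of the same model carries
    ceph osd tree

    # Set the new OSD's CRUSH weight to match its real capacity (roughly its size in TiB)
    ceph osd crush reweight osd.12 10.91409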

If you replace with a smaller drive, your cluster will lose a small amount of usable capacity.  If you replace with a larger drive, the cluster may or may not enjoy a slight increase in capacity — that depends on replication strategy, rack/host weights, etc.

My personal philosophy on drive replacements:

o Build OSDs with `--dmcrypt` so that you don't have to worry about data if/when you RMA or recycle bad drives.  RMAs are a hassle, so pick a value threshold below which a drive isn't worth the effort.  This might be in the $250-500 range, for example, which means that for many HDDs it isn't worth RMAing them.  (There is a rough sketch of enabling encryption after this list.)

o If you have an exact replacement, use it.

o When buying spares, buy the largest drive size you have deployed, or will deploy within the next year or so.  That way you know that your spares can take the place of any drive you have, so you don't have to maintain stock of more than one size.  Worst case you don't immediately make good use of the extra capacity, but you may in the future as drives in other failure domains fail and are replaced.  Be careful, though, of mixing drives that differ a lot in size.  Mixing 12 TB and 14 TB drives, or even 12 and 16, is usually no big deal, but if you mix, say, 1 TB and 16 TB drives, you can end up exceeding `mon_max_pg_per_osd`, which is one reason I like to increase it from the default value to, say, 400 (see the example after this list).
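
To make those last two points concrete, here is a rough sketch.  The spec fields, device path, and values are illustrative, and the exact drivegroup syntax can vary by release, so check the docs for your version:

    # OSD spec asking the orchestrator to create encrypted (dmcrypt) OSDs
    service_type: osd
    service_id: encrypted_osds
    placement:
      host_pattern: '*'
    data_devices:
      all: true
    encrypted: true

    # Or, when creating an OSD by hand with ceph-volume (device path is a placeholder)
    ceph-volume lvm create --data /dev/sdX --dmcrypt

    # Raise the per-OSD PG limit if mixed drive sizes push the largest OSDs past the default
    ceph config set global mon_max_pg_per_osd 400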