Re: replace osd with Octopus

Hi all,

Maybe a further alternative:

With our support contract I get exact replacements. I found out that doing an off-line copy of a still readable OSD with ddrescue speeds things up dramatically and avoids extended periods of degraded PGs.

Situation and what I did:

I had a disk with repeated deep-scrub errors, and checking with smartctl I could see that it had started remapping sectors. These showed up as PG scrub errors. I initiated a full deep scrub of the disk and ran PG repair on every PG that was marked as having errors. This way, ceph rewrites the broken object and the disk writes it to a remapped, that is, healthy sector. Doing this a couple of times will leave you with a disk that is 100% readable.
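
Roughly what that looked like in commands (the OSD ID and device name here are just examples):

    # check the SMART status of the suspect drive
    smartctl -a /dev/sdk

    # deep-scrub all PGs on the suspect OSD
    ceph osd deep-scrub osd.57

    # list the PGs flagged inconsistent and repair them one by one
    ceph health detail | grep inconsistent
    ceph pg repair <pgid>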

I then shut the OSD down. This led to recovery IO as expected, and after less than 2 hours everything was rebuilt to full redundancy (it was probably faster, I only checked after 2 hours). Recovery from a single-disk failure is very fast due to the all-to-all rebuild.
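
The shutdown itself was nothing more than stopping the daemon and watching recovery; something like (again, example IDs):

    # cephadm-managed cluster
    ceph orch daemon stop osd.57
    # or, with plain systemd units
    systemctl stop ceph-osd@57

    # after mon_osd_down_out_interval (10 min by default) the OSD is
    # marked out and recovery starts
    ceph -s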

In the meantime, I did a full disk copy with ddrescue to a large file system space I have on a copy station. It took 16h for a 12TB drive. Right after this, the replacement arrived and I copied the image back. Another 16h.
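
The ddrescue invocations were essentially just the following (device names and paths are examples; the map file lets you interrupt and resume the copy):

    # failing disk -> image file
    ddrescue /dev/sdk /copystation/osd-57.img /copystation/osd-57.map

    # later: image file -> replacement disk
    # (-f is needed because the output is a block device)
    ddrescue -f /copystation/osd-57.img /dev/sdk /copystation/osd-57-restore.map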

After this, I simply inserted the new disk with the 5-day-old OSD copy and brought it up (there was a weekend in between). Almost all objects on the drive were still up to date, and after just 30 minutes all PGs were active and clean. Nothing was remapped or misplaced any more.
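
Bringing it back was just starting the daemon again (the OSD had only been auto-marked out, so it should come back in on its own):

    ceph orch daemon start osd.57
    # or
    systemctl start ceph-osd@57

    # only the objects written while the OSD was down get recovered
    ceph -s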

For comparison, I once added a single drive and it took 2 weeks for the affected PGs to be active+clean again. The off-line copy can use much more aggressive and effective IO to a single drive than ceph rebalancing ever would.

For single-disk exchange on our service contract I will probably continue with the ddrescue method even though it requires manual action.

For the future, I plan to adopt a different strategy to utilize the all-to-all copy capability of ceph. Exchanging single disks does not seem to be a good way to run ceph. I will rather have a larger number of disks act as hot spares, for example, having enough capacity that one can tolerate losing 10% of all disks before replacing anything. Adding a large number of disks is overall more effective, as it will take basically the same time to get back to HEALTH_OK as exchanging a single disk.

With my timings, this "replace many disks, not single ones" approach will amortise once at least 5-6 drives have failed and are down+out. It will also limit writes to degraded PGs to the shortest interval possible.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
Sent: 28 November 2020 05:55:06
To: Tony Liu
Cc: ceph-users@xxxxxxx
Subject:  Re: replace osd with Octopus

> Here is the context.
> https://docs.ceph.com/en/latest/mgr/orchestrator/#replace-an-osd
>
> When a disk is broken:
> 1) orch osd rm <svc_id(s)> --replace [--force]
> 2) Replace the disk.
> 3) ceph orch apply osd -i <osd_spec_file>
>
> Step #1 marks the OSD "destroyed". I assume it has the same effect as
> "ceph osd destroy", and that keeps the OSD "in", so there is no PG
> remapping and the cluster is in a "degraded" state.
>
> After step #3, the OSD will be "up" and "in", and data will be recovered
> back to the new disk. Is that right?

Yes.
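
With made-up IDs, that procedure looks roughly like:

    ceph orch osd rm 7 --replace       # marks osd.7 destroyed, keeps its CRUSH position
    # physically swap the drive, then re-apply a spec that matches it
    ceph orch apply osd -i osd_spec.yml
    # the replacement comes up as osd.7 again and backfill restores its data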

> Is the cluster "degraded" or "healthy" during such recovery?

It will be degraded, because there are fewer copies of some data available than during normal operation.  Clients will continue to access all data.

> For the other option, the difference is that there is no "--replace" in step #1.
> 1) orch osd rm <svc_id(s)> [--force]
> 2) Replace disk.
> 3) ceph orch apply osd -i <osd_spec_file>
>
> Step #1 evacuates PGs from the OSD and removes it from the cluster.
> If the disk is broken or the OSD daemon is down, is this evacuation
> still going to work?

Yes, of course — broken drives are the typical reason for removing OSDs.

> Is it going to take a while if there is lots of data on this disk?

Yes, depending on what “a while” means to you, the size of the cluster, whether the pool is replicated or EC, and whether these are HDDs or SSDs.

> After step #3, PGs will be rebalanced/remapped again when the new OSD
> joins the cluster.
>
> I think that to replace with the same disk model, option #1 is preferred;
> to replace with a different disk model, it needs to be option #2.

I haven’t tried it under Octopus, but I don’t think this is strictly true.  If you replace it with a different model that is approximately the same size, everything will be fine.  Through Luminous and I think Nautilus at least, if you `destroy` and replace with a larger drive, the CRUSH weight of the OSD will still reflect that of the old drive.  You could then run `ceph osd crush reweight` after deploying to adjust the size.  You could record the CRUSH weights of all your drive models for initial OSD deploys, or you could `ceph osd tree` and look for another OSD of the same model, and set the CRUSH weight accordingly.
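
For example (made-up OSD ID; CRUSH weights are in TiB, so a 12 TB drive comes out to roughly 10.91):

    ceph osd tree | grep osd.42        # check the weight the replacement came up with
    ceph osd crush reweight osd.42 10.91409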

If you replace with a smaller drive, your cluster will lose a small amount of usable capacity.  If you replace with a larger drive, the cluster may or may not enjoy a slight increase in capacity — that depends on replication strategy, rack/host weights, etc.

My personal philosophy on drive replacements:

o Build OSDs with `--dmcrypt` so that you don't have to worry about data if/when you RMA or recycle bad drives (see the spec sketch after this list).  RMAs are a hassle, so pick a value threshold below which a drive isn't worth the effort.  This might be in the $250-500 range, for example, which means that many HDDs aren't worth RMAing.

o If you have an exact replacement, use it

o When buying spares, buy the largest size drive you have deployed, or will deploy within the next year or so.  That way you know that your spares can take the place of any drive you have, so you don't have to maintain stock of more than one size.  Worst case you don't immediately make good use of the extra capacity, but you may in the future as drives in other failure domains fail and are replaced.  Be careful, though, of mixing drives that are a lot different in size.  Mixing 12 and 14 TB drives, or even 12 and 16, is usually no big deal, but if you mix, say, 1 TB and 16 TB drives, you can end up exceeding `mon_max_pg_per_osd`, which is one reason why I like to increase it from the default value to, say, 400 (example below).
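
A rough sketch of both of those, assuming cephadm on Octopus (service ID, host pattern and values are just examples):

    # OSD spec with encryption enabled (the spec-file counterpart of --dmcrypt)
    service_type: osd
    service_id: hdd_encrypted
    placement:
      host_pattern: '*'
    data_devices:
      rotational: 1
    encrypted: true

    # raise the PG-per-OSD ceiling from the default
    ceph config set global mon_max_pg_per_osd 400
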
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx