Re: replace osd with Octopus

> A dummy question, what's this all-to-all rebuild/copy?
> Is that PG remapping when the broken disk is taken out?

- all-to-all: every OSD sends/receives objects to/from every other OSD
- one-to-all: one OSD sends objects to all other OSDs
- all-to-one: all other OSDs send objects to one OSD

All-to-all happens if one disk fails and all other OSDs rebuild the missing data. This is very fast.

One-to-all happens when you evacuate a single disk, for example by setting its weight to 0. This is very slow. It is faster to just fail the disk and let the data rebuild, at the cost of temporarily reduced redundancy.

All-to-one happens when you add a single disk and all other OSDs send it its share of the data. This is also very slow and there is no shortcut.

Conclusion: design workflows that utilize the all-to-all capability of Ceph as much as possible. For example, plan the cluster such that single-disk operations can be avoided.
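
To make the difference concrete, here is a minimal command sketch (osd.12 is just a placeholder id, and the systemd unit name depends on how the OSDs were deployed). Failing the OSD triggers the fast all-to-all rebuild, while draining it triggers the slow one-to-all evacuation:

  # fast path: fail the OSD, all remaining OSDs rebuild its PGs in parallel
  systemctl stop ceph-osd@12      # or the disk simply dies
  ceph osd out 12                 # or wait for the auto-out timeout

  # slow path: evacuate the OSD while it is still up (one-to-all)
  ceph osd crush reweight osd.12 0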

> In your case, does "shut the OSD down" mark OSD "out"?
> "rebuilt to full redundancy" took 2 hours (I assume there was
> PG remapping.)? What's the disk size?

If you stop an OSD, it will be marked down and 5 minutes later marked out (auto-out). These time-outs can be configured. The disk size was 12 TB (10.7 TiB); they are NL-SAS drives.
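
In case you want to check or change that timeout: the relevant option is mon_osd_down_out_interval (in seconds). A minimal sketch, assuming a release with the centralized config database:

  ceph config get mon mon_osd_down_out_interval       # current down -> out delay
  ceph config set mon mon_osd_down_out_interval 300   # e.g. 5 minutes
  ceph osd set noout                                   # suppress auto-out during maintenance
  ceph osd unset noout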

> Regarding your future plan relying on all-to-all copy,
> "with large amount of hot spares", I assume you mean large
> amount of spare spaces? What do you do when a disk fails?
> Just take it out and let the cluster heal itself by remapping
> PGs from failed disk to spare spaces?

Hot spares means that you deploy 5-10% more disks than you need to provide the requested capacity (hot means they are already part of the cluster, otherwise they would be called cold spares). Then, if a single disk fails, you do nothing, because you still have excess capacity. Only after all of the 5-10% extra disks have failed are new disks added again. In fact, I would plan it such that this replacement coincides with the next capacity extension. So when a disk fails you simply do nothing - except maybe taking it out and requesting a replacement if your support contract covers that (put it on a shelf until the next cluster extension).
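
A small worked example with made-up numbers: if 100 OSDs provide the capacity you promised your users, deploy 108-110 instead. The first 8-10 failures cost you nothing except an automatic rebuild each; only when you run out of that margin - or at the next planned capacity extension, whichever comes first - do you buy and add a new batch of disks.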

Doubling the number of OSDs in a storage extension operation will in practice result in all-to-all data movement. It is theoretically half-to-half, but usually more than 50% of objects are misplaced and there will be movement between the original set of OSDs as well. In any case, getting such a large number of disks involved, each of which only needs to be filled up to 50% of the previous capacity, will be much more efficient (in administrator workload/salary) than doing single-disk replacements or tiny extensions.
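
During such a large extension it helps to watch the misplaced-object percentage and, if clients suffer, throttle backfill. A sketch, assuming the config database is in use:

  ceph -s                                    # shows the percentage of misplaced objects
  ceph osd df tree                           # per-OSD fill level while data moves
  ceph config set osd osd_max_backfills 1    # throttle; raise again when client IO is fine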

Ceph is fun if it's big enough :)

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Tony Liu <tonyliu0592@xxxxxxxxxxx>
Sent: 02 December 2020 05:48:10
To: Frank Schilder; Anthony D'Atri
Cc: ceph-users@xxxxxxx
Subject: RE:  Re: replace osd with Octopus

Hi Frank,

A dummy question, what's this all-to-all rebuild/copy?
Is that PG remapping when the broken disk is taken out?

In your case, does "shut the OSD down" mark OSD "out"?
"rebuilt to full redundancy" took 2 hours (I assume there was
PG remapping.)? What's the disk size?

Regarding your future plan relying on all-to-all copy,
"with large amount of hot spares", I assume you mean large
amount of spare spaces? What do you do when a disk fails?
Just take it out and let the cluster heal itself by remapping
PGs from failed disk to spare spaces?


Thanks!
Tony
> -----Original Message-----
> From: Frank Schilder <frans@xxxxxx>
> Sent: Saturday, November 28, 2020 12:42 AM
> To: Anthony D'Atri <anthony.datri@xxxxxxxxx>; Tony Liu
> <tonyliu0592@xxxxxxxxxxx>
> Cc: ceph-users@xxxxxxx
> Subject: Re:  Re: replace osd with Octopus
>
> Hi all,
>
> maybe a further alternative.
>
> With our support contract I get exact replacements. I found out that
> doing an off-line copy of a still readable OSD with ddrescue speeds
> things up dramatically and avoids extended periods of degraded PGs.
>
> Situation and what I did:
>
> I had a disk with repeated deep scrub errors and checking with smartctl
> I could see that it started remapping sectors. This showed up as a PG
> scrub error. I initiated a full deep scrub of the disk and ran PG repair
> on every PG that was marked as having errors. This way, ceph rewrites
> the broken object and the disk writes it to a remapped, that is, healthy
> sector. Doing this a couple of times will leave you with a disk that is
> 100% readable.
>
> I then shut the OSD down. This led to recovery IO as expected and after
> less than 2 hours everything was rebuilt to full redundancy (it was
> probably faster, I only checked after 2 hours). Recovery from single
> disk fail is very fast due to all-to-all rebuild.
>
> In the mean time, I did a full disk copy with ddrescue to a large file
> system space I have on a copy station. Took 16h for a 12TB drive. Right
> after this, the replacement arrived and I copied the image back. Another
> 16h.
>
> After this, I simply inserted the new disk with the 5 days old OSD copy
> and brought it up (there was a weekend in between). Almost all objects
> on the drive were still up-to-date and after just 30 minutes all PGs
> were active and clean. Nothing remapped or misplaced any more.
>
> For comparison, I once added a single drive and it took 2 weeks for the
> affected PGs to be active+clean again. The off-line copy can use much
> more aggressive and effective IO to a single drive than ceph rebalancing
> ever would.
>
> For single-disk exchange on our service contract I will probably
> continue with the ddrescue method even though it requires manual action.
>
> For the future I plan to adopt a different strategy to utilize the
> all-to-all copy capability of ceph. Exchanging single disks seems not
> to be a good way to run ceph. I will rather have a larger number of
> disks act as hot spares. For example, having enough capacity that one
> can tolerate losing 10% of all disks before replacing anything. Adding
> a large number of disks is overall more effective as it will basically
> take the same time to get back to health OK as exchanging a single disk.
>
> With my timings, this "replace many disks not single ones" will amortise
> if at least 5-6 drives failed and are down+out. It will also limit
> writes to degraded PGs to the shortest interval possible.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
> Sent: 28 November 2020 05:55:06
> To: Tony Liu
> Cc: ceph-users@xxxxxxx
> Subject:  Re: replace osd with Octopus
>
> >>
> >
> > Here is the context.
> > https://docs.ceph.com/en/latest/mgr/orchestrator/#replace-an-osd
> >
> > When disk is broken,
> > 1) orch osd rm <svc_id(s)> --replace [--force]
> > 2) Replace disk.
> > 3) ceph orch apply osd -i <osd_spec_file>
> >
> > Step #1 marks OSD "destroyed". I assume it has the same effect as
> > "ceph osd destroy". And that keeps OSD "in", no PG remapping and
> > cluster is in "degrade" state.
> >
> > After step #3, OSD will be "up" and "in", data will be recovered back
> > to new disk. Is that right?
>
> Yes.
>
> > Is cluster "degrade" or "healthy" during such recovery?
>
> It will be degraded, because there are fewer copies of some data
> available than during normal operation.  Clients will continue to access
> all data.
>
> > For another option, the difference is no "--replace" in step #1.
> > 1) orch osd rm <svc_id(s)> [--force]
> > 2) Replace disk.
> > 3) ceph orch apply osd -i <osd_spec_file>
> >
> > Step #1 evacuates PGs from OSD and removes it from cluster.
> > If disk is broken or OSD daemon is down, is this evacuation still
> > going to work?
>
> Yes, of course - broken drives are the typical reason for removing OSDs.
>
> > Is it going to take a while if there is lots data on this disk?
>
> Yes, depending on what "a while" means to you, the size of the cluster,
> whether the pool is replicated or EC, and whether these are HDDs or SSDs.
>
> > After step #3, PGs will be rebalanced/remapped again when new OSD
> > joins the cluster.
> >
> > I think, to replace with the same disk model, option #1 is preferred,
> > to replace with different disk model, it needs to be option #2.
>
> I haven't tried it under Octopus, but I don't think this is strictly
> true.  If you replace it with a different model that is approximately
> the same size, everything will be fine.  Through Luminous and I think
> Nautilus at least, if you `destroy` and replace with a larger drive, the
> CRUSH weight of the OSD will still reflect that of the old drive.  You
> could then run `ceph osd crush reweight` after deploying to adjust the
> size.  You could record the CRUSH weights of all your drive models for
> initial OSD deploys, or you could `ceph osd tree` and look for another
> OSD of the same model, and set the CRUSH weight accordingly.
>
> If you replace with a smaller drive, your cluster will lose a small
> amount of usable capacity.  If you replace with a larger drive, the
> cluster may or may not enjoy a slight increase in capacity - that
> depends on replication strategy, rack/host weights, etc.
>
> My personal philosophy on drive replacements:
>
> o Build OSDs with `--dmcrypt` so that you don't have to worry about data
> if/when you RMA or recycle bad drives.  RMAs are a hassle, so pick a
> certain value threshold before a drive is worth the effort.  This might
> be in the $250-500 range for example, which means that for many HDDs it
> isn't worth RMAing them.
>
> o If you have an exact replacement, use it
>
> o When buying spares, buy the largest size drive you have deployed - or
> will deploy within the next year or so.  That way you know that your
> spares can take the place of any drive you have, so you don't have to
> maintain stock of more than one size. Worst case you don't immediately
> make good use of that extra capacity, but you may in the future as
> drives in other failure domains fail and are replaced.  Be careful,
> though, of mixing drives that are a lot different in size.  Mixing 12 and
> 14 TB drives, or even 12 and 16, is usually no big deal, but if you mix say 1TB
> and 16 TB drives, you can end up exceeding `mon_max_pg_per_osd`.  Which
> is one reason why I like to increase it from the default value to, say,
> 400.
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an
> email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



