Re: replace osd with Octopus

> I must be seriously missing something :)

Yes. And I think it's time that you actually try it out instead of writing ever-longer e-mails.

If you re-read the e-mail correspondence carefully, you should notice that your follow-up questions have been answered already.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Tony Liu <tonyliu0592@xxxxxxxxxxx>
Sent: 02 December 2020 19:00:18
To: Frank Schilder; Anthony D'Atri
Cc: ceph-users@xxxxxxx
Subject: RE:  Re: replace osd with Octopus

> > A dummy question, what's this all-to-all rebuild/copy?
> > Is that PG remapping when the broken disk is taken out?
>
> - all-to-all: every OSD sends/receives objects to/from every other OSD
> - one-to-all: one OSD sends objects to all other OSDs
> - all-to-one: all other OSDs send objects to one OSD
>
> All-to-all happens if one disk fails and all other OSDs rebuild the
> missing data. This is very fast.

No matter "up" or "down", PG mapping remains when OSD is "in",
PGs will be remapped when OSD is "out". Is that correct?

Since "other OSDs rebuild the missing data", there must be PG
remapping and the failed disk is "out" by either manual or automatic.
Right?
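
For reference, this is roughly how I picture checking it (osd.12 is just
a placeholder id, not an actual OSD in my cluster):

  ceph osd tree                                  # shows up/down and in/out per OSD
  ceph config get mon mon_osd_down_out_interval  # how long a "down" OSD stays "in" before auto-out
  ceph osd out 12                                # mark it "out" manually and trigger remapping
  ceph status                                    # watch degraded/misplaced objects and recovery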

> One-to-all happens when you evacuate a single disk, for example, by
> setting its weight to 0. This is very slow. It is faster to just fail
> the disk and let the data rebuild, however, with the drawback of
> temporarily reduced redundancy.
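
For concreteness, I take the two operations being compared to be roughly
the following (osd.12 is again just a placeholder id):

  ceph osd crush reweight osd.12 0   # evacuate: drain PGs off the disk while it stays up (one-to-all)
  ceph orch daemon stop osd.12       # fail/stop: OSD goes down; after auto-out the rest rebuilds (all-to-all)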

From what I see, the difference between "evacuate a single disk" and
"a disk fails" is the cluster state. When you "evacuate a single disk",
the cluster is healthy because all replicas are available.
When "a disk fails", the cluster is degraded, because one replica is
missing. In terms of PG remapping, it happens either way: I see the
same copying happening in the background, with PGs from the
failed/evacuated disk being copied to other disks. If that's true, why
is there such a dramatic timing difference between the two cases?

Given my understanding above, all-to-all is no different from
one-to-all. In either case, the PGs of one disk are remapped to others.

I must be seriously missing something :)

> All-to-one happens when you add a single disk and all other OSDs send it
> its data. This is also very slow and there is no short-cut.

Adding a new disk will cause PGs to be rebalanced, which takes time.
But when replacing a disk (with the OSD kept "in"), the PG mapping
remains, so there is no rebalance/remapping, just copying the data
back.
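
That is, I expect the replacement to be just the documented flow (the id
and spec file name here are placeholders):

  ceph orch osd rm 12 --replace        # marks the OSD "destroyed" but keeps it "in", PG mapping unchanged
  # ... physically swap the disk ...
  ceph orch apply osd -i osd_spec.yml  # the new disk takes over the OSD id and data is copied back to it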

> Conclusion: design work flows that utilize the all-to-all capability of
> ceph as much as possible. For example, plan the cluster such that
> single-disk operations can be avoided.
>
> > In your case, does "shut the OSD down" mark OSD "out"?
> > "rebuilt to full redundancy" took 2 hours (I assume there was PG
> > remapping.)? What's the disk size?
>
> If you stop an OSD, it will be down and 5 minutes later marked out
> (auto-out). These time-outs can be configured. Size was 12TB (10.7TiB).
> It's NL-SAS drives.
>
> > Regarding your future plan relying on all-to-all copy, "with a large
> > amount of hot spares", I assume you mean a large amount of spare space?
> > What do you do when a disk fails?
> > Just take it out and let the cluster heal itself by remapping PGs from
> > failed disk to spare spaces?
>
> Hot spares means that you deploy 5-10% more disks than you need to
> provide the requested capacity (hot means they are already part of the
> cluster, otherwise they would be called cold spares). Then, if a single
> disk fails, you do nothing, because you still have excess capacity. Only
> after all the 5-10% extra disks have failed will 5-10% disks be added
> again as new. In fact, I would plan it such that this replacement falls
> together with the next capacity extension. Then, you simply do nothing
> when a disk fails - except maybe taking it out and requesting a
> replacement if your contract provides that (put it on a shelf until next
> cluster extension).

Is a hot-spare disk "in" the cluster and allocated PGs?
If yes, what's the difference between a hot-spare disk and a normal disk?

My understanding is to simply keep cluster utilization under a
reasonable threshold so that it can absorb the failure of one or a
couple of disks. Since the cluster will heal itself, there is no rush
to replace a disk when a failure happens, and when the disk is finally
replaced it is the same as adding a new disk. This is my original
option #2. I was just not sure how long the cluster would take to heal
itself. Based on your experience, it's pretty fast: a couple of hours
to rebuild 10T of data.
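
For example, I would keep an eye on the headroom with something like this
(the ratios shown are just the defaults, as an illustration):

  ceph df                            # overall and per-pool utilization
  ceph osd dump | grep -i ratio      # full_ratio 0.95, backfillfull_ratio 0.9, nearfull_ratio 0.85
  ceph osd set-nearfull-ratio 0.85   # adjust where the early-warning threshold should fire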

> Doubling the number of OSDs in a storage extension operation will
> practically result in all-to-all data movement. It's theoretically half-
> to-half, but more than 50% of objects are usually misplaced and there
> will be movement between the original set of OSDs as well. In any case,
> getting such a large number of disks involved that only need to be
> filled up to 50% of the previous capacity will be much more efficient
> (in administrator workload/salary) than doing single-disk replacements
> or tiny extensions.
>
> Ceph is fun if it's big enough :)

Definitely!
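
(On my side I would simply track such an expansion with `ceph status` or
`ceph pg stat`, which report the percentage of misplaced and degraded
objects while the data movement is running.)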

>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Tony Liu <tonyliu0592@xxxxxxxxxxx>
> Sent: 02 December 2020 05:48:10
> To: Frank Schilder; Anthony D'Atri
> Cc: ceph-users@xxxxxxx
> Subject: RE:  Re: replace osd with Octopus
>
> Hi Frank,
>
> A dummy question, what's this all-to-all rebuild/copy?
> Is that PG remapping when the broken disk is taken out?
>
> In your case, does "shut the OSD down" mark OSD "out"?
> "rebuilt to full redundancy" took 2 hours (I assume there was PG
> remapping.)? What's the disk size?
>
> Regarding your future plan relying on all-to-all copy, "with a large
> amount of hot spares", I assume you mean a large amount of spare space?
> What do you do when a disk fails?
> Just take it out and let the cluster heal itself by remapping PGs from
> failed disk to spare spaces?
>
>
> Thanks!
> Tony
> > -----Original Message-----
> > From: Frank Schilder <frans@xxxxxx>
> > Sent: Saturday, November 28, 2020 12:42 AM
> > To: Anthony D'Atri <anthony.datri@xxxxxxxxx>; Tony Liu
> > <tonyliu0592@xxxxxxxxxxx>
> > Cc: ceph-users@xxxxxxx
> > Subject: Re:  Re: replace osd with Octopus
> >
> > Hi all,
> >
> > maybe a further alternative.
> >
> > With our support contract I get exact replacements. I found out that
> > doing an off-line copy of a still readable OSD with ddrescue speeds
> > things up dramatically and avoids extended periods of degraded PGs.
> >
> > Situation and what I did:
> >
> > I had a disk with repeated deep scrub errors and checking with
> > smartctl I could see that it started remapping sectors. This showed up
> > as a PG scrub error. I initiated a full deep scrub of the disk and ran
> > PG repair on every PG that was marked as having errors. This way, ceph
> > rewrites the broken object and the disk writes it to a remapped, that
> > is, healthy sector. Doing this a couple of times will leave you with a
> > disk that is 100% readable.
> >
> > I then shut the OSD down. This led to recovery IO as expected and
> > after less than 2 hours everything was rebuilt to full redundancy (it
> > was probably faster, I only checked after 2 hours). Recovery from
> > single disk fail is very fast due to all-to-all rebuild.
> >
> > In the meantime, I did a full disk copy with ddrescue to a large file
> > system space I have on a copy station. Took 16h for a 12TB drive.
> > Right after this, the replacement arrived and I copied the image back.
> > Another 16h.
> >
> > After this, I simply inserted the new disk with the 5-day-old OSD
> > copy and brought it up (there was a weekend in between). Almost all
> > objects on the drive were still up-to-date and after just 30 minutes
> > all PGs were active and clean. Nothing remapped or misplaced any more.
> >
> > For comparison, I once added a single drive and it took 2 weeks for
> > the affected PGs to be active+clean again. The off-line copy can use
> > much more aggressive and effective IO to a single drive than ceph
> > rebalancing ever would.
> >
> > For single-disk exchange on our service contract I will probably
> > continue with the ddrescue method even though it requires manual
> > action.
> >
> > For the future I plan to adopt a different strategy to utilize the
> > all-to-all copy capability of ceph. Exchanging single disks seems not
> > to be a good way to run ceph. I will rather have a larger amount of
> > disks act as hot spares. For example, having enough capacity that one
> > can tolerate losing 10% of all disks before replacing anything.
> > Adding a large number of disks is overall more effective as it will
> > basically take the same time to get back to health OK as exchanging a
> > single disk.
> >
> > With my timings, this "replace many disks not single ones" will
> > amortise once at least 5-6 drives have failed and are down+out. It will also
> > limit writes to degraded PGs to the shortest interval possible.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
> > Sent: 28 November 2020 05:55:06
> > To: Tony Liu
> > Cc: ceph-users@xxxxxxx
> > Subject:  Re: replace osd with Octopus
> >
> > >>
> > >
> > > Here is the context.
> > > https://docs.ceph.com/en/latest/mgr/orchestrator/#replace-an-osd
> > >
> > > When disk is broken,
> > > 1) orch osd rm <svc_id(s)> --replace [--force]
> > > 2) Replace disk.
> > > 3) ceph orch apply osd -i <osd_spec_file>
> > >
> > > Step #1 marks the OSD "destroyed". I assume it has the same effect as
> > > "ceph osd destroy". And that keeps the OSD "in", so there is no PG
> > > remapping and the cluster is in a "degraded" state.
> > >
> > > After step #3, the OSD will be "up" and "in", and data will be
> > > recovered back to the new disk. Is that right?
> >
> > Yes.
> >
> > > Is cluster "degrade" or "healthy" during such recovery?
> >
> > It will be degraded, because there are fewer copies of some data
> > available than during normal operation.  Clients will continue to
> > access all data.
> >
> > > For another option, the difference is no "--replace" in step #1.
> > > 1) orch osd rm <svc_id(s)> [--force]
> > > 2) Replace disk.
> > > 3) ceph orch apply osd -i <osd_spec_file>
> > >
> > > Step #1 evacuates PGs from the OSD and removes it from the cluster.
> > > If the disk is broken or the OSD daemon is down, is this evacuation
> > > still going to work?
> >
> > Yes, of course - broken drives are the typical reason for removing
> > OSDs.
> >
> > > Is it going to take a while if there is lots data on this disk?
> >
> > Yes, depending on what "a while" means to you, the size of the
> > cluster, whether the pool is replicated or EC, and whether these are
> > HDDs or SSDs.
> >
> > > After step #3, PGs will be rebalanced/remapped again when new OSD
> > > joins the cluster.
> > >
> > > I think that to replace with the same disk model, option #1 is
> > > preferred; to replace with a different disk model, it needs to be
> > > option #2.
> >
> > I haven't tried it under Octopus, but I don't think this is strictly
> > true.  If you replace it with a different model that is approximately
> > the same size, everything will be fine.  Through Luminous and I think
> > Nautilus at least, if you `destroy` and replace with a larger drive,
> > the CRUSH weight of the OSD will still reflect that of the old drive.
> > You could then run `ceph osd crush reweight` after deploying to adjust
> > the size.  You could record the CRUSH weights of all your drive models
> > for initial OSD deploys, or you could `ceph osd tree` and look for
> > another OSD of the same model, and set the CRUSH weight accordingly.
> >
> > If you replace with a smaller drive, your cluster will lose a small
> > amount of usable capacity.  If you replace with a larger drive, the
> > cluster may or may not enjoy a slight increase in capacity - that
> > depends on replication strategy, rack/host weights, etc.
> >
> > My personal philosophy on drive replacements:
> >
> > o Build OSDs with `--dmcrypt` so that you don't have to worry about
> > data if/when you RMA or recycle bad drives.  RMAs are a hassle, so
> > pick a certain value threshold before a drive is worth the effort.
> > This might be in the $250-500 range for example, which means that for
> > many HDDs it isn't worth RMAing them.
> >
> > o If you have an exact replacement, use it
> >
> > o When buying spares, buy the largest size drive you have deployed -
> > or will deploy within the next year or so.  That way you know that
> > your spares can take the place of any drive you have, so you don't
> > have to maintain stock of more than one size. Worst case you don't
> > immediately make good use of that extra capacity, but you may in the
> > future as drives in other failure domains fail and are replaced.  Be
> > careful, though, of mixing drives that are a lot different in size.
> > Mixing 12 and 14 TB drives, even 12 and 16 is usually no big deal, but
> > if you mix say 1TB and 16 TB drives, you can end up exceeding
> > `mon_max_pg_per_osd`.  Which is one reason why I like to increase it
> > from the default value to, say, 400.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



