Hi Frank,

A dumb question: what is this all-to-all rebuild/copy? Is that the PG
remapping that happens when the broken disk is taken out? In your case,
does "shut the OSD down" mark the OSD "out"? The rebuild to full
redundancy took 2 hours (I assume there was PG remapping?) - what is the
disk size?

Regarding your future plan relying on the all-to-all copy, "with a large
amount of hot spares" - I assume you mean a large amount of spare space?
What do you do when a disk fails? Just take it out and let the cluster
heal itself by remapping PGs from the failed disk to the spare space?

Thanks!
Tony

> -----Original Message-----
> From: Frank Schilder <frans@xxxxxx>
> Sent: Saturday, November 28, 2020 12:42 AM
> To: Anthony D'Atri <anthony.datri@xxxxxxxxx>; Tony Liu <tonyliu0592@xxxxxxxxxxx>
> Cc: ceph-users@xxxxxxx
> Subject: Re: Re: replace osd with Octopus
>
> Hi all,
>
> maybe a further alternative.
>
> With our support contract I get exact replacements. I found out that
> doing an off-line copy of a still readable OSD with ddrescue speeds
> things up dramatically and avoids extended periods of degraded PGs.
>
> Situation and what I did:
>
> I had a disk with repeated deep-scrub errors, and checking with
> smartctl I could see that it had started remapping sectors. This
> showed up as a PG scrub error. I initiated a full deep scrub of the
> disk and ran a PG repair on every PG that was marked as having errors.
> This way, ceph rewrites the broken object and the disk writes it to a
> remapped, that is, healthy sector. Doing this a couple of times will
> leave you with a disk that is 100% readable.
>
> I then shut the OSD down. This led to recovery IO as expected, and
> after less than 2 hours everything was rebuilt to full redundancy (it
> was probably faster, I only checked after 2 hours). Recovery from a
> single disk failure is very fast due to the all-to-all rebuild.
>
> In the meantime, I did a full disk copy with ddrescue to a large file
> system space I have on a copy station. It took 16h for a 12TB drive.
> Right after this, the replacement arrived and I copied the image back.
> Another 16h.
>
> After this, I simply inserted the new disk with the 5-day-old OSD copy
> and brought it up (there was a weekend in between). Almost all objects
> on the drive were still up to date, and after just 30 minutes all PGs
> were active and clean. Nothing was remapped or misplaced any more.
>
> For comparison, I once added a single drive and it took 2 weeks for
> the affected PGs to be active+clean again. The off-line copy can use
> much more aggressive and effective IO to a single drive than ceph
> rebalancing ever would.
>
> For single-disk exchange on our service contract I will probably
> continue with the ddrescue method, even though it requires manual
> action.
>
> For the future I plan to adopt a different strategy to utilize the
> all-to-all copy capability of ceph. Exchanging single disks seems not
> to be a good way to run ceph. I will rather have a larger number of
> disks act as hot spares. For example, having enough capacity that one
> can tolerate losing 10% of all disks before replacing anything. Adding
> a large number of disks is overall more effective, as it will
> basically take the same time to get back to health OK as exchanging a
> single disk.
>
> With my timings, this "replace many disks, not single ones" approach
> will amortise once at least 5-6 drives have failed and are down+out.
> It will also limit writes to degraded PGs to the shortest interval
> possible.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
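
For reference, a rough sketch of the off-line copy workflow Frank
describes, assuming a non-containerized deployment where the OSD runs as
a plain systemd unit. The OSD id (123), the device names (/dev/sdX for
the failing disk, /dev/sdY for the replacement) and the copy-station
path are placeholders, not details from Frank's mail:

    # Repair the PGs that reported scrub errors so the broken objects get
    # rewritten to remapped, healthy sectors ('ceph health detail' lists them)
    ceph osd deep-scrub 123
    ceph pg repair <pgid>        # repeat for every inconsistent PG

    # Stop the OSD and let the cluster rebuild to full redundancy
    systemctl stop ceph-osd@123

    # Off-line copy of the failing drive to an image, then onto the new drive
    ddrescue /dev/sdX /copystation/osd-123.img /copystation/osd-123.map
    ddrescue -f /copystation/osd-123.img /dev/sdY /copystation/osd-123-restore.map

    # Swap in the new drive and start the OSD again; only objects written
    # since the shutdown still need to be recovered
    systemctl start ceph-osd@123

On a cephadm-managed cluster the stop/start would go through the
orchestrator instead (e.g. ceph orch daemon stop osd.123).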
>
> ________________________________________
> From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
> Sent: 28 November 2020 05:55:06
> To: Tony Liu
> Cc: ceph-users@xxxxxxx
> Subject: Re: replace osd with Octopus
>
> > Here is the context.
> > https://docs.ceph.com/en/latest/mgr/orchestrator/#replace-an-osd
> >
> > When the disk is broken,
> > 1) orch osd rm <svc_id(s)> --replace [--force]
> > 2) Replace disk.
> > 3) ceph orch apply osd -i <osd_spec_file>
> >
> > Step #1 marks the OSD "destroyed". I assume it has the same effect
> > as "ceph osd destroy", and that keeps the OSD "in", so there is no
> > PG remapping and the cluster is in a "degraded" state.
> >
> > After step #3, the OSD will be "up" and "in", and data will be
> > recovered back to the new disk. Is that right?
>
> Yes.
>
> > Is the cluster "degraded" or "healthy" during such recovery?
>
> It will be degraded, because there are fewer copies of some data
> available than during normal operation. Clients will continue to
> access all data.
>
> > For another option, the difference is no "--replace" in step #1.
> > 1) orch osd rm <svc_id(s)> [--force]
> > 2) Replace disk.
> > 3) ceph orch apply osd -i <osd_spec_file>
> >
> > Step #1 evacuates PGs from the OSD and removes it from the cluster.
> > If the disk is broken or the OSD daemon is down, is this evacuation
> > still going to work?
>
> Yes, of course - broken drives are the typical reason for removing
> OSDs.
>
> > Is it going to take a while if there is lots of data on this disk?
>
> Yes, depending on what "a while" means to you, the size of the
> cluster, whether the pool is replicated or EC, and whether these are
> HDDs or SSDs.
>
> > After step #3, PGs will be rebalanced/remapped again when the new
> > OSD joins the cluster.
> >
> > I think, to replace with the same disk model, option #1 is
> > preferred; to replace with a different disk model, it needs to be
> > option #2.
>
> I haven't tried it under Octopus, but I don't think this is strictly
> true. If you replace it with a different model that is approximately
> the same size, everything will be fine. Through Luminous, and I think
> Nautilus at least, if you `destroy` and replace with a larger drive,
> the CRUSH weight of the OSD will still reflect that of the old drive.
> You could then run `ceph osd crush reweight` after deploying to adjust
> the size. You could record the CRUSH weights of all your drive models
> for initial OSD deploys, or you could `ceph osd tree` and look for
> another OSD of the same model, and set the CRUSH weight accordingly.
>
> If you replace with a smaller drive, your cluster will lose a small
> amount of usable capacity. If you replace with a larger drive, the
> cluster may or may not enjoy a slight increase in capacity - that
> depends on replication strategy, rack/host weights, etc.
>
> My personal philosophy on drive replacements:
>
> o Build OSDs with `-dmcrypt` so that you don't have to worry about
> data if/when you RMA or recycle bad drives. RMAs are a hassle, so pick
> a value threshold below which a drive isn't worth the effort. This
> might be in the $250-500 range, for example, which means that many
> HDDs aren't worth RMAing.
>
> o If you have an exact replacement, use it.
>
> o When buying spares, buy the largest size drive you have deployed -
> or will deploy within the next year or so. That way you know that your
> spares can take the place of any drive you have, so you don't have to
> maintain stock of more than one size.
> Worst case you don't immediately make good use of that extra
> capacity, but you may in the future as drives in other failure domains
> fail and are replaced. Be careful, though, of mixing drives that are a
> lot different in size. Mixing 12 and 14 TB drives, or even 12 and 16,
> is usually no big deal, but if you mix, say, 1 TB and 16 TB drives,
> you can end up exceeding `mon_max_pg_per_osd`, which is one reason why
> I like to increase it from the default value to, say, 400.
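
And for the orchestrator-based replacement discussed in the quoted
thread, a minimal sketch under Octopus. The OSD id (123), the
service_id and the CRUSH weight value are made-up examples, and the
spec file shown in the comments is deliberately simplistic; a real spec
should match your actual hosts and devices:

    # Mark the OSD "destroyed" but keep its id and CRUSH position for reuse
    ceph orch osd rm 123 --replace
    ceph orch osd rm status                 # watch the removal progress

    # osd_spec.yml (minimal example -- match your real hosts/devices):
    #   service_type: osd
    #   service_id: replacement_osds
    #   placement:
    #     host_pattern: '*'
    #   data_devices:
    #     all: true
    #   encrypted: true                     # dm-crypt, per Anthony's suggestion

    # After physically swapping the drive, re-apply the spec so the
    # orchestrator recreates the OSD on the new device
    ceph orch apply osd -i osd_spec.yml

    # If the new drive is larger than the old one, the CRUSH weight may still
    # reflect the old size; adjust it manually (weight in TiB, example value)
    ceph osd crush reweight osd.123 10.91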