Hi Frank,

A dumb question: what is this all-to-all rebuild/copy? Is that the PG
remapping that happens when the broken disk is taken out? In your case,
does "shut the OSD down" mark the OSD "out"? The rebuild to full
redundancy took 2 hours (I assume there was PG remapping?) - what is the
disk size?

Regarding your future plan relying on the all-to-all copy, "with a large
amount of hot spares" - I assume you mean a large amount of spare space?
What do you do when a disk fails? Just take it out and let the cluster
heal itself by remapping PGs from the failed disk to the spare space?

Thanks!
Tony

> -----Original Message-----
> From: Frank Schilder <frans@xxxxxx>
> Sent: Saturday, November 28, 2020 12:42 AM
> To: Anthony D'Atri <anthony.datri@xxxxxxxxx>; Tony Liu <tonyliu0592@xxxxxxxxxxx>
> Cc: ceph-users@xxxxxxx
> Subject: Re: Re: replace osd with Octopus
>
> Hi all,
>
> maybe a further alternative.
>
> With our support contract I get exact replacements. I found out that
> doing an off-line copy of a still readable OSD with ddrescue speeds
> things up dramatically and avoids extended periods of degraded PGs.
>
> Situation and what I did:
>
> I had a disk with repeated deep-scrub errors, and checking with
> smartctl I could see that it had started remapping sectors. This
> showed up as a PG scrub error. I initiated a full deep scrub of the
> disk and ran a PG repair on every PG that was marked as having errors.
> This way, ceph rewrites the broken object and the disk writes it to a
> remapped, that is, healthy sector. Doing this a couple of times will
> leave you with a disk that is 100% readable.
>
> I then shut the OSD down. This led to recovery IO as expected, and
> after less than 2 hours everything was rebuilt to full redundancy (it
> was probably faster, I only checked after 2 hours). Recovery from a
> single disk failure is very fast due to the all-to-all rebuild.
>
> In the meantime, I did a full disk copy with ddrescue to a large file
> system space I have on a copy station. It took 16h for a 12TB drive.
> Right after this, the replacement arrived and I copied the image back.
> Another 16h.
>
> After this, I simply inserted the new disk with the 5-day-old OSD copy
> and brought it up (there was a weekend in between). Almost all objects
> on the drive were still up to date, and after just 30 minutes all PGs
> were active and clean. Nothing was remapped or misplaced any more.
>
> For comparison, I once added a single drive and it took 2 weeks for
> the affected PGs to be active+clean again. The off-line copy can use
> much more aggressive and effective IO to a single drive than ceph
> rebalancing ever would.
>
> For single-disk exchange on our service contract I will probably
> continue with the ddrescue method, even though it requires manual
> action.
>
> For the future I plan to adopt a different strategy to utilize the
> all-to-all copy capability of ceph. Exchanging single disks seems not
> to be a good way to run ceph. I will rather have a larger number of
> disks act as hot spares. For example, having enough capacity that one
> can tolerate losing 10% of all disks before replacing anything. Adding
> a large number of disks is overall more effective, as it will
> basically take the same time to get back to health OK as exchanging a
> single disk.
>
> With my timings, this "replace many disks, not single ones" approach
> will amortise once at least 5-6 drives have failed and are down+out.
> It will also limit writes to degraded PGs to the shortest interval
> possible.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
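
For reference, a rough sketch of the off-line copy workflow Frank
describes, assuming a non-containerized deployment where the OSD runs as
a plain systemd unit. The OSD id (123), the device names (/dev/sdX for
the failing disk, /dev/sdY for the replacement) and the copy-station
path are placeholders, not details from Frank's mail:

    # Repair the PGs that reported scrub errors so the broken objects get
    # rewritten to remapped, healthy sectors ('ceph health detail' lists them)
    ceph osd deep-scrub 123
    ceph pg repair <pgid>        # repeat for every inconsistent PG

    # Stop the OSD and let the cluster rebuild to full redundancy
    systemctl stop ceph-osd@123

    # Off-line copy of the failing drive to an image, then onto the new drive
    ddrescue /dev/sdX /copystation/osd-123.img /copystation/osd-123.map
    ddrescue -f /copystation/osd-123.img /dev/sdY /copystation/osd-123-restore.map

    # Swap in the new drive and start the OSD again; only objects written
    # since the shutdown still need to be recovered
    systemctl start ceph-osd@123

On a cephadm-managed cluster the stop/start would go through the
orchestrator instead (e.g. ceph orch daemon stop osd.123).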
>
> ________________________________________
> From: Anthony D'Atri <anthony.datri@xxxxxxxxx>
> Sent: 28 November 2020 05:55:06
> To: Tony Liu
> Cc: ceph-users@xxxxxxx
> Subject: Re: replace osd with Octopus
>
> > Here is the context.
> > https://docs.ceph.com/en/latest/mgr/orchestrator/#replace-an-osd
> >
> > When the disk is broken,
> > 1) orch osd rm <svc_id(s)> --replace [--force]
> > 2) Replace disk.
> > 3) ceph orch apply osd -i <osd_spec_file>
> >
> > Step #1 marks the OSD "destroyed". I assume it has the same effect
> > as "ceph osd destroy", and that keeps the OSD "in", so there is no
> > PG remapping and the cluster is in a "degraded" state.
> >
> > After step #3, the OSD will be "up" and "in", and data will be
> > recovered back to the new disk. Is that right?
>
> Yes.
>
> > Is the cluster "degraded" or "healthy" during such recovery?
>
> It will be degraded, because there are fewer copies of some data
> available than during normal operation. Clients will continue to
> access all data.
>
> > For another option, the difference is no "--replace" in step #1.
> > 1) orch osd rm <svc_id(s)> [--force]
> > 2) Replace disk.
> > 3) ceph orch apply osd -i <osd_spec_file>
> >
> > Step #1 evacuates PGs from the OSD and removes it from the cluster.
> > If the disk is broken or the OSD daemon is down, is this evacuation
> > still going to work?
>
> Yes, of course - broken drives are the typical reason for removing
> OSDs.
>
> > Is it going to take a while if there is lots of data on this disk?
>
> Yes, depending on what "a while" means to you, the size of the
> cluster, whether the pool is replicated or EC, and whether these are
> HDDs or SSDs.
>
> > After step #3, PGs will be rebalanced/remapped again when the new
> > OSD joins the cluster.
> >
> > I think, to replace with the same disk model, option #1 is
> > preferred; to replace with a different disk model, it needs to be
> > option #2.
>
> I haven't tried it under Octopus, but I don't think this is strictly
> true. If you replace it with a different model that is approximately
> the same size, everything will be fine. Through Luminous, and I think
> Nautilus at least, if you `destroy` and replace with a larger drive,
> the CRUSH weight of the OSD will still reflect that of the old drive.
> You could then run `ceph osd crush reweight` after deploying to adjust
> the size. You could record the CRUSH weights of all your drive models
> for initial OSD deploys, or you could `ceph osd tree` and look for
> another OSD of the same model, and set the CRUSH weight accordingly.
>
> If you replace with a smaller drive, your cluster will lose a small
> amount of usable capacity. If you replace with a larger drive, the
> cluster may or may not enjoy a slight increase in capacity - that
> depends on replication strategy, rack/host weights, etc.
>
> My personal philosophy on drive replacements:
>
> o Build OSDs with `-dmcrypt` so that you don't have to worry about
> data if/when you RMA or recycle bad drives. RMAs are a hassle, so pick
> a value threshold below which a drive isn't worth the effort. This
> might be in the $250-500 range, for example, which means that many
> HDDs aren't worth RMAing.
>
> o If you have an exact replacement, use it.
>
> o When buying spares, buy the largest size drive you have deployed -
> or will deploy within the next year or so. That way you know that your
> spares can take the place of any drive you have, so you don't have to
> maintain stock of more than one size.
> Worst case you don't immediately make good use of that extra
> capacity, but you may in the future as drives in other failure domains
> fail and are replaced. Be careful, though, of mixing drives that are a
> lot different in size. Mixing 12 and 14 TB drives, or even 12 and 16,
> is usually no big deal, but if you mix, say, 1 TB and 16 TB drives,
> you can end up exceeding `mon_max_pg_per_osd`, which is one reason why
> I like to increase it from the default value to, say, 400.
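
And for the orchestrator-based replacement discussed in the quoted
thread, a minimal sketch under Octopus. The OSD id (123), the
service_id and the CRUSH weight value are made-up examples, and the
spec file shown in the comments is deliberately simplistic; a real spec
should match your actual hosts and devices:

    # Mark the OSD "destroyed" but keep its id and CRUSH position for reuse
    ceph orch osd rm 123 --replace
    ceph orch osd rm status                 # watch the removal progress

    # osd_spec.yml (minimal example -- match your real hosts/devices):
    #   service_type: osd
    #   service_id: replacement_osds
    #   placement:
    #     host_pattern: '*'
    #   data_devices:
    #     all: true
    #   encrypted: true                     # dm-crypt, per Anthony's suggestion

    # After physically swapping the drive, re-apply the spec so the
    # orchestrator recreates the OSD on the new device
    ceph orch apply osd -i osd_spec.yml

    # If the new drive is larger than the old one, the CRUSH weight may still
    # reflect the old size; adjust it manually (weight in TiB, example value)
    ceph osd crush reweight osd.123 10.91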