Hello,

On Mon, 20 Feb 2017 14:12:52 -0800 Gregory Farnum wrote:

> On Sat, Feb 18, 2017 at 12:39 AM, Nick Fisk <nick at fisk.me.uk> wrote:
> > From what I understand in Jewel+ Ceph has the concept of an authoritative
> > shard, so in the case of a 3x replica pool it will notice that 2 replicas
> > match and one doesn't and use one of the good replicas. However, in a 2x
> > pool you're out of luck.
> >
> > However, if someone could confirm my suspicions that would be good as well.
>
> Hmm, I went digging in and sadly this isn't quite right. The code has
> a lot of internal plumbing to allow more smarts than were previously
> feasible and the erasure-coded pools make use of them for noticing
> stuff like local corruption. Replicated pools make an attempt but it's
> not as reliable as one would like and it still doesn't involve any
> kind of voting mechanism.

I seem to recall a lot of that plumbing going in/being talked about but
never going into full action; good to know that I didn't miss anything and
that my memory is still reliable. ^o^

> A self-inconsistent replicated primary won't get chosen. A primary is
> self-inconsistent when its digest doesn't match the data, which
> happens when:
> 1) the object hasn't been written since it was last scrubbed, or
> 2) the object was written in full, or
> 3) the object has only been appended to since the last time its digest
> was recorded, or
> 4) something has gone terribly wrong in/under LevelDB and the omap
> entries don't match what the digest says should be there.
>
> David knows more and can correct me if I'm missing something. He's also
> working on interfaces for scrub that are more friendly in general and
> allow administrators to make more fine-grained decisions about
> recovery in ways that cooperate with RADOS.
>

That is certainly appreciated, especially if it gets backported to versions
where people are stuck with FS-based OSDs.
However, I presume that the main goal and focus is still BlueStore with live
internal checksums that make scrubbing obsolete, right?
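For what it's worth, the "two out of three wins" heuristic that Tracy
describes further down in the quoted thread is easy enough to sketch outside
of Ceph. Something like the following (purely illustrative, with made-up OSD
names and checksums; this is not code that Ceph itself runs, just the voting
idea applied to the sizes/checksums you can pull out of the OSD logs):

#!/usr/bin/env python3
# Illustrative 2-of-3 vote over per-replica (size, checksum) pairs.
from collections import Counter

def pick_authoritative(replicas):
    """replicas: dict mapping OSD name -> (size, checksum) for one object.
    Returns (good_osds, bad_osds); bails out if there is no strict majority,
    in which case a human has to decide."""
    value, count = Counter(replicas.values()).most_common(1)[0]
    if count <= len(replicas) // 2:
        raise RuntimeError("no majority -- manual inspection required")
    good = sorted(osd for osd, v in replicas.items() if v == value)
    bad = sorted(osd for osd, v in replicas.items() if v != value)
    return good, bad

# Hypothetical example: osd.12 disagrees with osd.3 and osd.7.
replicas = {
    "osd.3":  (4194304, "0x2d4a4e3c"),
    "osd.7":  (4194304, "0x2d4a4e3c"),
    "osd.12": (4194304, "0xdeadbeef"),
}
print(pick_authoritative(replicas))   # (['osd.3', 'osd.7'], ['osd.12'])

Of course that only helps with 3 or more replicas and when a single copy is
bad; with 2x pools, or when the survivors disagree, there is nothing to vote
with, which is exactly Nick's point above.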
Christian

> -Greg
>
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com] On Behalf Of
> >> Tracy Reed
> >> Sent: 18 February 2017 03:06
> >> To: Shinobu Kinjo <skinjo at redhat.com>
> >> Cc: ceph-users <ceph-users at ceph.com>
> >> Subject: Re: [ceph-users] How safe is ceph pg repair these days?
> >>
> >> Well, that's the question... is that safe? Because the link to the mailing
> >> list post (possibly outdated) says that what you just suggested is
> >> definitely NOT safe. Is the mailing list post wrong? Has the situation
> >> changed? Exactly what does ceph repair do now? I suppose I could go dig
> >> into the code but I'm not an expert and would hate to get it wrong and
> >> post possibly bogus info to the list for other newbies to find and worry
> >> about and possibly lose their data.
> >>
> >> On Fri, Feb 17, 2017 at 06:08:39PM PST, Shinobu Kinjo spake thusly:
> >> > if ``ceph pg deep-scrub <pg id>`` does not work then
> >> > do
> >> > ``ceph pg repair <pg id>``
> >> >
> >> >
> >> > On Sat, Feb 18, 2017 at 10:02 AM, Tracy Reed <treed at ultraviolet.org> wrote:
> >> > > I have a 3 replica cluster. A couple times I have run into
> >> > > inconsistent PGs. I googled it and ceph docs and various blogs say
> >> > > run a repair first. But a couple people on IRC and a mailing list
> >> > > thread from 2015 say that ceph blindly copies the primary over the
> >> > > secondaries and calls it good.
> >> > >
> >> > > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001370.html
> >> > >
> >> > > I sure hope that isn't the case. If so it would seem highly
> >> > > irresponsible to implement such a naive command called "repair". I
> >> > > have recently learned how to properly analyze the OSD logs and
> >> > > manually fix these things, but not before having run repair on a
> >> > > dozen inconsistent PGs. Now I'm worried about what sort of
> >> > > corruption I may have introduced. Repairing things by hand is a
> >> > > simple heuristic based on comparing the size or checksum (as
> >> > > indicated by the logs) for each of the 3 copies and figuring out
> >> > > which is correct. Presumably matching two out of three should win
> >> > > and the odd object out should be deleted, since having the exact same
> >> > > kind of error on two different OSDs is highly improbable. I don't
> >> > > understand why ceph repair wouldn't have done this all along.
> >> > >
> >> > > What is the current best practice in the use of ceph repair?
> >> > >
> >> > > Thanks!
> >> > >
> >> > > --
> >> > > Tracy Reed
> >> > >
> >> > > _______________________________________________
> >> > > ceph-users mailing list
> >> > > ceph-users at lists.ceph.com
> >> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> > >
> >>
> >> --
> >> Tracy Reed
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users at lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

--
Christian Balzer        Network/Systems Engineer
chibi at gol.com          Global OnLine Japan/Rakuten Communications
http://www.gol.com/