Hello,

On Mon, 20 Feb 2017 14:12:52 -0800 Gregory Farnum wrote:

> On Sat, Feb 18, 2017 at 12:39 AM, Nick Fisk <nick at fisk.me.uk> wrote:
> > From what I understand in Jewel+ Ceph has the concept of an authoritative
> > shard, so in the case of a 3x replica pool it will notice that 2 replicas
> > match and one doesn't and use one of the good replicas. However, in a 2x
> > pool you're out of luck.
> >
> > However, if someone could confirm my suspicions that would be good as well.
>
> Hmm, I went digging in and sadly this isn't quite right. The code has
> a lot of internal plumbing to allow more smarts than were previously
> feasible and the erasure-coded pools make use of them for noticing
> stuff like local corruption. Replicated pools make an attempt but it's
> not as reliable as one would like and it still doesn't involve any
> kind of voting mechanism.

I seem to recall a lot of that plumbing going in/being talked about but
never going into full action; good to know that I didn't miss anything and
that my memory is still reliable. ^o^

> A self-inconsistent replicated primary won't get chosen. A primary is
> self-inconsistent when its digest doesn't match the data, which
> happens when:
> 1) the object hasn't been written since it was last scrubbed, or
> 2) the object was written in full, or
> 3) the object has only been appended to since the last time its digest
> was recorded, or
> 4) something has gone terribly wrong in/under LevelDB and the omap
> entries don't match what the digest says should be there.
>
> David knows more and can correct me if I'm missing something. He's also
> working on interfaces for scrub that are more friendly in general and
> allow administrators to make more fine-grained decisions about
> recovery in ways that cooperate with RADOS.
>

That is certainly appreciated, especially if it gets backported to versions
where people are stuck with FS-based OSDs.
However, I presume that the main goal and focus is still BlueStore with live
internal checksums that make scrubbing obsolete, right?
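For what it's worth, the "two out of three wins" heuristic that Tracy
describes further down in the quoted thread is easy enough to sketch outside
of Ceph. Something like the following (purely illustrative, with made-up OSD
names and checksums; this is not code that Ceph itself runs, just the voting
idea applied to the sizes/checksums you can pull out of the OSD logs):

#!/usr/bin/env python3
# Illustrative 2-of-3 vote over per-replica (size, checksum) pairs.
from collections import Counter

def pick_authoritative(replicas):
    """replicas: dict mapping OSD name -> (size, checksum) for one object.
    Returns (good_osds, bad_osds); bails out if there is no strict majority,
    in which case a human has to decide."""
    value, count = Counter(replicas.values()).most_common(1)[0]
    if count <= len(replicas) // 2:
        raise RuntimeError("no majority -- manual inspection required")
    good = sorted(osd for osd, v in replicas.items() if v == value)
    bad = sorted(osd for osd, v in replicas.items() if v != value)
    return good, bad

# Hypothetical example: osd.12 disagrees with osd.3 and osd.7.
replicas = {
    "osd.3":  (4194304, "0x2d4a4e3c"),
    "osd.7":  (4194304, "0x2d4a4e3c"),
    "osd.12": (4194304, "0xdeadbeef"),
}
print(pick_authoritative(replicas))   # (['osd.3', 'osd.7'], ['osd.12'])

Of course that only helps with 3 or more replicas and when a single copy is
bad; with 2x pools, or when the survivors disagree, there is nothing to vote
with, which is exactly Nick's point above.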
Christian

> -Greg
>
> >
> >> -----Original Message-----
> >> From: ceph-users [mailto:ceph-users-bounces at lists.ceph.com] On Behalf Of
> >> Tracy Reed
> >> Sent: 18 February 2017 03:06
> >> To: Shinobu Kinjo <skinjo at redhat.com>
> >> Cc: ceph-users <ceph-users at ceph.com>
> >> Subject: Re: [ceph-users] How safe is ceph pg repair these days?
> >>
> >> Well, that's the question... is that safe? Because the link to the mailing
> >> list post (possibly outdated) says that what you just suggested is
> >> definitely NOT safe. Is the mailing list post wrong? Has the situation
> >> changed? Exactly what does ceph repair do now? I suppose I could go dig
> >> into the code but I'm not an expert and would hate to get it wrong and
> >> post possibly bogus info to the list for other newbies to find and worry
> >> about and possibly lose their data.
> >>
> >> On Fri, Feb 17, 2017 at 06:08:39PM PST, Shinobu Kinjo spake thusly:
> >> > if ``ceph pg deep-scrub <pg id>`` does not work then
> >> > do
> >> > ``ceph pg repair <pg id>``
> >> >
> >> >
> >> > On Sat, Feb 18, 2017 at 10:02 AM, Tracy Reed <treed at ultraviolet.org> wrote:
> >> > > I have a 3 replica cluster. A couple times I have run into
> >> > > inconsistent PGs. I googled it and ceph docs and various blogs say
> >> > > run a repair first. But a couple people on IRC and a mailing list
> >> > > thread from 2015 say that ceph blindly copies the primary over the
> >> > > secondaries and calls it good.
> >> > >
> >> > > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001370.html
> >> > >
> >> > > I sure hope that isn't the case. If so it would seem highly
> >> > > irresponsible to implement such a naive command called "repair". I
> >> > > have recently learned how to properly analyze the OSD logs and
> >> > > manually fix these things, but not before having run repair on a
> >> > > dozen inconsistent PGs. Now I'm worried about what sort of
> >> > > corruption I may have introduced. Repairing things by hand is a
> >> > > simple heuristic based on comparing the size or checksum (as
> >> > > indicated by the logs) for each of the 3 copies and figuring out
> >> > > which is correct. Presumably matching two out of three should win
> >> > > and the odd object out should be deleted, since having the exact same
> >> > > kind of error on two different OSDs is highly improbable. I don't
> >> > > understand why ceph repair wouldn't have done this all along.
> >> > >
> >> > > What is the current best practice in the use of ceph repair?
> >> > >
> >> > > Thanks!
> >> > >
> >> > > --
> >> > > Tracy Reed
> >> > >
> >> > > _______________________________________________
> >> > > ceph-users mailing list
> >> > > ceph-users at lists.ceph.com
> >> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> > >
> >>
> >> --
> >> Tracy Reed
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users at lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

--
Christian Balzer        Network/Systems Engineer
chibi at gol.com          Global OnLine Japan/Rakuten Communications
http://www.gol.com/