How safe is ceph pg repair these days?

chibi@xxxxxxx (Christian Balzer) · Tue, 21 Feb 2017 11:38:21 +0900

Hello,

On Mon, 20 Feb 2017 17:15:59 -0800 Gregory Farnum wrote:

> On Mon, Feb 20, 2017 at 4:24 PM, Christian Balzer <chibi at gol.com> wrote:
> >
> > Hello,
> >
> > On Mon, 20 Feb 2017 14:12:52 -0800 Gregory Farnum wrote:
> >  
> >> On Sat, Feb 18, 2017 at 12:39 AM, Nick Fisk <nick at fisk.me.uk> wrote:  
> >> > From what I understand, in Jewel+ Ceph has the concept of an authoritative
> >> > shard, so in the case of a 3x replica pool, it will notice that 2 replicas
> >> > match and one doesn't and use one of the good replicas. However, in a 2x
> >> > pool you're out of luck.
> >> >
> >> > However, if someone could confirm my suspicions that would be good as well.  
> >>
> >> Hmm, I went digging in and sadly this isn't quite right. The code has
> >> a lot of internal plumbing to allow more smarts than were previously
> >> feasible and the erasure-coded pools make use of them for noticing
> >> stuff like local corruption. Replicated pools make an attempt but it's
> >> not as reliable as one would like and it still doesn't involve any
> >> kind of voting mechanism.  
> >
> > I seem to recall a lot of that plumbing going/being talked about, but
> > never going into full action. Good to know that I didn't miss anything and
> > that my memory is still reliable. ^o^
> 
> Yeah. Mixed in with the subtlety are some good use cases, though. For
> instance, anything written with RGW is always going to fit into the
> cases where it will detect a bad primary. RBD is a lot less likely to
> (unless you've done something crazy like set 4K objects, and the VM
> always sends down 4k writes), but since scrubbing fills in the data
> you can count on your snapshots and golden images being
> well-protected. Etc etc.
> 
> >  
> >> A self-inconsistent replicated primary won't get chosen. A primary is
> >> self-inconsistent when its digest doesn't match the data, which
> >> happens when:
> >> 1) the object hasn't been written since it was last scrubbed, or
> >> 2) the object was written in full, or
> >> 3) the object has only been appended to since the last time its digest
> >> was recorded, or
> >> 4) something has gone terribly wrong in/under LevelDB and the omap
> >> entries don't match what the digest says should be there.
> >>
> >> David knows more and can correct me if I'm missing something. He's also
> >> working on interfaces for scrub that are more friendly in general and
> >> allow administrators to make more fine-grained decisions about
> >> recovery in ways that cooperate with RADOS.
> >>  
> > That is certainly appreciated, especially if it gets backported to
> > versions where people are stuck with FS based OSDs.
> >
> > However I presume that the main goal and focus is still BlueStore with
> > live internal checksums that make scrubbing obsolete, right?  
> 
> I'm not sure what you mean. BlueStore certainly has a ton of work
> going on, and we have plans to update scrub/repair to play nicely and
> handle more of the use cases that BlueStore is likely to expose and
> which FileStore did not. But just about all the scrub/repair
> enhancements we're aiming at will work on both systems, and making
> them handle the BlueStore cases may do a lot more proportionally for
> FileStore.

I'm talking about the various discussions here (google for "bluestore
checksum", which also shows talks on devel, unsurprisingly) as well as
Sage's various slides about BlueStore and checksums.
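
To make the digest point from Greg's list concrete, here is a toy Python model. It is only a sketch and not Ceph's actual code: the names Replica, digest, is_self_inconsistent and pick_repair_source are hypothetical, zlib.crc32 stands in for whatever checksum the OSD records, and it reads cases 1 to 3 of the list as "a usable digest is on record", which is an interpretation. It shows that without a voting mechanism a corrupt primary is only skipped when a recorded digest exposes it.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    """Hypothetical stand-in for one copy of an object on one OSD."""
    osd: int
    data: bytes
    # Digest recorded at the last scrub / full write / append.
    # None models an object whose digest was invalidated, e.g. by a
    # partial overwrite (assumption for this sketch).
    recorded_digest: Optional[int]


def digest(data: bytes) -> int:
    # zlib.crc32 is only a stand-in checksum for this illustration.
    return zlib.crc32(data) & 0xFFFFFFFF


def is_self_inconsistent(replica: Replica) -> bool:
    # A copy can only be caught as bad if a digest is on record
    # and it disagrees with the data actually stored.
    if replica.recorded_digest is None:
        return False
    return digest(replica.data) != replica.recorded_digest


def pick_repair_source(replicas: List[Replica]) -> Optional[Replica]:
    """Return the first copy (primary first) that is not provably bad.

    No voting: copies are not compared against each other, so a
    corrupt primary without a usable digest is still chosen.
    """
    for replica in replicas:
        if not is_self_inconsistent(replica):
            return replica
    return None


if __name__ == "__main__":
    good = b"object payload"
    bad = b"object pay1oad"  # simulated corruption on the primary
    d = digest(good)

    # Digest on record: the corrupt primary (osd 0) is skipped.
    with_digest = [Replica(0, bad, d), Replica(1, good, d)]
    print("source with digest:", pick_repair_source(with_digest).osd)

    # No digest on record: the corrupt primary is trusted.
    without_digest = [Replica(0, bad, None), Replica(1, good, None)]
    print("source without digest:", pick_repair_source(without_digest).osd)
```

The second case is the one that matters for RBD-style partial overwrites: nothing in this model (or, per Greg's description, in replicated pools today) compares the copies against each other, so the primary wins by default.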