Hello,

On Mon, 20 Feb 2017 17:15:59 -0800 Gregory Farnum wrote:

> On Mon, Feb 20, 2017 at 4:24 PM, Christian Balzer <chibi at gol.com> wrote:
> >
> > Hello,
> >
> > On Mon, 20 Feb 2017 14:12:52 -0800 Gregory Farnum wrote:
> >
> >> On Sat, Feb 18, 2017 at 12:39 AM, Nick Fisk <nick at fisk.me.uk> wrote:
> >> > From what I understand, in Jewel+ Ceph has the concept of an authoritative
> >> > shard, so in the case of a 3x replica pool, it will notice that 2 replicas
> >> > match and one doesn't and use one of the good replicas. However, in a 2x
> >> > pool you're out of luck.
> >> >
> >> > However, if someone could confirm my suspicions that would be good as well.
> >>
> >> Hmm, I went digging in and sadly this isn't quite right. The code has
> >> a lot of internal plumbing to allow more smarts than were previously
> >> feasible and the erasure-coded pools make use of them for noticing
> >> stuff like local corruption. Replicated pools make an attempt but it's
> >> not as reliable as one would like and it still doesn't involve any
> >> kind of voting mechanism.
> >
> > I seem to recall a lot of that plumbing going/being talked about, but
> > never going into full action, good to know that I didn't miss anything and
> > that my memory is still reliable. ^o^
>
> Yeah. Mixed in with the subtlety are some good use cases, though. For
> instance, anything written with RGW is always going to fit into the
> cases where it will detect a bad primary. RBD is a lot less likely to
> (unless you've done something crazy like set 4K objects, and the VM
> always sends down 4k writes), but since scrubbing fills in the data
> you can count on your snapshots and golden images being
> well-protected. Etc etc.
>
> >
> >> A self-inconsistent replicated primary won't get chosen.
> >> A primary is self-inconsistent when its digest doesn't match the data, which
> >> happens when:
> >> 1) the object hasn't been written since it was last scrubbed, or
> >> 2) the object was written in full, or
> >> 3) the object has only been appended to since the last time its digest
> >> was recorded, or
> >> 4) something has gone terribly wrong in/under LevelDB and the omap
> >> entries don't match what the digest says should be there.
> >>
> >> David knows more and can correct me if I'm missing something. He's also
> >> working on interfaces for scrub that are more friendly in general and
> >> allow administrators to make more fine-grained decisions about
> >> recovery in ways that cooperate with RADOS.
> >>
> > That is certainly appreciated, especially if it gets backported to
> > versions where people are stuck with FS based OSDs.
> >
> > However I presume that the main goal and focus is still BlueStore with
> > live internal checksums that make scrubbing obsolete, right?
>
> I'm not sure what you mean. BlueStore certainly has a ton of work
> going on, and we have plans to update scrub/repair to play nicely and
> handle more of the use cases that BlueStore is likely to expose and
> which FileStore did not. But just about all the scrub/repair
> enhancements we're aiming at will work on both systems, and making
> them handle the BlueStore cases may do a lot more proportionally for
> FileStore.

I'm talking about the various discussions here (google for "bluestore
checksum", which also shows talks on devel, unsurprisingly) as well as
Sage's various slides about BlueStore and checksums.
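
For illustration only, here is a rough sketch of the digest check described in the quoted text above: a replica whose recorded digest no longer matches its data is skipped, and there is no voting between replicas. This is not Ceph code. The `Replica` class, `is_self_inconsistent` and `pick_source` are hypothetical names, and `zlib.crc32` is only a stand-in for whatever checksum Ceph actually records. I am also assuming that a replica without a usable digest simply cannot be checked.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    """Hypothetical stand-in for one replica of an object (not a Ceph type)."""
    osd: str
    data: bytes
    recorded_digest: Optional[int]  # None: no usable digest is recorded


def is_self_inconsistent(replica: Replica) -> bool:
    """True only if a recorded digest exists and disagrees with the data."""
    if replica.recorded_digest is None:
        return False  # assumption: without a digest there is nothing to check
    return zlib.crc32(replica.data) != replica.recorded_digest


def pick_source(replicas: List[Replica]) -> Optional[Replica]:
    """Return the first replica (primary first) that is not provably bad.

    Replicas are never compared against each other, so no voting happens.
    """
    for replica in replicas:
        if not is_self_inconsistent(replica):
            return replica
    return None


good = b"object payload"
digest = zlib.crc32(good)

# Primary is corrupt and has a digest: it is detected and skipped.
with_digest = [Replica("osd.0", b"object paylaod", digest),
               Replica("osd.1", good, digest)]
print(pick_source(with_digest).osd)

# Primary is corrupt but has no digest: nothing flags it, so it is still used.
without_digest = [Replica("osd.0", b"object paylaod", None),
                  Replica("osd.1", good, digest)]
print(pick_source(without_digest).osd)
```

The second case is the gap being discussed: when no digest is available, a bad primary can still be picked, which is why this is less reliable than a real majority vote would be.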