Re: Ceph PG Incomplete = Cluster unusable

On Thu, 8 Jan 2015 21:17:12 -0700 Robert LeBlanc wrote:

> On Thu, Jan 8, 2015 at 8:31 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> > On Thu, 8 Jan 2015 11:41:37 -0700 Robert LeBlanc wrote:
> > Which of course currently means a strongly consistent lockup in these
> > scenarios. ^o^
> 
> That is one way of putting it
> 
If I had the time and, more importantly, the talent to help with code, I'd
do so.
Failing that, pointing out the often painful truth is something I can do.

> > Slightly off-topic and snarky, that strong consistency is of course of
> > limited use when in the case of a corrupted PG Ceph basically asks you
> > to toss a coin.
> > As in minor corruption, impossible for a mere human to tell which
> > replica is the good one, because one OSD is down and the 2 remaining
> > ones differ by one bit or so.
> 
> This is where checksumming is supposed to come in. I think Sage has been
> leading that initiative. 

Yeah, I'm aware of that effort. 
Of course in the meantime even a very simple majority vote would be most
welcome and helpful in nearly all cases (with 3 replicas available).
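
To make the majority-vote idea concrete, here is a minimal sketch in Python.
It is not how Ceph repairs a PG; the function name majority_replica and the
use of SHA-256 to compare replicas are assumptions made purely for
illustration. It only shows that with 3 replicas and one bad copy, the two
matching copies identify the good data without a coin toss.

```python
import hashlib
from collections import Counter
from typing import Optional, Sequence


def majority_replica(replicas: Sequence[bytes]) -> Optional[bytes]:
    """Return the content held by a strict majority of replicas.

    Hypothetical helper: returns None when no version has more than
    half of the votes (e.g. 2 replicas that differ, or 3 that all differ).
    """
    if not replicas:
        return None
    # Compare replicas by digest so large objects are not compared pairwise.
    digests = [hashlib.sha256(r).digest() for r in replicas]
    winner, count = Counter(digests).most_common(1)[0]
    if count * 2 <= len(replicas):
        return None  # no strict majority, a human still has to decide
    return replicas[digests.index(winner)]


if __name__ == "__main__":
    good = b"object payload"
    flipped = b"object pbyload"  # differs from the good copy by one bit
    print(majority_replica([good, flipped, good]))
    print(majority_replica([good, flipped]))
```

With only 2 replicas left that differ, the sketch returns None, which is
exactly the coin-toss situation described above.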

One wonders if this is basically acknowledging that while offloading some
things like checksums to the underlying layer/FS is desirable from a
codebase/effort/complexity view, neither BTRFS nor ZFS is fully production
ready, and neither will be for some time.

> Basically, when an OSD reads an object it should
> be able to tell if there was bit rot by hashing what it just read and
> checking it against the MD5SUM it computed when it first received the
> object. If it doesn't match, it can ask another OSD until it finds one
> that matches.
> 
> This provides a number of benefits:
> 
>    1. Protect against bit rot. Checked on read and on deep scrub.
>    2. Automatically recover the correct version of the object.
>    3. If the client computes the MD5SUM before the data is sent over the
>    wire, the data's integrity can be verified end to end, through the
>    memory of several machines/devices/cables/etc.
>    4. Getting by with "size" 2 is less risky for those who really want to
>    do that.
> 
> With all these benefits, there is a trade-off associated with it, mostly
> CPU. However, with the inclusion of AES in silicon, it may not be a huge
> issue now. But I'm not a programmer, nor familiar enough with this aspect
> of the Ceph code, to be authoritative in any way.

Yup, all very useful and pertinent points.
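
The verify-on-read scheme quoted above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not Ceph
code: checksum and read_verified are hypothetical names, and each replica
is modelled as a plain bytes value rather than a real OSD read.

```python
import hashlib
from typing import Iterable, Optional


def checksum(data: bytes) -> str:
    """MD5 of the object, as it would be recorded when first written."""
    return hashlib.md5(data).hexdigest()


def read_verified(stored_sum: str, replicas: Iterable[bytes]) -> Optional[bytes]:
    """Return the first replica whose checksum matches the stored one.

    Hypothetical helper: 'replicas' stands in for asking one OSD after
    another. Returns None if every copy fails verification.
    """
    for data in replicas:
        if checksum(data) == stored_sum:
            return data  # this copy is intact
        # Mismatch means bit rot on this copy, so try the next replica.
    return None


if __name__ == "__main__":
    original = b"object payload"
    stored = checksum(original)   # computed at write time
    rotted = b"object pbyload"    # one flipped bit
    print(read_verified(stored, [rotted, original]))
```

Because the stored checksum alone decides which copy is good, this works
even with "size" 2, unlike a majority vote.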

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


