Re: Ceph PG Incomplete = Cluster unusable

Christian Balzer <chibi@xxxxxxx> · Fri, 9 Jan 2015 12:31:56 +0900

On Thu, 8 Jan 2015 11:41:37 -0700 Robert LeBlanc wrote:

> On Wed, Jan 7, 2015 at 10:55 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> > Which of course begs the question of why not having min_size at 1
> > permanently, so that in the (hopefully rare) case of losing 2 OSDs at
> > the same time your cluster still keeps working (as it should with a
> > size of 3).
> 
> The idea is that when a write happens at least min_size has it
> committed on disk before the write is committed back to the client.
> Just in case something happens to the disk before it can be
> replicated. It also goes against the strongly consistent model of
> Ceph.
> 
Which of course currently means a strongly consistent lockup in these
scenarios. ^o^
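
To make the lockup concrete, here is a minimal toy model of the rule as discussed in this thread: a PG keeps serving I/O only while the number of surviving replicas is at least min_size. The function name and the simple "size minus failed OSDs" arithmetic are my own assumptions for illustration; this is not Ceph code and it ignores peering, backfill and everything else.

```python
def pg_accepts_io(size: int, min_size: int, failed_osds: int) -> bool:
    """Toy rule (assumption): I/O is allowed while surviving replicas >= min_size."""
    if not 1 <= min_size <= size:
        raise ValueError("min_size must be between 1 and size")
    # Replicas still available after the given number of OSD failures.
    surviving = max(size - failed_osds, 0)
    return surviving >= min_size


if __name__ == "__main__":
    # Compare min_size=2 and min_size=1 for a size=3 pool.
    for min_size in (2, 1):
        for failed in range(4):
            state = "active" if pg_accepts_io(3, min_size, failed) else "blocked"
            print(f"size=3 min_size={min_size} failed={failed}: {state}")
```

With size 3 and min_size 2, losing 2 OSDs blocks I/O in this model; with min_size 1 it keeps going until the last replica is gone.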

Slightly off-topic and snarky: that strong consistency is of limited use
when, in the case of a corrupted PG, Ceph basically asks you to toss a coin.
Think minor corruption where one OSD is down and the 2 remaining replicas
differ by one bit or so, making it impossible for a mere human to tell
which one is good.

> I believe there is work to resolve the issue when the number of
> replicas drops below min_size. Ceph should automatically start
> backfilling to get to at least min_size so that I/O can continue. I
> believe this work is also tied to prioritizing backfilling so that
> things like this are backfilled first, then backfilling from min_size
> to get back to size.
> 
Yeah, I suppose that is what Greg referred to. 
Hopefully soon and backported if possible.

> I am interested in a not-so-strict eventual consistency option in Ceph
> so that under normal circumstances instead of needing [size] writes to
> OSDs to complete, only [min_size] are needed and the primary OSD then
> ensures that the laggy OSD(s) eventually get the write committed.
> 
This is exactly where I was coming from/getting at. 

And basically what artificially setting min_size to 1 in a replica 3
cluster should get you, unless I'm missing something.
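
For clarity, the behaviour being wished for can be sketched as a toy model: the client is acked once a configurable number of replicas have committed, and the rest are tracked as pending for the primary to chase. The class and attribute names are hypothetical and this only illustrates the idea from this thread; it makes no claim about how Ceph actually acknowledges writes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyWrite:
    """Hypothetical model of one write; not a Ceph data structure."""
    replicas: int    # pool "size"
    ack_after: int   # commits required before the client is acked
    committed: set = field(default_factory=set)

    def commit(self, osd_id: int) -> None:
        # Record that this OSD has the write on disk.
        self.committed.add(osd_id)

    @property
    def client_acked(self) -> bool:
        return len(self.committed) >= self.ack_after

    @property
    def pending(self) -> int:
        # Replicas the primary would still have to bring up to date.
        return self.replicas - len(self.committed)


if __name__ == "__main__":
    relaxed = ToyWrite(replicas=3, ack_after=1)
    relaxed.commit(0)
    print(relaxed.client_acked, relaxed.pending)

    strict = ToyWrite(replicas=3, ack_after=3)
    strict.commit(0)
    strict.commit(1)
    print(strict.client_acked, strict.pending)
```

The relaxed variant trades the window in which only one copy exists for lower write latency, which is exactly the risk Robert describes above.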

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/