Re: Write Replication on Degraded PGs

> On 13 Feb 2013 18:16, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:
> >
> > On Wed, Feb 13, 2013 at 3:40 AM, Ben Rowland <ben.rowland@xxxxxxxxx> wrote:

> So it sounds from the rest of your post like you'd want to, for each
> pool that RGW uses (it's not just .rgw), run "ceph osd set .rgw
> min_size 2". (and for .rgw.buckets, etc etc)

Thanks, that did the trick. When the number of up OSDs is less than
min_size, writes block for about 30 seconds and then return HTTP 500.
Ceph honours my CRUSH rule in this case: adding more OSDs to only one
of the two failure domains still leaves writes blocked - all well and
good!
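
For the archives, what I ran was along these lines - note it's "ceph
osd pool set", not "ceph osd set", and the pool names below are just
the defaults RGW created on my cluster, so check the pool list first:

    # list the pools RGW is actually using on this cluster
    rados lspools

    # require at least 2 replicas up before a PG accepts I/O
    for pool in .rgw .rgw.buckets .rgw.control .users; do
        ceph osd pool set $pool min_size 2
    done

    # verify min_size on each pool
    ceph osd dump | grep pool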

> > If this is the expected behaviour of Ceph, then it seems to prefer
> > write-availability over read-availability (in this case my data is
> > only stored on 1 OSD, thus a SPOF).  Is there any way to change this
> > trade-off, e.g. as you can in Cassandra with its write quorums?
>
> I'm not quite sure this is describing it correctly — Ceph guarantees
> that anything that's been written to disk will be readable later on,
> and placement groups won't go active if they can't retrieve all data.
> The sort of flexible policies allowed by Cassandra aren't possible
> within Ceph — it is a strictly consistent system.

Are objects always readable even if a PG is missing some OSDs and
cannot recover? Example: two hosts, each with one OSD, pool min_size
of 2, and a CRUSH rule saying to write to both hosts. I write a file
successfully, then one host goes down and is eventually marked 'out'.
Is the file readable on the 'up' host (say, if I'm running RGW
there)? What if the up host does not hold the primary copy?
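
(For concreteness, the rule I have in mind is the usual
one-replica-per-host rule, roughly like this in decompiled crushtool
syntax; the name and numbers are illustrative:)

    rule replicate_across_hosts {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        # pick one OSD from each distinct host, up to the pool size
        step chooseleaf firstn 0 type host
        step emit
    }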

Furthermore, if Ceph is strictly consistent, how would it resolve
possible stale reads? Say that in the two-host example the network
link between the hosts died, but min_size was set to 1. Could writes
proceed on one side, say edits to an existing object, while readers
at the other host see stale data?
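
(To make that concrete, this is the sort of check I'd run on each
side of the partition - the object name is just an example:)

    # which PG does this object map to, and which OSDs are in its
    # up/acting sets right now?
    ceph osd map .rgw.buckets mybucket_myobject

    # overall PG states after the split
    ceph pg stat
    ceph health detail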

Thanks again in advance,

Ben

