Re: Write Replication on Degraded PGs

On Wed, Feb 13, 2013 at 3:40 AM, Ben Rowland <ben.rowland@xxxxxxxxx> wrote:
> Hi,
>
> Apologies that this is a fairly long post, but hopefully all my
> questions are similar (or even invalid!)
>
> Does Ceph allow writes to proceed if it's not possible to satisfy the
> rules for replica placement across failure domains, as specified in
> the CRUSH map?  For example, if my CRUSH map says to place one replica
> on each of 2 hosts, and no devices are up on one of the hosts, what
> will happen?

As you've discovered, it will write to only the up host. You can
control this behavior by setting the minimum write size on your pools
("ceph osd pool set <pool> min_size <size>") to the minimum number of
copies you want guaranteed on disk. You can also set the "osd pool
default min size" parameter on the monitors; it defaults to 0, which is
interpreted as "half of the requested size, rounded up". (So a size 2
pool will require at least one copy [d'oh], a size 5 pool will require
at least 3 copies, etc.)
So from the rest of your post it sounds like you'd want to run "ceph
osd pool set <pool> min_size 2" for each pool that RGW uses; it's not
just .rgw, but also .rgw.buckets, etc.
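A minimal sketch of what that could look like from the CLI (the pool
names here are just the defaults RGW tends to create; list your pools
first to see what you actually have):

    # See which pools exist (RGW creates several: .rgw, .rgw.buckets, ...)
    ceph osd lspools

    # Require at least 2 copies before the pool's PGs will accept I/O
    ceph osd pool set .rgw min_size 2
    ceph osd pool set .rgw.buckets min_size 2

    # Double-check the value
    ceph osd pool get .rgw min_size

    # Or set the cluster-wide default in ceph.conf before creating pools:
    # [global]
    # osd pool default min size = 2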

> From tests on a small cluster, my finding is that Ceph will allow
> writes to proceed in this case, even when there are less OSDs "in" the
> cluster than the replication size.
>
> If this is the expected behaviour of Ceph, then it seems to prefer
> write-availability over read-availability (in this case my data is
> only stored on 1 OSD, thus a SPOF).  Is there any way to change this
> trade-off, e.g. as you can in Cassandra with its write quorums?

I'm not sure that's quite the right way to describe it: Ceph
guarantees that anything which has been written to disk will be
readable later on, and placement groups won't go active if they can't
retrieve all of their data. The sort of flexible consistency policies
Cassandra allows aren't possible in Ceph; it is a strictly consistent
system.
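If you want to watch this happen on a test cluster, the PG states
reported by the standard CLI will show when data is degraded or a PG
refuses to go active (exact output varies by version):

    # Cluster health, including warnings about degraded or inactive PGs
    ceph health detail

    # One-line summary of PG states (active, degraded, down, ...)
    ceph pg stat

    # PGs that are stuck inactive, i.e. not serving reads or writes
    ceph pg dump_stuck inactive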


> From reading the paper which details the CRUSH replica placement
> algorithm, I understand several concepts:
>
> - The CRUSH algorithm loops over n replicas, for each descending from
> items in the current set (buckets) until it finds a device
> - "CRUSH may reject and reselect items using a modified input for three
> different reasons: if an item has already been selected in the current
> set (a collision..., if a device is failed, or if a device is
> overloaded."
>
> I'm not clear on what will happen if these constraints cannot be met,
> say in the case mentioned above.  Does Ceph store the object once
> only, not meeting the replica size, or does it store the object twice
> on the same OSD somehow?  The latter would violate point 2 above,
> unless the word "reselect" is appropriate here.

CRUSH has a bounded number of retries; if it can't meet the
constraints within them, it spits out whatever replicas it was able to
select. The higher-level system then needs to decide what to do with
that list. Ceph chooses to use whatever CRUSH gives back, subject to
the min_size constraint described above and a few other override
mechanisms we don't need to go into here. :)
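If you're curious what CRUSH actually returns for a given rule and
replica count, you can experiment offline with crushtool (a sketch;
the rule number and file names are placeholders for your own setup):

    # Grab the cluster's CRUSH map and decompile it to have a look
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # Simulate placements for rule 0 with 2 replicas
    crushtool -i crushmap.bin --test --rule 0 --num-rep 2 --show-mappings

    # Show inputs for which CRUSH couldn't find enough replicas
    crushtool -i crushmap.bin --test --rule 0 --num-rep 2 --show-bad-mappings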
-Greg

