Re: Write Replication on Degraded PGs

Hi Sam,

I can still reproduce it.  I'm not clear whether this is actually the
expected behaviour of Ceph: if reads and writes are served by the
primary OSD, and a new primary cannot be 'elected' (say due to a
net-split between failure domains), is the request expected to fail in
order to preserve consistency guarantees?  Or am I missing something?
If that is the case, we'll have to rule out Ceph, as it would not be
appropriate for our use-case.  We need high availability across
failure domains, which could become split from one another (say by a
network failure), resulting in an incomplete PG.  In that case we
still need read availability.

I tried to enable OSD logging by adding "debug osd = 20" to the [osd]
section of my ceph.conf on the requesting machine, but didn't get much
output (see below).  Could the fundamental issue be that the primary
OSD on the other machine is down (intentionally, for our test case)
and no other primary can be elected, since the CRUSH rule demands one
OSD on each host?  Apologies for any speculation on my part here; any
clarification will help a lot!
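
For reference, the relevant part of my ceph.conf now looks like this
(I understand the OSDs need a restart for the setting to take effect,
or it can be injected at runtime with something like "ceph tell osd.0
injectargs '--debug-osd 20'", though I'm not sure of the exact syntax
on this version):

[osd]
        debug osd = 20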

2013-02-18 10:11:51.913256 osd.0 10.9.64.61:6801/25064 5 : [WRN] 2
slow requests, 1 included below; oldest blocked for > 95.700672 secs
2013-02-18 10:11:51.913290 osd.0 10.9.64.61:6801/25064 6 : [WRN] slow
request 30.976297 seconds old, received at 2013-02-18 10:11:20.936876:
osd_op(client.4345.0:29594
4345.365_91bf7acb-8321-494e-bc79-6ab1625162bc [getxattrs,stat,read
0~524288] 9.5aaf1592) v4 currently reached pg
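
If it helps, I can also grab the state of one of the incomplete PGs
with something along these lines (the pg id here is just a
placeholder, not one from my cluster):

$ ceph health detail | grep incomplete
$ ceph pg dump_stuck inactive
$ ceph pg 9.12 query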

Thanks,

Ben

On Sat, Feb 16, 2013 at 5:42 PM, Sam Lang <sam.lang@xxxxxxxxxxx> wrote:
> On Fri, Feb 15, 2013 at 6:29 AM, Ben Rowland <ben.rowland@xxxxxxxxx> wrote:
>> Further to my question about reads on a degraded PG, my tests show
>> that indeed reads from rgw fail when not all OSDs in a PG are up, even
>> when the data is physically available on an up/in OSD.
>>
>> I have a "size" and "min_size" of 2 on my pool, and 2 hosts with 2
>> OSDs on each.  Crush map is set to write to 1 OSD on each of 2 hosts.
>> After writing a file successfully to rgw via host 1, I then stop
>> all Ceph services on host 2.  Attempts to read the file I just wrote
>> time out after 30 seconds.  Starting Ceph again on host 2 allows reads
>> to proceed from host 1 once again.
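>>
>> (For reference, the rule in my CRUSH map that does this looks roughly
>> like the following; the rule name and ruleset number are placeholders
>> paraphrased from a decompiled map:)
>>
>> rule one_per_host {
>>         ruleset 1
>>         type replicated
>>         min_size 1
>>         max_size 10
>>         step take default
>>         step chooseleaf firstn 0 type host
>>         step emit
>> }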
>>
>> I see the following in ceph.log after the read times out:
>>
>> 2013-02-15 12:04:39.162685 osd.0 10.9.64.61:6802/19242 3 : [WRN] slow
>> request 30.461867 seconds old, received at 2013-02-15 12:04:08.700630:
>> osd_op(client.4345.0:21511
>> 4345.365_91bf7acb-8321-494e-bc79-6ab1625162bc [getxattrs,stat,read
>> 0~524288] 9.5aaf1592 RETRY) v4 currently reached pg
>>
>> After stopping Ceph on host 2, "ceph -s" reports:
>>
>>    health HEALTH_WARN 514 pgs degraded; 16 pgs incomplete; 16 pgs
>> stuck inactive; 632 pgs stuck unclean; recovery 44/6804 degraded
>> (0.647%)
>>    monmap e1: 1 mons at {a=10.9.64.61:6789/0}, election epoch 1, quorum 0 a
>>    osdmap e155: 4 osds: 2 up, 2 in
>>     pgmap v4911: 632 pgs: 102 active+remapped, 514 active+degraded, 16
>> incomplete; 844 MB data, 5969 MB used, 2280 MB / 8691 MB avail;
>> 44/6804 degraded (0.647%)
>>    mdsmap e1: 0/0/1 up
>>
>> OSD tree just in case:
>>
>> # id  weight  type name              up/down  reweight
>> -1    2       root default
>> -3    2         rack unknownrack
>> -2    1           host squeezeceph1
>> 0     1             osd.0            up       1
>> 2     1             osd.2            up       1
>> -4    1           host squeezeceph2
>> 1     1             osd.1            down     0
>> 3     0             osd.3            down     0
>>
>> Running "osd map" on both the container and object names say host 1 is
>> "acting" for that PG (not sure if I'm looking at the right pools,
>> though):
>>
>> $ ceph osd map .rgw.buckets aa94e84a-e720-45e1-8c85-4afa7d0f6b5c
>>
>> osdmap e155 pool '.rgw.buckets' (9) object
>> 'aa94e84a-e720-45e1-8c85-4afa7d0f6b5c' -> pg 9.494717b9 (9.1) -> up
>> [0] acting [0]
>>
>> $ ceph osd map .rgw 91bf7acb-8321-494e-bc79-6ab1625162bc
>>
>> osdmap e155 pool '.rgw' (3) object
>> '91bf7acb-8321-494e-bc79-6ab1625162bc' -> pg 3.1db18d16 (3.6) -> up
>> [2] acting [2]
>>
>> Any thoughts?  It doesn't seem right that taking out a single failure
>> domain should cause this degradation.
>
> Hi Ben,
>
> Are you still seeing this?  Can you enable osd logging and restart the
> osds on host 1?
> -sam
>
>>
>> Many thanks,
>>
>> Ben
>>
>> On Thu, Feb 14, 2013 at 11:53 PM, Ben Rowland <ben.rowland@xxxxxxxxx> wrote:
>>>> On 13 Feb 2013 18:16, "Gregory Farnum" <greg@xxxxxxxxxxx> wrote:
>>>> >
>>>> > On Wed, Feb 13, 2013 at 3:40 AM, Ben Rowland <ben.rowland@xxxxxxxxx> wrote:
>>>
>>>> So it sounds from the rest of your post like you'd want to, for each
>>>> pool that RGW uses (it's not just .rgw), run "ceph osd pool set .rgw
>>>> min_size 2" (and likewise for .rgw.buckets, etc.).
>>>
>>> Thanks, that did the trick.  When the number of up OSDs is less than
>>> min_size, writes block for 30s and then return HTTP 500.  Ceph honours
>>> my crush rule in this case - adding more OSDs to only one of the two
>>> failure domains continues to block writes - all well and good!
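>>>
>>> (Something along these lines, repeated for each of the pools rgw
>>> uses; "ceph osd lspools" will list them:)
>>>
>>> $ ceph osd pool set .rgw min_size 2
>>> $ ceph osd pool set .rgw.buckets min_size 2
>>> (and so on for the other rgw pools)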
>>>
>>>> > If this is the expected behaviour of Ceph, then it seems to prefer
>>>> > write-availability over read-availability (in this case my data is
>>>> > only stored on 1 OSD, thus a SPOF).  Is there any way to change this
>>>> > trade-off, e.g. as you can in Cassandra with its write quorums?
>>>>
>>>> I'm not quite sure this is describing it correctly — Ceph guarantees
>>>> that anything that's been written to disk will be readable later on,
>>>> and placement groups won't go active if they can't retrieve all data.
>>>> The sort of flexible policies allowed by Cassandra aren't possible
>>>> within Ceph — it is a strictly consistent system.
>>>
>>> Are objects always readable even if a PG is missing some OSDs and
>>> cannot recover?  Example: 2 hosts, each with 1 OSD, pool min_size of
>>> 2, and a crush rule saying to write to both hosts.  I write a file
>>> successfully, then one host goes down and is eventually marked 'out'.
>>> Is the file readable on the 'up' host (say, if I'm running rgw
>>> there)?  What if the up host does not hold the primary copy?
>>>
>>> Furthermore, if Ceph is strictly consistent, how would it prevent
>>> stale reads?  Say, in the 2-host example, the network connection
>>> between them died but min_size was set to 1.  Would writes be able
>>> to proceed, say making edits to an existing object?  Could readers
>>> at the other host see stale data?
>>>
>>> Thanks again in advance,
>>>
>>> Ben