On Tue, Sep 20, 2016 at 6:19 AM, Iain Buclaw <ibuclaw@xxxxxxxxx> wrote:
> On 1 September 2016 at 23:04, Wido den Hollander <wido@xxxxxxxx> wrote:
>>
>>> Op 1 september 2016 om 17:37 schreef Iain Buclaw <ibuclaw@xxxxxxxxx>:
>>>
>>>
>>> On 16 August 2016 at 17:13, Wido den Hollander <wido@xxxxxxxx> wrote:
>>> >
>>> >> Op 16 augustus 2016 om 15:59 schreef Iain Buclaw <ibuclaw@xxxxxxxxx>:
>>> >>
>>> >>
>>> >> The desired behaviour for me would be for the client to get an instant
>>> >> "not found" response from stat() operations, for write() to recreate
>>> >> unfound objects, and for missing placement groups to be recreated on
>>> >> an OSD that isn't overloaded. Halting the entire cluster when 96% of
>>> >> it can still be accessed is just not workable, I'm afraid.
>>> >>
>>> >
>>> > Well, you can't make Ceph do that, but you can make librados do such a thing.
>>> >
>>> > I'm using the OSD and MON timeout settings in libvirt, for example:
>>> > http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=9665fbca3a18fbfc7e4caec3ee8e991e13513275;hb=HEAD#l157
>>> >
>>> > You can set these options:
>>> > - client_mount_timeout
>>> > - rados_mon_op_timeout
>>> > - rados_osd_op_timeout
>>> >
>>> > I think only the last two should be sufficient in your case.
>>> >
>>> > You will get ETIMEDOUT back as the error when an operation times out.
>>> >
>>> > Wido
>>> >
>>>
>>> This seems to be fine.
>>>
>>> Now what to do when a DR situation happens.
>>>
>>>
>>>     pgmap v592589: 4096 pgs, 1 pools, 1889 GB data, 244 Mobjects
>>>           2485 GB used, 10691 GB / 13263 GB avail
>>>               3902 active+clean
>>>                128 creating
>>>                 66 incomplete
>>>
>>>
>>> These PGs just never seem to finish creating.
>>>
>>
>> I have seen that happen as well; you sometimes need to restart the OSDs to let the create finish.
>>
>> Wido
>>
>
> Just had another DR situation happen over the weekend, and I can
> confirm that setting client-side timeouts did effectively nothing to
> help the situation. According to the Ceph performance stats, the
> total throughput of client operations went from 5000 per second to
> just 20. All clients are set with rados osd op timeout = 0.5, and are
> using AIO.
>
> Why must everything come to a halt internally when 1 of the cluster's
> 30 OSDs is down? I managed only to get it up to 70 ops after forcibly
> completing the PGs (stale+active+clean). Then I got back up to normal
> operations (-ish) after issuing force_create_pg, then stopping and
> starting the OSD where the PG got moved to.
>
> This is something that I'm trying to understand about ceph/librados.
> If one disk is down, the whole system collapses to a trickling low
> rate that is not really any better than being completely down. It's as
> if it cannot cope with losing a disk that holds the only copy of a
> PG.

Yes; the whole system is designed to prevent this. I understand your use
case, but unfortunately Ceph would require a fair bit of surgery to really
be happy as a disposable object store. You might be able to hack it
together by having the OSD checks for down PGs return an error code
instead of putting requests on a waitlist, and by having clients which see
that error send off monitor commands, but it would definitely be a hack.
-Greg
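
For reference, a minimal sketch of the client-side timeout approach Wido
describes above (and roughly what the linked libvirt code does): setting
the three options through the librados C API before connecting, so that
operations against unavailable PGs return -ETIMEDOUT instead of blocking.
The pool name, object name and timeout values below are placeholders, not
taken from the thread.

    /* Sketch only: client-side librados timeouts, as discussed above. */
    #include <rados/librados.h>
    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        rados_t cluster;
        rados_ioctx_t io;
        uint64_t size;
        time_t mtime;
        int r;

        if (rados_create(&cluster, NULL) < 0)       /* default client.admin */
            return 1;
        rados_conf_read_file(cluster, NULL);        /* default ceph.conf search */

        /* The three options mentioned above; the values are examples only. */
        rados_conf_set(cluster, "client_mount_timeout", "30");
        rados_conf_set(cluster, "rados_mon_op_timeout", "5");
        rados_conf_set(cluster, "rados_osd_op_timeout", "0.5");

        if (rados_connect(cluster) < 0) {
            rados_shutdown(cluster);
            return 1;
        }

        /* "mypool" and "some-object" are placeholders. */
        if (rados_ioctx_create(cluster, "mypool", &io) == 0) {
            /* With rados_osd_op_timeout set, an op that would otherwise
             * block on an unavailable PG should come back as -ETIMEDOUT. */
            r = rados_stat(io, "some-object", &size, &mtime);
            if (r == -ETIMEDOUT)
                fprintf(stderr, "osd op timed out: %s\n", strerror(-r));
            rados_ioctx_destroy(io);
        }

        rados_shutdown(cluster);
        return 0;
    }

Note that, per Iain's follow-up, these timeouts only make the client give
up on a stuck request; they did not prevent the cluster-wide throughput
drop he saw when a PG had no surviving copy.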