Re: Auto recovering after losing all copies of a PG(s)

On 1 September 2016 at 23:04, Wido den Hollander <wido@xxxxxxxx> wrote:
>
>> Op 1 september 2016 om 17:37 schreef Iain Buclaw <ibuclaw@xxxxxxxxx>:
>>
>>
>> On 16 August 2016 at 17:13, Wido den Hollander <wido@xxxxxxxx> wrote:
>> >
>> >> Op 16 augustus 2016 om 15:59 schreef Iain Buclaw <ibuclaw@xxxxxxxxx>:
>> >>
>> >>
>> >> The desired behaviour for me would be for the client to get an instant
>> >> "not found" response from stat() operations.  For write() to recreate
>> >> unfound objects.  And for missing placement groups to be recreated on
>> >> an OSD that isn't overloaded.  Halting the entire cluster when 96% of
>> >> it can still be accessed is just not workable, I'm afraid.
>> >>
>> >
>> > Well, you can't make Ceph do that, but you can make librados do such a thing.
>> >
>> > I'm using the OSD and MON timeout settings in libvirt for example: http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=9665fbca3a18fbfc7e4caec3ee8e991e13513275;hb=HEAD#l157
>> >
>> > You can set these options:
>> > - client_mount_timeout
>> > - rados_mon_op_timeout
>> > - rados_osd_op_timeout
>> >
>> > I think the last two alone should be sufficient in your case.
>> >
>> > You will get ETIMEDOUT back as an error when an operation times out.
>> >
>> > Wido
>> >
>>
>> This seems to be fine.
>>
>> Now what to do when a DR situation happens.
>>
>>
>>       pgmap v592589: 4096 pgs, 1 pools, 1889 GB data, 244 Mobjects
>>             2485 GB used, 10691 GB / 13263 GB avail
>>                 3902 active+clean
>>                  128 creating
>>                   66 incomplete
>>
>>
>> These PGs just never seem to finish creating.
>>
>
> I have seen that happen as well; you sometimes need to restart the OSDs to let the create finish.
>
> Wido
>

Just had another DR situation over the weekend, and I can confirm
that setting the client-side timeouts did effectively nothing to help
the situation.  According to the ceph performance stats, the total
throughput of client operations went from 5000 per second to just 20.
All clients are set with rados osd op timeout = 0.5, and are using
AIO.
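
For reference, the clients wire the timeouts up roughly like this (a
minimal sketch against the librados C API, not the actual client
code; the option names are the ones Wido pointed at earlier in the
thread, and the values are only illustrative):

#include <rados/librados.h>
#include <stdio.h>

int main(void)
{
    rados_t cluster;

    if (rados_create(&cluster, "admin") < 0)
        return 1;
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");

    /* Fail fast instead of blocking forever on unreachable MONs/OSDs. */
    rados_conf_set(cluster, "client_mount_timeout", "10");
    rados_conf_set(cluster, "rados_mon_op_timeout", "5");
    rados_conf_set(cluster, "rados_osd_op_timeout", "0.5");

    if (rados_connect(cluster) < 0) {
        fprintf(stderr, "connect failed\n");
        return 1;
    }

    /* ... AIO reads/writes here; an op that hits a down PG should now
     * return -ETIMEDOUT rather than block ... */

    rados_shutdown(cluster);
    return 0;
}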

Why must everything come to a halt internally when 1 of the cluster's
30 OSDs is down?  I only managed to get it up to 70 ops after
forcibly completing the PGs (stale+active+clean).  Then I got back to
normal operations (-ish) after issuing force_create_pg, then stopping
and starting the OSD where the PG got moved to.

This is something that I'm trying to understand about ceph/librados.
If one disk is down, the whole system collapses to a trickling low
rate that is not really any better than being completely down.  It's
as if it cannot cope with losing a disk that holds the only copy of a
PG.

As I've said before, the clients don't really care if data goes
missing or gets lost in the first place.  So long as the data that is
still there can be accessed without disruption, everyone will be
happy.
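
In other words, the error handling we want on the client side is
roughly the following (again a minimal sketch against the librados C
API, not our actual code; the helper name is made up): treat -ENOENT
and -ETIMEDOUT both as "not there" and keep serving everything else.

#include <rados/librados.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Returns 1 if the object is readable, 0 if it is missing or its PG is
 * unreachable, -1 on any other error.  Assumes rados_osd_op_timeout is
 * set as in the snippet above, so ops against a down PG come back with
 * -ETIMEDOUT instead of hanging. */
static int obj_available(rados_ioctx_t io, const char *oid)
{
    rados_completion_t c;
    uint64_t size;
    time_t mtime;
    int ret;

    if (rados_aio_create_completion(NULL, NULL, NULL, &c) < 0)
        return -1;

    rados_aio_stat(io, oid, c, &size, &mtime);
    rados_aio_wait_for_complete(c);
    ret = rados_aio_get_return_value(c);
    rados_aio_release(c);

    if (ret == 0)
        return 1;
    if (ret == -ENOENT || ret == -ETIMEDOUT)
        return 0;   /* lost or unreachable: treat as absent and move on */
    return -1;
}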

Is there a better way to make the cluster behave in this scenario?
As I've said before, the behaviour I'm looking for is for it to just
recreate lost PGs and move on with its life, with zero impact on
performance.

Lost data will always be recreated two days later by the clients that
check the validity of what's stored.

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


