Re: Auto recovering after losing all copies of a PG(s)

Iain Buclaw <ibuclaw@xxxxxxxxx> · Thu, 1 Sep 2016 17:37:00 +0200

On 16 August 2016 at 17:13, Wido den Hollander <wido@xxxxxxxx> wrote:
>
>> On 16 August 2016 at 15:59, Iain Buclaw <ibuclaw@xxxxxxxxx> wrote:
>>
>>
>> The desired behaviour for me would be for the client to get an instant
>> "not found" response from stat() operations.  For write() to recreate
>> unfound objects.  And for missing placement groups to be recreated on
>> an OSD that isn't overloaded.  Halting the entire cluster when 96% of
>> it can still be accessed is just not workable, I'm afraid.
>>
>
> Well, you can't make Ceph do that, but you can make librados do such a thing.
>
> I'm using the OSD and MON timeout settings in libvirt for example: http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=9665fbca3a18fbfc7e4caec3ee8e991e13513275;hb=HEAD#l157
>
> You can set these options:
> - client_mount_timeout
> - rados_mon_op_timeout
> - rados_osd_op_timeout
>
> I think the last two alone should be sufficient in your case.
>
> You will get ETIMEDOUT back as the error when an operation times out.
>
> Wido
>

This seems to be fine.
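
For anyone following along, here is a minimal sketch of the idea in Python: set the three timeout options Wido lists, then treat ETIMEDOUT from a stat as "not found". The 5-second values and the helper names (TIMEOUT_CONF, is_timeout, stat_or_missing) are assumptions made for illustration, not anything from librados itself, and the sketch assumes the client binding reports the timeout as an exception carrying errno ETIMEDOUT.

```python
import errno

# Option names are the ones listed above; the 5-second values are
# illustrative assumptions, tune them for your cluster.
TIMEOUT_CONF = {
    "client_mount_timeout": "5",
    "rados_mon_op_timeout": "5",
    "rados_osd_op_timeout": "5",
}


def is_timeout(exc):
    """Return True if the exception represents ETIMEDOUT."""
    if isinstance(exc, TimeoutError):
        return True
    return getattr(exc, "errno", None) == errno.ETIMEDOUT


def stat_or_missing(stat_fn, oid):
    """Call stat_fn(oid); map a timeout to None ("not found").

    stat_fn is any callable taking an object name (hypothetical here).
    Errors other than a timeout are re-raised unchanged.
    """
    try:
        return stat_fn(oid)
    except Exception as exc:
        if is_timeout(exc):
            return None
        raise
```

With the python-rados bindings, the dictionary could presumably be passed as rados.Rados(conffile="/etc/ceph/ceph.conf", conf=TIMEOUT_CONF) and ioctx.stat used as stat_fn, but check how your version of the bindings reports timeouts before relying on this.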

Now, what to do when a disaster recovery (DR) situation happens?

      pgmap v592589: 4096 pgs, 1 pools, 1889 GB data, 244 Mobjects
            2485 GB used, 10691 GB / 13263 GB avail
                3902 active+clean
                 128 creating
                  66 incomplete

These PGs just never seem to finish creating.

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';