On Tue, Sep 20, 2016 at 6:19 AM, Iain Buclaw <ibuclaw@xxxxxxxxx> wrote:
> On 1 September 2016 at 23:04, Wido den Hollander <wido@xxxxxxxx> wrote:
>>
>>> Op 1 september 2016 om 17:37 schreef Iain Buclaw <ibuclaw@xxxxxxxxx>:
>>>
>>>
>>> On 16 August 2016 at 17:13, Wido den Hollander <wido@xxxxxxxx> wrote:
>>> >
>>> >> Op 16 augustus 2016 om 15:59 schreef Iain Buclaw <ibuclaw@xxxxxxxxx>:
>>> >>
>>> >>
>>> >> The desired behaviour for me would be for the client to get an instant
>>> >> "not found" response from stat() operations, for write() to recreate
>>> >> unfound objects, and for missing placement groups to be recreated on
>>> >> an OSD that isn't overloaded. Halting the entire cluster when 96% of
>>> >> it can still be accessed is just not workable, I'm afraid.
>>> >>
>>> >
>>> > Well, you can't make Ceph do that, but you can make librados do such a thing.
>>> >
>>> > I'm using the OSD and MON timeout settings in libvirt, for example:
>>> > http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=9665fbca3a18fbfc7e4caec3ee8e991e13513275;hb=HEAD#l157
>>> >
>>> > You can set these options:
>>> > - client_mount_timeout
>>> > - rados_mon_op_timeout
>>> > - rados_osd_op_timeout
>>> >
>>> > I think only the last two should be sufficient in your case.
>>> >
>>> > You will get ETIMEDOUT back as the error when an operation times out.
>>> >
>>> > Wido
>>> >
>>>
>>> This seems to be fine.
>>>
>>> Now what to do when a DR situation happens.
>>>
>>>
>>>     pgmap v592589: 4096 pgs, 1 pools, 1889 GB data, 244 Mobjects
>>>           2485 GB used, 10691 GB / 13263 GB avail
>>>               3902 active+clean
>>>                128 creating
>>>                 66 incomplete
>>>
>>>
>>> These PGs just never seem to finish creating.
>>>
>>
>> I have seen that happen as well; you sometimes need to restart the OSDs to let the create finish.
>>
>> Wido
>>
>
> Just had another DR situation happen over the weekend, and I can
> confirm that setting client-side timeouts did effectively nothing to
> help the situation. According to the Ceph performance stats, the
> total throughput of client operations went from 5000 per second to
> just 20. All clients are set with rados osd op timeout = 0.5, and are
> using AIO.
>
> Why must everything come to a halt internally when 1 of the cluster's
> 30 OSDs is down? I managed only to get it up to 70 ops after forcibly
> completing the PGs (stale+active+clean). Then I got back up to normal
> operations (-ish) after issuing force_create_pg, then stopping and
> starting the OSD where the PG got moved to.
>
> This is something that I'm trying to understand about ceph/librados.
> If one disk is down, the whole system collapses to a trickling low
> rate that is not really any better than being completely down. It's as
> if it cannot cope with losing a disk that holds the only copy of a
> PG.

Yes; the whole system is designed to prevent this. I understand your use
case, but unfortunately Ceph would require a fair bit of surgery to really
be happy as a disposable object store. You might be able to hack it
together by having the OSD checks for down PGs return an error code
instead of putting requests on a waitlist, and by having clients which see
that error send off monitor commands, but it would definitely be a hack.
-Greg
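
For reference, a minimal sketch of the client-side timeout approach Wido
describes above (and roughly what the linked libvirt code does): setting
the three options through the librados C API before connecting, so that
operations against unavailable PGs return -ETIMEDOUT instead of blocking.
The pool name, object name and timeout values below are placeholders, not
taken from the thread.

    /* Sketch only: client-side librados timeouts, as discussed above. */
    #include <rados/librados.h>
    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        rados_t cluster;
        rados_ioctx_t io;
        uint64_t size;
        time_t mtime;
        int r;

        if (rados_create(&cluster, NULL) < 0)       /* default client.admin */
            return 1;
        rados_conf_read_file(cluster, NULL);        /* default ceph.conf search */

        /* The three options mentioned above; the values are examples only. */
        rados_conf_set(cluster, "client_mount_timeout", "30");
        rados_conf_set(cluster, "rados_mon_op_timeout", "5");
        rados_conf_set(cluster, "rados_osd_op_timeout", "0.5");

        if (rados_connect(cluster) < 0) {
            rados_shutdown(cluster);
            return 1;
        }

        /* "mypool" and "some-object" are placeholders. */
        if (rados_ioctx_create(cluster, "mypool", &io) == 0) {
            /* With rados_osd_op_timeout set, an op that would otherwise
             * block on an unavailable PG should come back as -ETIMEDOUT. */
            r = rados_stat(io, "some-object", &size, &mtime);
            if (r == -ETIMEDOUT)
                fprintf(stderr, "osd op timed out: %s\n", strerror(-r));
            rados_ioctx_destroy(io);
        }

        rados_shutdown(cluster);
        return 0;
    }

Note that, per Iain's follow-up, these timeouts only make the client give
up on a stuck request; they did not prevent the cluster-wide throughput
drop he saw when a PG had no surviving copy.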