On 1 September 2016 at 23:04, Wido den Hollander <wido@xxxxxxxx> wrote:
>
>> On 1 September 2016 at 17:37, Iain Buclaw <ibuclaw@xxxxxxxxx> wrote:
>>
>>
>> On 16 August 2016 at 17:13, Wido den Hollander <wido@xxxxxxxx> wrote:
>> >
>> >> On 16 August 2016 at 15:59, Iain Buclaw <ibuclaw@xxxxxxxxx> wrote:
>> >>
>> >>
>> >> The desired behaviour for me would be for the client to get an instant
>> >> "not found" response from stat() operations, for write() to recreate
>> >> unfound objects, and for missing placement groups to be recreated on
>> >> an OSD that isn't overloaded. Halting the entire cluster when 96% of
>> >> it can still be accessed is just not workable, I'm afraid.
>> >>
>> >
>> > Well, you can't make Ceph do that, but you can make librados do such a thing.
>> >
>> > I'm using the OSD and MON timeout settings in libvirt, for example:
>> > http://libvirt.org/git/?p=libvirt.git;a=blob;f=src/storage/storage_backend_rbd.c;h=9665fbca3a18fbfc7e4caec3ee8e991e13513275;hb=HEAD#l157
>> >
>> > You can set these options:
>> > - client_mount_timeout
>> > - rados_mon_op_timeout
>> > - rados_osd_op_timeout
>> >
>> > I think only the last two should be sufficient in your case.
>> >
>> > You will get ETIMEDOUT back as an error when an operation times out.
>> >
>> > Wido
>> >
>>
>> This seems to be fine.
>>
>> Now what to do when a DR situation happens.
>>
>>
>>     pgmap v592589: 4096 pgs, 1 pools, 1889 GB data, 244 Mobjects
>>           2485 GB used, 10691 GB / 13263 GB avail
>>               3902 active+clean
>>                128 creating
>>                 66 incomplete
>>
>>
>> These PGs just never seem to finish creating.
>>
>
> I have seen that happen as well; you sometimes need to restart the OSDs to let the create finish.
>
> Wido
>

I had another DR situation happen over the weekend, and I can confirm that
setting client-side timeouts did effectively nothing to help. According to
the Ceph performance stats, total client operation throughput went from
5000 per second to just 20. All clients are set with
rados osd op timeout = 0.5, and are using AIO.

Why must everything come to a halt internally when 1 of 30 OSDs in the
cluster is down?

I only managed to get it up to 70 ops after forcibly marking the PGs
complete (stale+active+clean). I got back to normal(-ish) operations only
after issuing force_create_pg, then stopping and starting the OSD that the
PG had been moved to.

This is something I'm trying to understand about ceph/librados. If one disk
is down, the whole system collapses to a trickle that is not really any
better than being completely down. It's as if it cannot cope with losing a
disk that holds the only copy of a PG.

As I've said before, the clients don't really care if data goes missing or
gets lost. So long as the data that is accessible continues to be served
without disruption, everything will be happy.

Is there a better way to make the cluster behave in this scenario? The
behaviour I'm looking for is for it to just recreate lost PGs and move on
with its life, with zero impact on performance. Lost data will always be
recreated two days later by the clients, which check the validity of
what's stored.

--
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
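
For reference, a minimal sketch of the client-side timeout setup discussed
in the thread, using the librados C API that the linked libvirt code is
built on. The pool name ("mypool"), object name ("myobject"), and the
0.5-second values are illustrative only; the three option names are the
ones quoted above.

/* Sketch: fail client operations quickly with -ETIMEDOUT instead of
 * blocking while a PG is down or incomplete.  Pool/object names and
 * timeout values are placeholders. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <rados/librados.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t io;
    rados_completion_t comp;
    int ret;

    if (rados_create(&cluster, "admin") < 0)        /* connect as client.admin */
        return 1;
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");

    /* Timeouts discussed in the thread; without them these calls block. */
    rados_conf_set(cluster, "client_mount_timeout", "0.5");
    rados_conf_set(cluster, "rados_mon_op_timeout", "0.5");
    rados_conf_set(cluster, "rados_osd_op_timeout", "0.5");

    ret = rados_connect(cluster);
    if (ret < 0) {
        fprintf(stderr, "connect failed: %s\n", strerror(-ret));
        return 1;
    }

    ret = rados_ioctx_create(cluster, "mypool", &io);   /* hypothetical pool */
    if (ret < 0) {
        rados_shutdown(cluster);
        return 1;
    }

    /* Asynchronous write; the completion carries -ETIMEDOUT if the
     * OSD op timeout fires because the object's PG is unavailable. */
    const char buf[] = "payload";
    rados_aio_create_completion(NULL, NULL, NULL, &comp);
    rados_aio_write_full(io, "myobject", comp, buf, sizeof(buf));
    rados_aio_wait_for_complete(comp);
    ret = rados_aio_get_return_value(comp);
    if (ret == -ETIMEDOUT)
        fprintf(stderr, "write timed out, PG likely incomplete or down\n");
    rados_aio_release(comp);

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}

With rados_osd_op_timeout set, an operation that lands on an unavailable PG
returns -ETIMEDOUT to the caller rather than hanging, which gives the
fast-fail behaviour described above; as the thread notes, it does not by
itself restore cluster-wide throughput while PGs remain incomplete.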