Gateway timeout

I have noticed over the years (I've been using Ceph since 2013) that when an OSD attached to a single physical drive (JBOD setup) is failing, it can at times cause the rados gateways to go offline.  I run two clusters (one on Firefly and one on Hammer, both scheduled for upgrades next year), and it happens on both when a drive has many blocked ops requests but has not been marked out.  The drive is physically still functioning but is most likely failing; it just hasn't failed outright yet.  The issue is that the gateways simply stop responding to all requests.  Both of our clusters have 3 rados gateways behind an haproxy load balancer, so we know immediately when they drop.  This will recur continually until we out the failing OSD (normally we restart the gateways, or the services on them, first, then move on to outing the drive).
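For what it's worth, the "blocked ops" warnings in `ceph health detail` usually name the slow OSD, so the manual remediation above can be partially scripted.  A rough sketch follows; the sample line mimics Hammer-era health output, and the exact wording varies between releases, so the regex is an assumption:

```shell
#!/bin/sh
# Sketch: pull the OSD id out of a "blocked ops" health line, then out it.
# The sample line below imitates Hammer-era `ceph health detail` output;
# the precise wording differs across releases, so treat the pattern as
# an assumption to adapt.
line='100 ops are blocked > 32.768 sec on osd.12'

# Extract the numeric id that follows "osd."
osd_id=$(printf '%s\n' "$line" | sed -n 's/.*osd\.\([0-9][0-9]*\).*/\1/p')
echo "would out osd.$osd_id"

# Against a real cluster you would then mark it out (commented here):
# ceph osd out "$osd_id"
```

In practice you would feed it the live output of `ceph health detail` rather than a canned string, and probably want a human in the loop before actually outing anything.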


I wonder if anyone else runs into this; a quick search turned up one hit with no actual resolution.  I'm also wondering if there is some way I could prevent the gateways from falling over due to the unresponsive OSD.


I did set up a test Jewel install in our dev environment and semi-recreated the problem by shutting down all the OSDs.  This took the gateway down completely as well.  I imagine taking the OSDs offline like that wouldn't be expected in normal operation, though.  It would be nice if the gateway would just return a message such as 503 Service Unavailable.  I suppose haproxy is doing this for it, though…
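Since haproxy already sits in front of the gateways, it can at least turn a hung gateway into a clean 503 for clients once active health checks mark it down.  A minimal sketch of the relevant backend, assuming civetweb on port 7480 and placeholder addresses (using a plain GET / as the health check is an assumption; radosgw normally answers an anonymous GET / quickly):

```
backend rgw_back
    balance roundrobin
    # Active health check: a hung gateway stops answering the check and is
    # marked down; once all servers are down, haproxy returns 503 to clients
    # instead of letting their requests stall.
    option httpchk GET /
    timeout server 30s
    server rgw1 192.0.2.11:7480 check inter 2s fall 3 rise 2
    server rgw2 192.0.2.12:7480 check inter 2s fall 3 rise 2
    server rgw3 192.0.2.13:7480 check inter 2s fall 3 rise 2
```

The addresses are documentation placeholders; the `inter`/`fall`/`rise` values are just a starting point to tune against how quickly you want a stalled gateway ejected.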


Regards,

Brent

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
