Downstream IO circuit-breaker in RGW?

Rolland Santimano <rolland.s@xxxxxxxxxxxx> · Wed, 28 Jun 2017 07:07:23 +0530

(Please retain the CC list in your replies)

Our Ceph deployment is an S3 service with an SSD index pool and an HDD
data pool. We often see service outages caused by requests blocked on
latent (high-latency) OSDs, mostly in the index pool.

I have been looking at code changes in the RGW IO path that fence off
latent OSDs or fast-fail IOs targeted at such OSDs, i.e. something like
a circuit-breaker pattern. A "Retry-After" header is inserted in the
responses to user requests that fail this way.
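
As a rough illustration of the idea, here is a minimal Python sketch of a
per-OSD breaker. It is not the actual RGW change (RGW itself is C++), and
every name in it (OsdCircuitBreaker, latency_threshold_s, trip_after,
cooldown_s) is hypothetical. It assumes the caller can attribute an
observed op latency to a target OSD id.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class _OsdState:
    slow_streak: int = 0     # consecutive slow (or timed-out) ops seen
    open_until: float = 0.0  # breaker is open until this clock value


class OsdCircuitBreaker:
    """Local, per-process breaker keyed by OSD id (hypothetical sketch)."""

    def __init__(self,
                 latency_threshold_s: float = 1.0,
                 trip_after: int = 3,
                 cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.latency_threshold_s = latency_threshold_s
        self.trip_after = trip_after
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._osds: Dict[int, _OsdState] = {}

    def retry_after(self, osd_id: int) -> Optional[int]:
        """None if IO may proceed; else whole seconds the client should wait."""
        state = self._osds.get(osd_id)
        if state is None:
            return None
        remaining = state.open_until - self._clock()
        if remaining <= 0:
            return None
        return max(1, math.ceil(remaining))

    def record(self, osd_id: int, latency_s: float) -> None:
        """Feed back the latency of a completed (or timed-out) op."""
        state = self._osds.setdefault(osd_id, _OsdState())
        if latency_s < self.latency_threshold_s:
            state.slow_streak = 0  # a healthy op fully closes the breaker
            return
        state.slow_streak += 1
        if state.slow_streak >= self.trip_after:
            state.open_until = self._clock() + self.cooldown_s
            # Half-open behaviour: after the cooldown, a single slow
            # probe is enough to trip the breaker again.
            state.slow_streak = self.trip_after - 1
```

A request handler would call retry_after() for the target OSD before
issuing the IO; if it returns a number, the request is failed fast and
that number is used as the Retry-After value. Otherwise the IO is issued
and its latency is passed to record().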

The above circuit breaker uses only local knowledge at each RGW, i.e.
there is no central state about latent OSDs at the MON or elsewhere.
Maybe this is something that can be piggy-backed on the OSD map
maintained by the MON, or pushed to ceph-mgr.
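
Purely to illustrate what shared state might look like (nothing like
this exists in the current change), here is a small Python sketch that
merges per-RGW reports of latent OSDs and keeps only the OSDs flagged by
several gateways. The function name, the report format and the
min_reporters threshold are all assumptions, and it says nothing about
how the MON or ceph-mgr would actually carry such data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def merge_latent_reports(reports: Dict[str, Iterable[int]],
                         min_reporters: int = 2) -> Set[int]:
    """Return OSD ids reported as latent by at least min_reporters RGWs.

    reports maps a (hypothetical) RGW id to the OSD ids it currently
    considers latent. Duplicate ids from one RGW count only once.
    """
    reporters = defaultdict(set)
    for rgw_id, osd_ids in reports.items():
        for osd_id in osd_ids:
            reporters[osd_id].add(rgw_id)
    return {osd_id for osd_id, rgws in reporters.items()
            if len(rgws) >= min_reporters}
```

Requiring more than one reporter is one way to avoid fencing an OSD
because of a single gateway's bad network path.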

Any thoughts or suggestions on the above?

(I was not sure whom to address this mail to; please redirect as
appropriate.)

-- 
Rolland