(Please retain the CC list in your replies)

Our Ceph deployment is an S3 service with an SSD index pool and an HDD data pool. We often see service outages due to requests blocking on high-latency ("latent") OSDs, mostly in the index pool.

I have been looking at code changes in the RGW IO path that fence off latent OSDs, or fast-fail IOs targeted at such OSDs; i.e., something like a circuit-breaker pattern. A "Retry-After" header is inserted into the user response for each request failed this way.

The above circuit breaker uses only local knowledge at each RGW; i.e., there is no central state about latent OSDs at the MON or elsewhere. Maybe this is something that could be piggybacked on the OSDMap maintained by the MONs, or pushed to the ceph-mgr.

Any thoughts or suggestions on the above?

(I was not sure which folks to target this mail to; please redirect as appropriate.)

-- Rolland
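
P.S. To make the circuit-breaker idea concrete, below is a minimal sketch of the per-RGW breaker state I have in mind. All names, thresholds, and intervals are illustrative placeholders, not existing RGW/librados code; the real version would hook into the RGW dispatch path and derive the Retry-After value from the breaker's remaining open interval.

// Minimal per-OSD circuit breaker sketch. Everything here
// (class name, thresholds, intervals) is hypothetical.
#include <chrono>
#include <mutex>
#include <unordered_map>

class OsdCircuitBreaker {
public:
  using Clock = std::chrono::steady_clock;

  explicit OsdCircuitBreaker(
      int failure_threshold = 5,
      std::chrono::seconds open_interval = std::chrono::seconds(30))
    : failure_threshold_(failure_threshold),
      open_interval_(open_interval) {}

  // Returns true if IOs to this OSD should be fast-failed (breaker open).
  bool is_open(int osd_id) {
    std::lock_guard<std::mutex> l(lock_);
    auto it = state_.find(osd_id);
    if (it == state_.end())
      return false;
    State& s = it->second;
    if (s.failures < failure_threshold_)
      return false;
    // Half-open: once open_interval_ has elapsed, start letting
    // requests through again; one further failure re-trips the breaker.
    if (Clock::now() - s.opened_at >= open_interval_) {
      s.failures = failure_threshold_ - 1;
      return false;
    }
    return true;
  }

  // Called when an IO to osd_id exceeds the latency SLO or times out.
  void record_failure(int osd_id) {
    std::lock_guard<std::mutex> l(lock_);
    State& s = state_[osd_id];
    if (++s.failures == failure_threshold_)
      s.opened_at = Clock::now();  // trip the breaker
  }

  // Called on a successful, in-SLO IO; closes the breaker.
  void record_success(int osd_id) {
    std::lock_guard<std::mutex> l(lock_);
    state_.erase(osd_id);
  }

private:
  struct State {
    int failures = 0;
    Clock::time_point opened_at;
  };
  int failure_threshold_;
  std::chrono::seconds open_interval_;
  std::mutex lock_;
  std::unordered_map<int, State> state_;
};

Each RGW would consult is_open(osd) before dispatching an op; if the breaker is open, the request is failed immediately with 503 plus Retry-After instead of queueing behind the slow OSD, and record_failure()/record_success() feed the breaker from completion latencies.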