radosgw - stuck ops

GuangYang <yguang11@xxxxxxxxxxx> · Mon, 3 Aug 2015 18:53:20 -0700

Hi Yehuda,
Recently, on our pre-production clusters (with radosgw), we had an outage in which all radosgw worker threads got stuck and all client requests resulted in 500 errors, because no worker thread was left to handle them.

What we observed on the cluster was a PG stuck in the *peering* state. As a result, every request hitting that PG occupied a worker thread indefinitely, and this gradually tied up all the workers.
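To give a feel for how quickly this can happen, here is a back-of-the-envelope sketch in Python. It assumes (this is a simplification, not something measured) that each request touches exactly one PG chosen uniformly at random, and that a request hitting the stuck PG holds its worker forever. The 8192 PGs are from this cluster; the worker count of 100 is a made-up figure for illustration.

```python
def expected_requests_until_exhaustion(num_pgs: int, stuck_pgs: int, workers: int) -> float:
    """Expected number of requests served before every worker is blocked.

    Under the assumptions above, each request independently hits a stuck PG
    with probability p = stuck_pgs / num_pgs. Losing all workers takes
    `workers` such hits, so the expected request count is workers / p.
    """
    p = stuck_pgs / num_pgs
    return workers / p


# 8192 PGs with 1 stuck (from this cluster); 100 workers is hypothetical.
n = expected_requests_until_exhaustion(num_pgs=8192, stuck_pgs=1, workers=100)
print(n)  # 819200.0
```

At a hypothetical 1000 requests per second, that is roughly 14 minutes until no worker is left. In practice a radosgw request may touch more than one PG, so the real time could be shorter.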

Why the PG got stuck in peering is still under investigation, but on the radosgw side, I am wondering whether we can do anything to improve this case (more specifically, a problem with 1 out of 8192 PGs cascading into a service outage across the entire cluster):

1. The first approach I can think of is to add a timeout at the objecter layer for each op sent to an OSD. I think the complexity comes with WRITE, that is, how do we ensure integrity if we abort at the objecter layer? But for immutable ops, I think we can certainly do this, since at an upper layer we already reply to the client with an error.
2. Shard the thread pool/work queue in radosgw, so that a partial failure would (hopefully) impact only a subset of the worker threads and cause only a partial outage.
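The sharding idea in (2) can be sketched roughly as below. This is a toy Python model, not radosgw code: `ShardedWorkerPool`, `handle_request`, and routing by `pg_id % num_shards` are all hypothetical, and it assumes the target PG is known when the request is queued (in radosgw it may not be, and one request may touch several PGs, so a different shard key might be needed). A `threading.Event` stands in for the PG finishing peering.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ShardedWorkerPool:
    """Route each request to one of several small pools, keyed by PG id."""

    def __init__(self, num_shards: int, workers_per_shard: int):
        self.shards = [ThreadPoolExecutor(max_workers=workers_per_shard)
                       for _ in range(num_shards)]

    def shard_for(self, pg_id: int) -> int:
        # Hypothetical routing rule: any stable mapping would do.
        return pg_id % len(self.shards)

    def submit(self, pg_id: int, fn, *args):
        return self.shards[self.shard_for(pg_id)].submit(fn, *args)

    def shutdown(self):
        for shard in self.shards:
            shard.shutdown(wait=True)


STUCK_PG = 5
pg_recovered = threading.Event()  # set when the stuck PG finishes peering


def handle_request(pg_id: int) -> str:
    if pg_id == STUCK_PG:
        pg_recovered.wait()  # blocks like an op sent to a peering PG
    return f"pg {pg_id} ok"


pool = ShardedWorkerPool(num_shards=4, workers_per_shard=2)

# Two requests to the stuck PG occupy both workers of shard 1.
blocked = [pool.submit(STUCK_PG, handle_request, STUCK_PG) for _ in range(2)]

# PG 6 maps to shard 2, so it is still served.
healthy = pool.submit(6, handle_request, 6)
print(healthy.result(timeout=5))

# PG 9 also maps to shard 1 and waits behind the stuck ops: a partial outage.
queued = pool.submit(9, handle_request, 9)
print(queued.done())

# Simulate the PG recovering so that every thread can finish.
pg_recovered.set()
pool.shutdown()
```

In this model the stuck PG takes out one shard out of four instead of the whole pool. Requests that share a shard with the stuck PG still hang, which is why sharding limits the outage rather than removing it.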

What do you think?

Thanks,
Guang