RE: radosgw - stuck ops

Hi Yehuda,
Thanks for the quick response. My comments inline..

Thanks,
Guang
________________________________
> Date: Tue, 4 Aug 2015 08:41:26 -0700 
> Subject: Re: radosgw - stuck ops 
> From: ysadehwe@xxxxxxxxxx 
> To: yguang11@xxxxxxxxxxx; sweil@xxxxxxxxxx; sjust@xxxxxxxxxx 
> CC: yehuda@xxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx 
> 
> 
> 
> On Mon, Aug 3, 2015 at 6:53 PM, GuangYang 
> <yguang11@xxxxxxxxxxx<mailto:yguang11@xxxxxxxxxxx>> wrote: 
> Hi Yehuda, 
> Recently with our pre-production clusters (with radosgw), we had an 
> outage in which all radosgw worker threads got stuck and all client 
> requests resulted in 500s because no worker thread was available to 
> handle them. 
> 
> What we observed in the cluster is that one PG was stuck in the 
> *peering* state; as a result, every request hitting that PG occupied 
> a worker thread indefinitely, and that gradually stuck all workers. 
> 
> The reason why the PG was stuck at peering is still under investigation, 
> but on the radosgw side, I am wondering if we can pursue anything to 
> improve this case (to be more specific, an issue in 1 out of 8192 PGs 
> cascading into service unavailability across the entire cluster): 
> 
> 1. The first approach I can think of is to add a timeout at the objecter 
> layer for each OP sent to an OSD. I think the complexity comes with WRITE, 
> that is, how do we ensure integrity if we abort at the objecter layer. 
> But for immutable ops, I think we certainly can do this, since at an 
> upper layer we have already replied to the client with an error. 
> 2. Do thread pool / work queue sharding at radosgw, in which case a 
> partial failure would (hopefully) only impact a subset of worker threads 
> and only cause a partial outage. 
> 
> 
> The problem with timeouts is that they are racy and can bring the 
> system into an inconsistent state. For example, an operation takes too 
> long, rgw gets a timeout, but the operation actually completes on the 
> osd. So rgw returns an error, removes the tail and does not 
> complete the write, whereas in practice the new head was already 
> written and points at the newly removed tail. The index would still 
> show the old version of the object as if it were still there. I'm sure we 
> can come up with more scenarios like this that would not be easy to 
> resolve. 
Right, that is my concern as well. We would need a mechanism to preserve
integrity: each write should be all or nothing, never partial, even though we
have already replied to the client with a 500 error.
But that is a problem we probably need to deal with anyway; for example, in our
cluster, each time we detect this kind of availability issue we have to restart
all radosgw daemons to bring the service back, which can also leave some
inconsistent state behind.
I am thinking it might make sense to start with *immutable* requests, for example
bucket listing, object GET/HEAD, etc. We can time out the OSD op whenever we are
timing out toward the client anyway. That should be much easier to implement and
would solve part of the problem (see the sketch below).
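
For the immutable path, here is a rough sketch of the kind of thing I have in
mind, using the librados C++ AIO interface and polling for completion against a
deadline. The function name, the 10ms poll interval and the handling of the
still-in-flight completion after a timeout are all placeholders, not a
worked-out design:

    #include <rados/librados.hpp>
    #include <cerrno>
    #include <chrono>
    #include <string>
    #include <thread>

    // Issue an aio_read and stop waiting after 'timeout'. The op is NOT
    // cancelled on the OSD; we only stop tying up the rgw worker thread.
    // Safe for immutable ops since nothing was modified.
    int read_with_timeout(librados::IoCtx& ioctx, const std::string& oid,
                          librados::bufferlist *pbl, size_t len, uint64_t off,
                          std::chrono::milliseconds timeout)
    {
      librados::AioCompletion *c = librados::Rados::aio_create_completion();
      int r = ioctx.aio_read(oid, c, pbl, len, off);
      if (r < 0) {
        c->release();
        return r;
      }
      auto deadline = std::chrono::steady_clock::now() + timeout;
      while (!c->is_complete()) {
        if (std::chrono::steady_clock::now() > deadline) {
          // Caveat: librados still owns the in-flight op; a real
          // implementation would park the completion somewhere rather
          // than simply abandoning it here.
          return -ETIMEDOUT;   // upper layer has already replied 500 to the client
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
      }
      r = c->get_return_value();
      c->release();
      return r;
    }
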
> The problem with sharding is that large enough objects could end up 
> writing to any pg, so I'm not sure how effective that would be.
Not sure about other radosgw use cases across the community, but for us, at the
moment, 95% of the objects are stored as a single chunk, so this should be
effective for that kind of workload; but yes, we should consider supporting more
general use cases. As a bottom line, it should not make things worse. A rough
sketch of what I mean by sharding is below.
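
Purely as an illustration of the sharding idea (a hypothetical sketch, not
existing radosgw code; the class name, shard count and hash-by-object-name
dispatch are all assumptions on my part, and shutdown/joining of the threads
is omitted):

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    // Hypothetical sharded work queue: each shard has its own threads, so a
    // request stuck on an unhealthy PG can only exhaust one shard's workers.
    class ShardedWorkQueue {
    public:
      ShardedWorkQueue(size_t num_shards, size_t threads_per_shard)
        : shards(num_shards) {
        for (size_t s = 0; s < num_shards; ++s)
          for (size_t t = 0; t < threads_per_shard; ++t)
            workers.emplace_back([this, s] { run(s); });
      }
      // Dispatch by object name so a given object always lands on the same shard.
      void enqueue(const std::string& oid, std::function<void()> req) {
        Shard& sh = shards[std::hash<std::string>{}(oid) % shards.size()];
        {
          std::lock_guard<std::mutex> l(sh.lock);
          sh.q.push(std::move(req));
        }
        sh.cond.notify_one();
      }
    private:
      struct Shard {
        std::mutex lock;
        std::condition_variable cond;
        std::queue<std::function<void()>> q;
      };
      void run(size_t s) {
        Shard& sh = shards[s];
        for (;;) {
          std::unique_lock<std::mutex> l(sh.lock);
          sh.cond.wait(l, [&] { return !sh.q.empty(); });
          auto req = std::move(sh.q.front());
          sh.q.pop();
          l.unlock();
          req();  // may block on a stuck PG, but only within this shard
        }
      }
      std::vector<Shard> shards;
      std::vector<std::thread> workers;  // NOTE: join/shutdown omitted for brevity
    };

In practice we would probably want to shard by the PG the object's head maps to
rather than by object name, but for a mostly single-chunk workload the effect
should be similar.
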
> One solution that I can think of is to determine before the read/write 
> whether the pg we're about to access is healthy (or has been unhealthy 
> for a short period of time), and if not to cancel the request before 
> sending the operation. This could mitigate the problem you're seeing at 
> the expense of availability in some cases. We'd need to have a way to 
> query pg health through librados which we don't have right now afaik. 
That sounds good. The only complexity I can think of is for large objects
which have several chunks: we would need to deal with the write path as well,
since each chunk might map to a different PG. A hypothetical sketch of such a
pre-check is below.
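
Just to make the pre-check idea concrete on the rgw side, a minimal sketch:
pg_is_healthy() stands in for the librados PG-health query that, as you said,
does not exist yet, and guarded_read() plus the -EAGAIN/503 mapping are likewise
just placeholders of mine:

    #include <rados/librados.hpp>
    #include <cerrno>
    #include <string>

    // Entirely hypothetical: a librados-level call reporting whether the PG an
    // object maps to is active (not stuck peering/down). No such API exists
    // today; this only shows what the caller side could look like.
    bool pg_is_healthy(librados::IoCtx& ioctx, const std::string& oid);

    // Fail fast instead of parking a worker thread on a stuck PG.
    int guarded_read(librados::IoCtx& ioctx, const std::string& oid,
                     librados::bufferlist& bl, size_t len, uint64_t off)
    {
      if (!pg_is_healthy(ioctx, oid))
        return -EAGAIN;            // e.g. mapped to a 503 for the client
      return ioctx.read(oid, bl, len, off);
    }

    // For a multi-chunk object, every chunk's PG would need the same check
    // before we start writing, otherwise we are back to the partial-write problem.
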
> Sage / Sam, does that make sense, and/or possible? 
> 
> Yehuda 


