On Tue, 4 Aug 2015, Yehuda Sadeh-Weinraub wrote:
> On Mon, Aug 3, 2015 at 6:53 PM, GuangYang <yguang11@xxxxxxxxxxx> wrote:
> > Hi Yehuda,
> > Recently with our pre-production clusters (with radosgw), we had an
> > outage in which all radosgw worker threads got stuck and all client
> > requests resulted in 500s, because there was no worker thread left
> > to take care of them.
> >
> > What we observed from the cluster is that there was a PG stuck in
> > the *peering* state; as a result, every request hitting that PG
> > would occupy a worker thread indefinitely, and that gradually stuck
> > all workers.
> >
> > The reason why the PG was stuck peering is still under
> > investigation, but on the radosgw side, I am wondering if we can
> > pursue anything to improve this failure mode (to be more specific:
> > an issue with 1 out of 8192 PGs cascading into service
> > unavailability across the entire cluster):
> >
> > 1. The first approach I can think of is to add a timeout at the
> > objecter layer for each OP to an OSD. I think the complexity comes
> > with WRITE, that is, how do we ensure integrity if we abort at the
> > objecter layer. But for immutable ops, I think we certainly can do
> > this, since at the upper layer we have already replied to the
> > client with an error.
> > 2. Do thread pool/work queue sharding at radosgw, in which case a
> > partial failure would (hopefully) only impact a subset of the
> > worker threads and cause only a partial outage.
>
> The problem with timeouts is that they are racy and can bring the
> system into an inconsistent state. For example, an operation takes
> too long, rgw gets a timeout, but the operation actually completes on
> the osd. So rgw returns an error, removes the tail, and does not
> complete the write, whereas in practice the new head was already
> written and points at the newly removed tail. The index would still
> show the old version of the object as if it were still there. I'm
> sure we can come up with more scenarios that I'm not sure we could
> resolve easily.

Yeah, unless the entire request goes boom when we time out. In that
case, it'd look like a radosgw failure (and the head wouldn't get
removed, etc.). This could trivially be done by just setting the
suicide timeouts on the rgw work queue, but in practice I think that
just means all the requests will fail (even ones that were making
progress at the time) *or* all of them will get retried and the list
of 'hung' requests will continue to pile up (unless the original
clients disconnect and the LB/proxy/whatever stops sending them to
rgw?).

> The problem with sharding is that for large enough objects they could
> end up writing to any pg, so I'm not sure how effective that would
> be.

Yeah. In practice, though, the hung threads aren't consuming any
CPU... they're just blocked. I wonder if rgw could go into a mode
where it counts idle vs progressing threads and expands the work queue
so that work still gets done (rough sketch below). Then, once it hits
some threshold and realizes there's a backend hang, it drains the
requests that are still making progress and does an orderly restart.
Ideally, we'd have a way to 'kill' a single in-flight request, in
which case we could time it out and return the appropriate HTTP code
to the front-end, but in lieu of that... :/
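To make the idle-vs-progressing accounting concrete, here's a rough,
untested sketch (all names here are made up; nothing like this exists
in rgw today): each worker stamps a heartbeat whenever it makes
progress, and a watchdog counts stale workers, grows the pool to
compensate, and declares a backend hang once it hits a hard cap.

  // Untested sketch; all names hypothetical.  Each worker stamps a
  // heartbeat when it makes progress; a watchdog counts stale
  // (blocked) workers, grows the pool so work still gets done, and
  // suspects a backend hang once it reaches a hard cap.
  #include <atomic>
  #include <chrono>
  #include <cstdio>
  #include <memory>
  #include <thread>
  #include <vector>

  using Clock = std::chrono::steady_clock;

  struct Worker {
    std::atomic<Clock::rep> beat{0};  // last time this thread progressed
    std::thread th;
  };

  // Stand-in for pulling and servicing one request; an op against a
  // stuck pg would block inside here.
  void handle_one_request() {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }

  void worker_loop(Worker* w, std::atomic<bool>* stop) {
    while (!stop->load()) {
      w->beat.store(Clock::now().time_since_epoch().count());
      handle_one_request();
    }
  }

  int main() {
    std::atomic<bool> stop{false};
    std::vector<std::unique_ptr<Worker>> pool;
    auto spawn = [&] {
      pool.push_back(std::make_unique<Worker>());
      Worker* w = pool.back().get();
      w->beat.store(Clock::now().time_since_epoch().count());
      w->th = std::thread(worker_loop, w, &stop);
    };
    const size_t base = 8, hard_cap = 64;
    for (size_t i = 0; i < base; ++i) spawn();

    const auto stale = std::chrono::seconds(30);  // "not progressing" cutoff
    for (int tick = 0; tick < 5; ++tick) {        // watchdog loop, bounded for demo
      std::this_thread::sleep_for(std::chrono::seconds(1));
      const Clock::rep now = Clock::now().time_since_epoch().count();
      size_t blocked = 0;
      for (auto& w : pool)
        if (now - w->beat.load() > Clock::duration(stale).count())
          ++blocked;
      while (pool.size() < base + blocked && pool.size() < hard_cap)
        spawn();  // expand so work still gets done
      if (pool.size() >= hard_cap)
        std::printf("backend hang suspected: drain and restart\n");
    }
    stop.store(true);
    for (auto& w : pool) w->th.join();
  }

The accounting itself is cheap since blocked threads cost almost
nothing; the part rgw would actually have to build is the orderly
drain-and-restart.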
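And going back to Guang's (1): there is already a blunt client-side
knob, rados_osd_op_timeout, that makes the objecter fail ops with
ETIMEDOUT instead of blocking forever. It is global to the handle and
applies to writes as well as reads, so it has exactly the consistency
caveats Yehuda describes, but for completeness, an untested sketch of
wiring it up:

  // Untested sketch: fail rados ops client-side after a timeout
  // rather than blocking a worker thread forever.  The knob is global
  // to the handle and applies to writes too -- see the caveats above.
  #include <rados/librados.hpp>

  int main() {
    librados::Rados cluster;
    cluster.init("admin");                // client.admin; adjust to taste
    cluster.conf_read_file(nullptr);      // default ceph.conf search path
    cluster.conf_set("rados_osd_op_timeout", "30");  // seconds
    if (cluster.connect() < 0)
      return 1;
    // ops issued through this handle now return -ETIMEDOUT if an osd
    // (e.g. one stuck peering) doesn't complete them in time
    cluster.shutdown();
    return 0;
  }

That at least converts the hang into an error; the hard part remains
what rgw then does about a half-completed write.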
> One solution that I can think of is to determine, before the
> read/write, whether the pg we're about to access is healthy (or has
> only been unhealthy for a short period of time), and if not, to
> cancel the request before sending the operation. This could mitigate
> the problem you're seeing at the expense of availability in some
> cases. We'd need a way to query pg health through librados, which we
> don't have right now afaik.
>
> Sage / Sam, does that make sense, and/or is it possible?

This seems mostly impossible because we don't know ahead of time which
PG(s) a request is going to touch (it'll generally be a lot of them)?

sage
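PS: you can already scrape stuck-pg state out of band through librados
via a mon command, which might be enough for a coarse circuit breaker
(stop accepting requests while any pg is inactive) even if we can't
check per request. Untested sketch:

  // Untested sketch: ask the mons for pgs stuck inactive via the
  // existing mon_command interface (same path as the cli's
  // 'ceph pg dump_stuck inactive').  The json below mirrors what the
  // cli sends; check MonCommands.h for the exact field names.
  #include <rados/librados.hpp>
  #include <iostream>
  #include <string>

  int main() {
    librados::Rados cluster;
    cluster.init("admin");
    cluster.conf_read_file(nullptr);
    if (cluster.connect() < 0)
      return 1;
    librados::bufferlist inbl, outbl;
    std::string outs;
    int r = cluster.mon_command(
        "{\"prefix\": \"pg dump_stuck\", \"stuckops\": [\"inactive\"],"
        " \"format\": \"json\"}",
        inbl, &outbl, &outs);
    if (r == 0)  // empty json list here would mean nothing is stuck
      std::cout << std::string(outbl.c_str(), outbl.length()) << std::endl;
    cluster.shutdown();
    return r;
  }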