On Tue, 4 Aug 2015, Yehuda Sadeh-Weinraub wrote:
> On Mon, Aug 3, 2015 at 6:53 PM, GuangYang <yguang11@xxxxxxxxxxx> wrote:
> > Hi Yehuda,
> > Recently with our pre-production clusters (with radosgw), we had an
> > outage in which all radosgw worker threads got stuck and all client
> > requests resulted in 500s, because there was no worker thread left
> > to take care of them.
> >
> > What we observed from the cluster is that there was a PG stuck in
> > the *peering* state; as a result, every request hitting that PG
> > would occupy a worker thread indefinitely, and that gradually stuck
> > all workers.
> >
> > The reason why the PG was stuck peering is still under
> > investigation, but on the radosgw side, I am wondering if we can
> > pursue anything to improve this failure mode (to be more specific:
> > an issue with 1 out of 8192 PGs cascading into service
> > unavailability across the entire cluster):
> >
> > 1. The first approach I can think of is to add a timeout at the
> > objecter layer for each OP to an OSD. I think the complexity comes
> > with WRITE, that is, how do we ensure integrity if we abort at the
> > objecter layer. But for immutable ops, I think we certainly can do
> > this, since at the upper layer we have already replied to the
> > client with an error.
> > 2. Do thread pool/work queue sharding at radosgw, in which case a
> > partial failure would (hopefully) only impact a subset of the
> > worker threads and cause only a partial outage.
>
> The problem with timeouts is that they are racy and can bring the
> system into an inconsistent state. For example, an operation takes
> too long, rgw gets a timeout, but the operation actually completes on
> the osd. So rgw returns an error, removes the tail, and does not
> complete the write, whereas in practice the new head was already
> written and points at the newly removed tail. The index would still
> show the old version of the object as if it were still there. I'm
> sure we can come up with more scenarios that I'm not sure we could
> resolve easily.

Yeah, unless the entire request goes boom when we time out. In that
case, it'd look like a radosgw failure (and the head wouldn't get
removed, etc.). This could trivially be done by just setting the
suicide timeouts on the rgw work queue, but in practice I think that
just means all the requests will fail (even ones that were making
progress at the time) *or* all of them will get retried and the list
of 'hung' requests will continue to pile up (unless the original
clients disconnect and the LB/proxy/whatever stops sending them to
rgw?).

> The problem with sharding is that for large enough objects they could
> end up writing to any pg, so I'm not sure how effective that would
> be.

Yeah. In practice, though, the hung threads aren't consuming any
CPU... they're just blocked. I wonder if rgw could go into a mode
where it counts idle vs progressing threads and expands the work queue
so that work still gets done (rough sketch below). Then, once it hits
some threshold and realizes there's a backend hang, it drains the
requests that are still making progress and does an orderly restart.
Ideally, we'd have a way to 'kill' a single in-flight request, in
which case we could time it out and return the appropriate HTTP code
to the front-end, but in lieu of that... :/
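To make the idle-vs-progressing accounting concrete, here's a rough,
untested sketch (all names here are made up; nothing like this exists
in rgw today): each worker stamps a heartbeat whenever it makes
progress, and a watchdog counts stale workers, grows the pool to
compensate, and declares a backend hang once it hits a hard cap.

  // Untested sketch; all names hypothetical.  Each worker stamps a
  // heartbeat when it makes progress; a watchdog counts stale
  // (blocked) workers, grows the pool so work still gets done, and
  // suspects a backend hang once it reaches a hard cap.
  #include <atomic>
  #include <chrono>
  #include <cstdio>
  #include <memory>
  #include <thread>
  #include <vector>

  using Clock = std::chrono::steady_clock;

  struct Worker {
    std::atomic<Clock::rep> beat{0};  // last time this thread progressed
    std::thread th;
  };

  // Stand-in for pulling and servicing one request; an op against a
  // stuck pg would block inside here.
  void handle_one_request() {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }

  void worker_loop(Worker* w, std::atomic<bool>* stop) {
    while (!stop->load()) {
      w->beat.store(Clock::now().time_since_epoch().count());
      handle_one_request();
    }
  }

  int main() {
    std::atomic<bool> stop{false};
    std::vector<std::unique_ptr<Worker>> pool;
    auto spawn = [&] {
      pool.push_back(std::make_unique<Worker>());
      Worker* w = pool.back().get();
      w->beat.store(Clock::now().time_since_epoch().count());
      w->th = std::thread(worker_loop, w, &stop);
    };
    const size_t base = 8, hard_cap = 64;
    for (size_t i = 0; i < base; ++i) spawn();

    const auto stale = std::chrono::seconds(30);  // "not progressing" cutoff
    for (int tick = 0; tick < 5; ++tick) {        // watchdog loop, bounded for demo
      std::this_thread::sleep_for(std::chrono::seconds(1));
      const Clock::rep now = Clock::now().time_since_epoch().count();
      size_t blocked = 0;
      for (auto& w : pool)
        if (now - w->beat.load() > Clock::duration(stale).count())
          ++blocked;
      while (pool.size() < base + blocked && pool.size() < hard_cap)
        spawn();  // expand so work still gets done
      if (pool.size() >= hard_cap)
        std::printf("backend hang suspected: drain and restart\n");
    }
    stop.store(true);
    for (auto& w : pool) w->th.join();
  }

The accounting itself is cheap since blocked threads cost almost
nothing; the part rgw would actually have to build is the orderly
drain-and-restart.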
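And going back to Guang's (1): there is already a blunt client-side
knob, rados_osd_op_timeout, that makes the objecter fail ops with
ETIMEDOUT instead of blocking forever. It is global to the handle and
applies to writes as well as reads, so it has exactly the consistency
caveats Yehuda describes, but for completeness, an untested sketch of
wiring it up:

  // Untested sketch: fail rados ops client-side after a timeout
  // rather than blocking a worker thread forever.  The knob is global
  // to the handle and applies to writes too -- see the caveats above.
  #include <rados/librados.hpp>

  int main() {
    librados::Rados cluster;
    cluster.init("admin");                // client.admin; adjust to taste
    cluster.conf_read_file(nullptr);      // default ceph.conf search path
    cluster.conf_set("rados_osd_op_timeout", "30");  // seconds
    if (cluster.connect() < 0)
      return 1;
    // ops issued through this handle now return -ETIMEDOUT if an osd
    // (e.g. one stuck peering) doesn't complete them in time
    cluster.shutdown();
    return 0;
  }

That at least converts the hang into an error; the hard part remains
what rgw then does about a half-completed write.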
> One solution that I can think of is to determine, before the
> read/write, whether the pg we're about to access is healthy (or has
> only been unhealthy for a short period of time), and if not, to
> cancel the request before sending the operation. This could mitigate
> the problem you're seeing at the expense of availability in some
> cases. We'd need a way to query pg health through librados, which we
> don't have right now afaik.
>
> Sage / Sam, does that make sense, and/or is it possible?

This seems mostly impossible because we don't know ahead of time which
PG(s) a request is going to touch (it'll generally be a lot of them)?

sage
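PS: you can already scrape stuck-pg state out of band through librados
via a mon command, which might be enough for a coarse circuit breaker
(stop accepting requests while any pg is inactive) even if we can't
check per request. Untested sketch:

  // Untested sketch: ask the mons for pgs stuck inactive via the
  // existing mon_command interface (same path as the cli's
  // 'ceph pg dump_stuck inactive').  The json below mirrors what the
  // cli sends; check MonCommands.h for the exact field names.
  #include <rados/librados.hpp>
  #include <iostream>
  #include <string>

  int main() {
    librados::Rados cluster;
    cluster.init("admin");
    cluster.conf_read_file(nullptr);
    if (cluster.connect() < 0)
      return 1;
    librados::bufferlist inbl, outbl;
    std::string outs;
    int r = cluster.mon_command(
        "{\"prefix\": \"pg dump_stuck\", \"stuckops\": [\"inactive\"],"
        " \"format\": \"json\"}",
        inbl, &outbl, &outs);
    if (r == 0)  // empty json list here would mean nothing is stuck
      std::cout << std::string(outbl.c_str(), outbl.length()) << std::endl;
    cluster.shutdown();
    return r;
  }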