Thanks to Sage, Yehuda and Sam for the quick replies.

Given the discussion so far, could I summarize it in the following points?

1> The first step we would like to pursue is to implement the following mechanism to avoid infinite waiting on the radosgw side:
  1.1. radosgw - send the OP with a *fast_fail* flag
  1.2. OSD - reply with -EAGAIN if the PG is *inactive* and the *fast_fail* flag is set
  1.3. radosgw - upon receiving -EAGAIN, retry until a timeout interval is reached (probably with some back-off?), and if it eventually fails, convert -EAGAIN to some other error code and pass it to the upper layer.

2> In terms of managing radosgw's worker threads, I think we either pursue Sage's proposal (which could linearly increase the time it takes for all worker threads to get stuck, depending on how many threads we add), or simply try sharding the work queue (for which we already have some basic building blocks)?

Can I start working on a patch for <1> and then <2> as a lower priority?

Thanks,
Guang

----------------------------------------
> Date: Tue, 4 Aug 2015 10:14:06 -0700
> Subject: Re: radosgw - stuck ops
> From: ysadehwe@xxxxxxxxxx
> To: sweil@xxxxxxxxxx
> CC: yguang11@xxxxxxxxxxx; sjust@xxxxxxxxxx; yehuda@xxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
>
> On Tue, Aug 4, 2015 at 10:03 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> On Tue, 4 Aug 2015, Yehuda Sadeh-Weinraub wrote:
>>> On Tue, Aug 4, 2015 at 9:55 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>>> One solution that I can think of is to determine before the read/write
>>>>> whether the pg we're about to access is healthy (or has been unhealthy for a
>>>>> short period of time), and if not to cancel the request before sending the
>>>>> operation. This could mitigate the problem you're seeing at the expense of
>>>>> availability in some cases. We'd need to have a way to query pg health
>>>>> through librados which we don't have right now afaik.
>>>>> Sage / Sam, does that make sense, and/or possible?
>>>>
>>>> This seems mostly impossible because we don't know ahead of time which
>>>> PG(s) a request is going to touch (it'll generally be a lot of them)?
>>>>
>>>
>>> Barring pgls() and such, each rados request that radosgw produces will
>>> only touch a single pg, right?
>>
>> Oh, yeah. I thought you meant before each RGW request. If it's at the
>> rados level then yeah, you could avoid stuck pgs, although I think a
>> better approach would be to make the OSD reply with -EAGAIN in that case
>> so that you know the op didn't happen. There would still be cases (though
>> more rare) where you weren't sure if the op happened or not (e.g., when
>> you send to osd A, it goes down, you resend to osd B, and then you get
>> EAGAIN/timeout).
>
> If done on the client side then we should only make it apply to the
> first request sent. Is it actually a problem if the osd triggered the
> error?
>
>>
>> What would you do when you get that failure/timeout, though? Is it
>> practical to abort the rgw request handling completely?
>>
>
> It should be like any error that happens through the transaction
> (e.g., client disconnection).
>
> Yehuda
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
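
P.S. To make step 1.3 of the summary concrete, here is a rough sketch of what the radosgw-side retry loop might look like. This is only an illustration under assumptions: retry_on_eagain() is a hypothetical helper (not an existing radosgw or librados function), send_op stands in for "submit the op with the proposed fast_fail flag", the exponential back-off is just one possible policy, and mapping the final failure to -ETIMEDOUT is a placeholder since the thread has not settled on which error code to use.

```cpp
#include <algorithm>
#include <cassert>
#include <cerrno>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of radosgw or librados.
// send_op is assumed to submit one op with the proposed fast_fail flag and
// return 0 on success or a negative errno (e.g. -EAGAIN for an inactive PG).
int retry_on_eagain(const std::function<int()>& send_op,
                    std::chrono::milliseconds timeout,
                    std::chrono::milliseconds initial_backoff,
                    std::chrono::milliseconds max_backoff)
{
  assert(initial_backoff.count() > 0);
  using clock = std::chrono::steady_clock;
  const auto deadline = clock::now() + timeout;
  auto backoff = initial_backoff;

  for (;;) {
    const int r = send_op();
    if (r != -EAGAIN) {
      return r;  // success or an unrelated error: pass through unchanged
    }
    const auto now = clock::now();
    if (now >= deadline) {
      // Assumption: the error code to convert to is still an open question.
      return -ETIMEDOUT;
    }
    // Sleep for the current back-off, but never past the deadline.
    std::this_thread::sleep_for(
        std::min<clock::duration>(backoff, deadline - now));
    // Double the back-off for the next round, capped at max_backoff.
    backoff = std::min(backoff * 2, max_backoff);
  }
}
```

A caller would wrap the actual op submission in a lambda and pass it in together with the timeout and back-off bounds; anything other than -EAGAIN is returned to the caller untouched, so the existing error paths stay as they are.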