RE: radosgw - stuck ops

Hi Yehuda,
On top of the changes for [1], I would propose another change, which exposes the number of *stuck threads* via the admin socket, so that we can build something outside of ceph to check whether all worker threads are stuck and, if so, restart the service.

We could also assert out if all workers are stuck, as an inside-ceph solution... (this is more conservative than using 'hit_suicide_timeout' to assert out when a single thread is stuck).
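
To make the idea concrete, here is a minimal sketch of what I have in mind (the names, thresholds and structure are made up for illustration, not the actual rgw code): each worker refreshes a heartbeat timestamp as it makes progress, a counter reports how many workers are past the timeout (that count is what the proposed admin socket command would expose), and the in-ceph variant asserts out only when the count reaches the number of workers:

    // Illustrative only -- hypothetical names, not the existing rgw code.
    #include <atomic>
    #include <cassert>
    #include <chrono>
    #include <vector>

    using Clock = std::chrono::steady_clock;

    struct WorkerHeartbeat {
      // Updated by the worker thread each time it picks up or finishes a request.
      std::atomic<Clock::rep> last_progress{Clock::now().time_since_epoch().count()};
      void touch() { last_progress = Clock::now().time_since_epoch().count(); }
    };

    // Number of workers that have made no progress within 'timeout'.  This is
    // the value the proposed admin socket command would report, so an external
    // watchdog can decide to restart the service when it equals the worker count.
    size_t count_stuck(const std::vector<WorkerHeartbeat>& workers,
                       std::chrono::seconds timeout) {
      size_t stuck = 0;
      const auto now = Clock::now().time_since_epoch().count();
      const auto limit = std::chrono::duration_cast<Clock::duration>(timeout).count();
      for (const auto& w : workers) {
        if (now - w.last_progress > limit)
          ++stuck;
      }
      return stuck;
    }

    // In-ceph variant: assert out only when *all* workers are stuck, which is
    // more conservative than asserting as soon as a single thread hits the
    // suicide timeout.
    void check_all_stuck(const std::vector<WorkerHeartbeat>& workers,
                         std::chrono::seconds timeout) {
      assert(count_stuck(workers, timeout) < workers.size());
    }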

What do you think?

[1] https://github.com/ceph/ceph/pull/5501
Thanks,
Guang


----------------------------------------
> Date: Wed, 5 Aug 2015 07:44:34 -0700
> Subject: Re: radosgw - stuck ops
> From: ysadehwe@xxxxxxxxxx
> To: yguang11@xxxxxxxxxxx
> CC: sweil@xxxxxxxxxx; sjust@xxxxxxxxxx; yehuda@xxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
>
> On Tue, Aug 4, 2015 at 3:23 PM, GuangYang <yguang11@xxxxxxxxxxx> wrote:
>> Thanks to Sage, Yehuda and Sam for the quick replies.
>>
>> Given the discussion so far, could I summarize it into the following bullet points:
>>
>> 1> The first step we would like to pursue is to implement the following mechanism to avoid infinite waiting on the radosgw side:
>> 1.1. radosgw - send OP with a *fast_fail* flag
>> 1.2. OSD - reply with -EAGAIN if the PG is *inactive* and the *fast_fail* flag is set
>> 1.3. radosgw - upon receiving -EAGAIN, retry until a timeout interval is reached (probably with some back-off?), and if it eventually fails, convert -EAGAIN to some other error code and pass it to the upper layer (a rough sketch follows below).
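>>
>> A rough sketch of the retry logic in 1.3 (send_op here stands in for whatever librados call ends up carrying the new flag; none of these names are existing API):
>>
>>     #include <algorithm>
>>     #include <cerrno>
>>     #include <chrono>
>>     #include <functional>
>>     #include <thread>
>>
>>     // send_op is a stand-in for the rados operation issued with the new flag;
>>     // it returns 0 on success and -EAGAIN when the OSD says the PG is inactive.
>>     int send_with_retry(const std::function<int()>& send_op,
>>                         std::chrono::seconds deadline) {
>>       const auto start = std::chrono::steady_clock::now();
>>       auto backoff = std::chrono::milliseconds(50);
>>       for (;;) {
>>         int r = send_op();
>>         if (r != -EAGAIN)
>>           return r;            // success or a different error: pass it up
>>         if (std::chrono::steady_clock::now() - start >= deadline)
>>           return -ETIMEDOUT;   // give up: convert -EAGAIN to a terminal error
>>         std::this_thread::sleep_for(backoff);
>>         backoff = std::min(backoff * 2, std::chrono::milliseconds(1000));  // back off
>>       }
>>     }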
>
> I'm not crazy about the 'fast_fail' name; maybe we can come up with a
> more descriptive term. Also, I'm not 100% sure EAGAIN is the error we
> want to see. Maybe the flag on the request could specify which error
> code to return in this case?
> I think it's a good plan to start with, we can adjust things later.
>
>>
>> 2> In terms of management of radosgw's worker threads, I think we could either pursue Sage's proposal (which could linearly increase the time it takes for all worker threads to get stuck, depending on how many threads we add), or simply try sharding the work queue (for which we already have some basic building blocks)? A rough sketch of the sharded queue follows below.
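>>
>> For reference, a minimal sketch of the sharded work queue idea (simplified, not the existing building blocks): requests are hashed to one of N independent queues, each drained by its own worker threads, so a stuck PG only blocks the threads of its shard:
>>
>>     #include <condition_variable>
>>     #include <functional>
>>     #include <mutex>
>>     #include <queue>
>>     #include <string>
>>     #include <vector>
>>
>>     class ShardedWorkQueue {
>>       struct Shard {
>>         std::mutex lock;
>>         std::condition_variable cond;
>>         std::queue<std::function<void()>> q;
>>       };
>>       std::vector<Shard> shards;
>>
>>     public:
>>       explicit ShardedWorkQueue(size_t num_shards) : shards(num_shards) {}
>>
>>       // Hash the request key (e.g. object name) to pick a shard, so that a
>>       // stuck shard only blocks its own worker threads.
>>       void enqueue(const std::string& key, std::function<void()> work) {
>>         Shard& s = shards[std::hash<std::string>{}(key) % shards.size()];
>>         {
>>           std::lock_guard<std::mutex> l(s.lock);
>>           s.q.push(std::move(work));
>>         }
>>         s.cond.notify_one();
>>       }
>>
>>       // Each worker thread is bound to one shard and only drains that shard.
>>       void worker_loop(size_t shard_idx) {
>>         Shard& s = shards[shard_idx];
>>         for (;;) {
>>           std::unique_lock<std::mutex> l(s.lock);
>>           s.cond.wait(l, [&] { return !s.q.empty(); });
>>           auto work = std::move(s.q.front());
>>           s.q.pop();
>>           l.unlock();
>>           work();   // may block on a stuck PG, but other shards keep going
>>         }
>>       }
>>     };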
>
> The problem that I see with that proposal (missed it earlier, only
> seeing it now) is that when the threads actually wake up, the system
> could become unusable. In any case, it's probably a lower priority at
> this point; we could rethink this area again later.
>
> Yehuda
>
>>
>> Can I start working on a patch for <1>, and then on <2> as a lower priority?
>>
>> Thanks,
>> Guang
>> ----------------------------------------
>>> Date: Tue, 4 Aug 2015 10:14:06 -0700
>>> Subject: Re: radosgw - stuck ops
>>> From: ysadehwe@xxxxxxxxxx
>>> To: sweil@xxxxxxxxxx
>>> CC: yguang11@xxxxxxxxxxx; sjust@xxxxxxxxxx; yehuda@xxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
>>>
>>> On Tue, Aug 4, 2015 at 10:03 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>> On Tue, 4 Aug 2015, Yehuda Sadeh-Weinraub wrote:
>>>>> On Tue, Aug 4, 2015 at 9:55 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>>>>> One solution that I can think of is to determine before the read/write
>>>>>>> whether the pg we're about to access is healthy (or has been unhealthy for a
>>>>>>> short period of time), and if not to cancel the request before sending the
>>>>>>> operation. This could mitigate the problem you're seeing at the expense of
>>>>>>> availability in some cases. We'd need to have a way to query pg health
>>>>>>> through librados which we don't have right now afaik.
>>>>>>> Sage / Sam, does that make sense, and/or possible?
>>>>>>
>>>>>> This seems mostly impossible because we don't know ahead of time which
>>>>>> PG(s) a request is going to touch (it'll generally be a lot of them)?
>>>>>>
>>>>>
>>>>> Barring pgls() and such, each rados request that radosgw produces will
>>>>> only touch a single pg, right?
>>>>
>>>> Oh, yeah. I thought you meant before each RGW request. If it's at the
>>>> rados level then yeah, you could avoid stuck pgs, although I think a
>>>> better approach would be to make the OSD reply with -EAGAIN in that case
>>>> so that you know the op didn't happen. There would still be cases (though
>>>> more rare) where you weren't sure if the op happened or not (e.g., when
>>>> you send to osd A, it goes down, you resend to osd B, and then you get
>>>> EAGAIN/timeout).
>>>
>>> If done on the client side then we should only make it apply to the
>>> first request sent. Is it actually a problem if the osd triggered the
>>> error?
>>>
>>>>
>>>> What would you do when you get that failure/timeout, though? Is it
>>>> practical to abort the rgw request handling completely?
>>>>
>>>
>>> It should be like any error that happens through the transaction
>>> (e.g., client disconnection).
>>>
>>> Yehuda