On Mon, Jul 1, 2013 at 8:10 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> On Sun, 30 Jun 2013, Andrey Korolyov wrote:
>> Recently I had an issue with an OSD process sitting on a dying disk -
>> the disk suddenly started remapping sectors internally, so the OSD
>> was stale for a couple of minutes. Unfortunately, flapping prevention
>> was not triggered, since writes were merely degraded, not frozen. It
>> may be worth introducing a self-marking mechanism: a separate thread
>> watching the queue of non-flushed operations and raising a flag when
>> a long-duration watermark is crossed, say, minutes. It would be
>> helpful in combination with a relatively high down_out interval, and
>> in very large setups, where a single degraded storage device can
>> bring the entire data placement to its knees (and, for whatever
>> reason, no flapping shows up). Right now I can do this job with an
>> orchestrator watching per-socket statistics, but that is not very
>> reliable at all.
>
> There is already an internal check that makes the OSD stop
> heartbeating if the internal io thread doesn't make progress for 15
> seconds (by default, IIRC). Was the disk making some progress (just
> very slowly), preventing this from kicking in?

Yep, exactly. I'm talking about the case where the storage is not
completely stuck but is experiencing a large performance drop - none
of the existing built-in mechanisms can catch that situation.

> sage
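
For illustration, something along these lines - a minimal standalone
C++ sketch of the self-marking watchdog described above. This is not
Ceph code: OpTracker, op_start/op_flushed, and the watchdog wiring are
hypothetical stand-ins, and a real OSD would hook into its existing op
tracking and monitor messaging instead.

  // Hypothetical sketch, not Ceph's actual API: OpTracker and the
  // watchdog wiring below are illustrative stand-ins.
  #include <atomic>
  #include <chrono>
  #include <deque>
  #include <iostream>
  #include <mutex>
  #include <thread>

  using Clock = std::chrono::steady_clock;

  class OpTracker {
    std::mutex mtx_;
    std::deque<Clock::time_point> inflight_;  // start times, oldest first
  public:
    void op_start() {
      std::lock_guard<std::mutex> l(mtx_);
      inflight_.push_back(Clock::now());
    }
    void op_flushed() {
      std::lock_guard<std::mutex> l(mtx_);
      if (!inflight_.empty())
        inflight_.pop_front();
    }
    // Age of the oldest non-flushed op; zero if nothing is in flight.
    std::chrono::seconds oldest_age() {
      std::lock_guard<std::mutex> l(mtx_);
      if (inflight_.empty())
        return std::chrono::seconds(0);
      return std::chrono::duration_cast<std::chrono::seconds>(
          Clock::now() - inflight_.front());
    }
  };

  // Separate watchdog thread. Unlike the existing "no progress for
  // 15s" stall check, this fires even while ops keep completing, as
  // long as the oldest one has been pending past the watermark
  // (minutes, per the proposal above).
  void watchdog(OpTracker& ops, std::atomic<bool>& flag_raised,
                std::chrono::seconds watermark) {
    while (!flag_raised.load()) {
      if (ops.oldest_age() >= watermark) {
        flag_raised.store(true);
        // A real OSD would stop heartbeating or ask the monitors to
        // mark it down here; printing stands in for that.
        std::cerr << "oldest op exceeded watermark, raising flag\n";
      }
      std::this_thread::sleep_for(std::chrono::seconds(1));
    }
  }

A production version would of course sample the OSD's own op tracking
rather than keeping a parallel queue, and would want to debounce the
flag so that one slow op on an otherwise healthy disk doesn't take the
whole OSD out.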