Re: Performance events and auto-down marking

Sage Weil <sage@xxxxxxxxxxx> · Sun, 30 Jun 2013 21:10:33 -0700 (PDT)

On Sun, 30 Jun 2013, Andrey Korolyov wrote:
> Recently I had an issue with an OSD process running on top of a dying
> disk: the disk suddenly started doing cluster remapping, so the OSD was
> stale for a couple of minutes. Unfortunately, flapping prevention was
> not triggered, since writes were merely degraded, not frozen. Maybe it
> would be worth introducing a self-marking mechanism: a separate thread
> that watches the queue of non-flushed operations and raises a flag when
> a long-time watermark (say, minutes) is crossed. It would be helpful in
> combination with a relatively high down_out interval and in very large
> setups, where one degraded storage device can bring the entire data
> placement to its knees (and flaps do not occur for some reason).
> Right now I can do this job using an orchestrator that watches
> per-socket statistics, but it is not very reliable at all.

There is already an internal check that makes the OSD stop heartbeating 
if the internal io thread doesn't make progress for 15 seconds (by 
default, IIRC).  Was the disk making some progress (just very slow) 
preventing this from kicking in?
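
To make the distinction concrete, here is a rough sketch (not Ceph code) of
the two kinds of check being discussed: a progress timeout like the one
described above, and the oldest-pending-op watermark proposed in the quoted
mail. The class names, the simulate() helper, the one-op-per-second load and
the 120 second watermark are all made up for illustration; only the 15 second
timeout comes from this thread.

```python
from collections import deque


class ProgressWatchdog:
    """Healthy as long as *some* op completed within `timeout` seconds."""

    def __init__(self, timeout, now):
        self.timeout = timeout
        self.last_progress = now

    def op_completed(self, now):
        self.last_progress = now

    def healthy(self, now):
        return now - self.last_progress < self.timeout


class OldestOpWatchdog:
    """Healthy as long as the oldest unflushed op is younger than `watermark`."""

    def __init__(self, watermark):
        self.watermark = watermark
        self.pending = deque()  # submit timestamps, oldest first

    def op_submitted(self, now):
        self.pending.append(now)

    def op_completed(self):
        self.pending.popleft()

    def healthy(self, now):
        return not self.pending or now - self.pending[0] < self.watermark


def simulate(duration=300, service_time=10, timeout=15, watermark=120):
    """Submit one op per second; the (hypothetical) disk finishes one op
    every `service_time` seconds. Returns the second at which each check
    first reports unhealthy, or None if it never does."""
    progress = ProgressWatchdog(timeout, now=0)
    oldest = OldestOpWatchdog(watermark)
    progress_tripped = oldest_tripped = None
    for t in range(duration):
        oldest.op_submitted(t)
        if t > 0 and t % service_time == 0:
            oldest.op_completed()
            progress.op_completed(t)
        if progress_tripped is None and not progress.healthy(t):
            progress_tripped = t
        if oldest_tripped is None and not oldest.healthy(t):
            oldest_tripped = t
    return progress_tripped, oldest_tripped


if __name__ == "__main__":
    print("slow disk:  ", simulate(service_time=10))
    print("frozen disk:", simulate(service_time=1000))
```

With a disk that still completes an op every 10 seconds, the progress check
never fires even though the backlog keeps growing, while the age-based check
eventually does. With a disk that stops completing ops entirely, both fire.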

sage
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com