On Mon, Jul 1, 2013 at 8:10 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> On Sun, 30 Jun 2013, Andrey Korolyov wrote:
>> Recently I had an issue with an OSD process sitting on a dying disk -
>> the disk suddenly started remapping sectors internally, so the OSD
>> was stale for a couple of minutes. Unfortunately, flapping prevention
>> was not triggered, since writes were merely degraded, not frozen. It
>> may be worth introducing a self-marking mechanism: a separate thread
>> watching the queue of non-flushed operations and raising a flag when
>> a long-duration watermark is crossed, say, minutes. It would be
>> helpful in combination with a relatively high down_out interval, and
>> in very large setups, where a single degraded storage device can
>> bring the entire data placement to its knees (and, for whatever
>> reason, no flapping shows up). Right now I can do this job with an
>> orchestrator watching per-socket statistics, but that is not very
>> reliable at all.
>
> There is already an internal check that makes the OSD stop
> heartbeating if the internal io thread doesn't make progress for 15
> seconds (by default, IIRC). Was the disk making some progress (just
> very slowly), preventing this from kicking in?

Yep, exactly. I'm talking about the case where the storage is not
completely stuck but is experiencing a large performance drop - none
of the existing built-in mechanisms can catch that situation.

> sage
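
For illustration, something along these lines - a minimal standalone
C++ sketch of the self-marking watchdog described above. This is not
Ceph code: OpTracker, op_start/op_flushed, and the watchdog wiring are
hypothetical stand-ins, and a real OSD would hook into its existing op
tracking and monitor messaging instead.

  // Hypothetical sketch, not Ceph's actual API: OpTracker and the
  // watchdog wiring below are illustrative stand-ins.
  #include <atomic>
  #include <chrono>
  #include <deque>
  #include <iostream>
  #include <mutex>
  #include <thread>

  using Clock = std::chrono::steady_clock;

  class OpTracker {
    std::mutex mtx_;
    std::deque<Clock::time_point> inflight_;  // start times, oldest first
  public:
    void op_start() {
      std::lock_guard<std::mutex> l(mtx_);
      inflight_.push_back(Clock::now());
    }
    void op_flushed() {
      std::lock_guard<std::mutex> l(mtx_);
      if (!inflight_.empty())
        inflight_.pop_front();
    }
    // Age of the oldest non-flushed op; zero if nothing is in flight.
    std::chrono::seconds oldest_age() {
      std::lock_guard<std::mutex> l(mtx_);
      if (inflight_.empty())
        return std::chrono::seconds(0);
      return std::chrono::duration_cast<std::chrono::seconds>(
          Clock::now() - inflight_.front());
    }
  };

  // Separate watchdog thread. Unlike the existing "no progress for
  // 15s" stall check, this fires even while ops keep completing, as
  // long as the oldest one has been pending past the watermark
  // (minutes, per the proposal above).
  void watchdog(OpTracker& ops, std::atomic<bool>& flag_raised,
                std::chrono::seconds watermark) {
    while (!flag_raised.load()) {
      if (ops.oldest_age() >= watermark) {
        flag_raised.store(true);
        // A real OSD would stop heartbeating or ask the monitors to
        // mark it down here; printing stands in for that.
        std::cerr << "oldest op exceeded watermark, raising flag\n";
      }
      std::this_thread::sleep_for(std::chrono::seconds(1));
    }
  }

A production version would of course sample the OSD's own op tracking
rather than keeping a parallel queue, and would want to debounce the
flag so that one slow op on an otherwise healthy disk doesn't take the
whole OSD out.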