On Mon, Feb 27, 2017 at 01:48:00PM -0500, Les Stroud wrote:
>
>
> > On Feb 27, 2017, at 1:28 PM, Shaohua Li <shli@xxxxxxxxxx> wrote:
> >
> > On Mon, Feb 27, 2017 at 09:49:59AM -0500, Les Stroud wrote:
> >> After a period of a couple of weeks with one of our test instances having this problem every other day, they were all nice enough to operate without an issue for 9 days. It finally reoccurred last night on one of the machines.
> >>
> >> It exhibits the same symptoms and the call traces look as they did previously. This particular instance is configured with a deadline scheduler. I was able to capture the inflight you requested:
> >>
> >> $ cat /sys/block/xvd[abcde]/inflight
> >> 0 0
> >> 0 0
> >> 0 0
> >> 0 0
> >> 0 0
> >>
> >> I’ve had this happen on instances with the deadline scheduler and the noop scheduler. At this point, I have not had this happen on an instance that is noop and the raid filesystem (ext4) is mounted with nobarrier. The instances with noop/nobarrier have not been running long enough for me to make any sort of conclusion that it works around the problem. Frankly, I’m not sure I understand the interaction between ext4 barriers and raid0 block flushes well enough to theorize whether it should or shouldn’t make a difference.
> >
> > If nobarrier, ext4 doesn't send flush request.
>
> So, could ext4’s flush request deadlock with an md_flush_request? Do they share a mutex of some sort? Could one of them be failing to acquire a mutex and not handling it?

No, it shouldn't deadlock. I don't have other reports of such an issue; yours is the only one.

> >> Does any of this help with identifying the bug? Is there any more information I can get that would be useful?
> >
> > Unfortunately I can't find anything fishy. Does the xvdX disk correctly
> > handle flush requests? For example, you can do the same test with a single such
> > disk and check if anything is wrong.
>
> Until recently, we had a number of these systems set up without raid0.
> This issue never occurred on those systems. Unfortunately, I can’t find a way to make it happen other than standing a server up and letting it run.
>
> I suppose I could try a different filesystem and see if that makes a difference (maybe ext3, xfs, etc).

You could format an xvdX disk, run the same test against it, and check if anything goes wrong. To be honest, I don't think it's a problem on the ext4 side either, but it's better to try other filesystems. If xvdX is a proprietary driver, I highly recommend checking with a single such disk first.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
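[Editor's note: the inflight capture quoted above can be scripted for repeated polling. This is a minimal sketch, not from the thread; the `SYSFS` override and the device names in the usage example are assumptions for testing without real block devices. On a live system it reads `/sys/block/<dev>/inflight`, whose two columns are in-flight reads and writes.]

```shell
# Sketch: report per-device inflight request counts from sysfs.
# SYSFS is overridable so the function can be exercised without real devices.
SYSFS="${SYSFS:-/sys/block}"

report_inflight() {
    for dev in "$@"; do
        f="$SYSFS/$dev/inflight"
        if [ ! -r "$f" ]; then
            echo "$dev: no inflight file"
            continue
        fi
        # the file holds two counters on one line: reads writes
        read -r reads writes < "$f"
        echo "$dev: reads=$reads writes=$writes"
    done
}
```

Run as e.g. `report_inflight xvda xvdb xvdc xvdd xvde`; all-zero counters with hung tasks (as in the report above) point away from requests stuck in the device queue.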
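[Editor's note: a hedged sketch of the single-disk test suggested above, not from the thread. It writes through a filesystem with `dd conv=fsync`, which fsyncs the file before exiting and, on a barrier-enabled ext4 mount, sends a flush request to the underlying device. The mount point passed as `$1` is an assumption (a filesystem on one xvdX disk, no md underneath); pass counts and sizes are arbitrary.]

```shell
# Sketch: hammer the fsync/flush path against a single disk's filesystem.
# $1 = directory on the filesystem under test (assumed mount point),
# $2 = number of write+fsync passes (default 5).
flush_test() {
    target="$1"
    passes="${2:-5}"
    i=1
    while [ "$i" -le "$passes" ]; do
        # conv=fsync makes dd call fsync() on the output file before exiting
        dd if=/dev/zero of="$target/flushtest.$i" bs=4k count=256 conv=fsync 2>/dev/null || return 1
        i=$((i + 1))
    done
    echo "completed $passes fsync passes"
}
```

If the single-disk filesystem survives many passes while the raid0 array hangs, that strengthens the case that the stall is in the md flush handling rather than the xvdX driver.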