Re: Process stuck in md_flush_request (state: D)

Sent from my iPhone

> On Feb 27, 2017, at 7:44 PM, Shaohua Li <shli@xxxxxxxxxx> wrote:
>
>> On Mon, Feb 27, 2017 at 01:48:00PM -0500, Les Stroud wrote:
>>
>>
>>
>>
>>> On Feb 27, 2017, at 1:28 PM, Shaohua Li <shli@xxxxxxxxxx> wrote:
>>>
>>> On Mon, Feb 27, 2017 at 09:49:59AM -0500, Les Stroud wrote:
>>>> After a couple of weeks during which one of our test instances hit this problem every other day, they all ran without issue for 9 days.  It finally reoccurred last night on one of the machines.
>>>>
>>>> It exhibits the same symptoms, and the call traces look as they did previously.  This particular instance is configured with the deadline scheduler.  I was able to capture the inflight counts you requested:
>>>>
>>>> $ cat /sys/block/xvd[abcde]/inflight
>>>>        0        0
>>>>        0        0
>>>>        0        0
>>>>        0        0
>>>>        0        0
>>>>
>>>> I’ve had this happen on instances with the deadline scheduler and on instances with the noop scheduler.  So far, it has not happened on an instance using noop where the raid filesystem (ext4) is mounted with nobarrier.  The noop/nobarrier instances have not been running long enough for me to conclude that this works around the problem, though.  Frankly, I’m not sure I understand the interaction between ext4 barriers and raid0 block flushes well enough to theorize whether it should or shouldn’t make a difference.
>>>
>>> With nobarrier, ext4 doesn't send flush requests.
>>
>> So, could ext4’s flush request deadlock with an md_flush_request?  Do they share a mutex of some sort? Could one of them be failing to acquire a mutex and not handling it?
>
> No, it shouldn't deadlock. I don't have any other reports of this issue; yours is the only one.
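
Fair enough. For reference, the nobarrier instances are mounted roughly like
this (an illustrative fstab line; the device and mount point are examples):

   /dev/md0   /data   ext4   defaults,noatime,nobarrier   0  0

If I understand correctly, with barriers on (the default) ext4 issues a cache
flush at each journal commit, and on the raid0 array that flush is what goes
through md_flush_request, which is where these tasks are getting stuck.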
>
>>>
>>>> Does any of this help with identifying the bug?  Is there anymore information I can get that would be useful?
>>>
>>>
>>> Unfortunately I can't find anything fishy. Does the xvdX disk correctly
>>> handle flush requests? For example, you could run the same test with a single
>>> such disk and check whether anything goes wrong.
>>
I'll test a single-disk config.
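
Probably something along these lines, to hammer the flush path on a single
(non-raid) volume. Just a sketch: /dev/xvdf is an example device name, it
assumes fio is installed, and it is of course destructive to whatever is on
that disk:

  # direct 4k random writes, forcing a flush (fsync) after every
  # write so each one exercises the device's flush handling
  $ fio --name=flush-test --filename=/dev/xvdf --direct=1 \
        --rw=randwrite --bs=4k --size=1g --fsync=1

If that can run for a long stretch without anything landing in D state, it
would point back at the raid0/flush interaction rather than the disk itself.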


>> Until recently, we had a number of these systems set up without raid0.  This issue never occurred on those systems.  Unfortunately, I can’t find a way to make it happen other than to stand a server up and let it run.
>>
>> I suppose I could try a different filesystem and see if that makes a difference (ext3, xfs, etc.).
>
> You could format an xvdX disk, run a test against it, and check whether
> anything goes wrong. To be honest, I don't think the problem is on the ext4
> side either, but it's worth trying other filesystems. If xvdX is a proprietary
> driver, I highly recommend checking with a single such disk first.
>

These disks are AWS EBS volumes, so maybe it is an issue in the Xen virtual
block driver? I'll see if Amazon support can give me any information about
what's happening below the OS.
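
If I understand the xen-blkfront driver right, it logs at probe time whether
the backend supports flushing the disk cache, so the kernel log should show
what each member disk negotiated (the exact wording may vary by kernel
version):

  # per-disk flush/diskcache support reported by the xen block driver
  $ dmesg | grep -i blkfront

  # and the scheduler in use on each member disk
  $ grep . /sys/block/xvd*/queue/scheduler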

Is there any other output that might tell me what the process is waiting on?
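
Beyond the hung-task call traces, the only things I’ve thought to grab are the
per-task kernel stacks and a sysrq dump of all blocked tasks, e.g. (the PID is
an example, and this assumes sysrq is enabled via the kernel.sysrq sysctl):

  # kernel stack of one stuck task
  $ cat /proc/1234/stack

  # dump every uninterruptible (D state) task to the kernel log
  $ echo w > /proc/sysrq-trigger
  $ dmesg | tail -50

If there’s anything else worth capturing the next time it wedges, let me know.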

Thanx,
LES


> Thanks,
> Shaohua



