On Mon, Feb 27, 2017 at 01:48:00PM -0500, Les Stroud wrote:
>
>
> > On Feb 27, 2017, at 1:28 PM, Shaohua Li <shli@xxxxxxxxxx> wrote:
> >
> > On Mon, Feb 27, 2017 at 09:49:59AM -0500, Les Stroud wrote:
> >> After a period of a couple of weeks with one of our test instances having this problem every other day, they were all nice enough to operate without an issue for 9 days. It finally reoccurred last night on one of the machines.
> >>
> >> It exhibits the same symptoms and the call traces look as they did previously. This particular instance is configured with a deadline scheduler. I was able to capture the inflight you requested:
> >>
> >> $ cat /sys/block/xvd[abcde]/inflight
> >> 0 0
> >> 0 0
> >> 0 0
> >> 0 0
> >> 0 0
> >>
> >> I’ve had this happen on instances with the deadline scheduler and the noop scheduler. At this point, I have not had this happen on an instance that is noop and the raid filesystem (ext4) is mounted with nobarrier. The instances with noop/nobarrier have not been running long enough for me to make any sort of conclusion that it works around the problem. Frankly, I’m not sure I understand the interaction between ext4 barriers and raid0 block flushes well enough to theorize whether it should or shouldn’t make a difference.
> >
> > If nobarrier, ext4 doesn't send flush request.
>
> So, could ext4’s flush request deadlock with an md_flush_request? Do they share a mutex of some sort? Could one of them be failing to acquire a mutex and not handling it?

No, it shouldn't deadlock. I don't have other reports of such an issue; yours is the only one.

> >> Does any of this help with identifying the bug? Is there any more information I can get that would be useful?
> >
> > Unfortunately I can't find anything fishy. Does the xvdX disk correctly
> > handle flush requests? For example, you can do the same test with a single such
> > disk and check if anything is wrong.
>
> Until recently, we had a number of these systems set up without raid0.
> This issue never occurred on those systems. Unfortunately, I can’t find a way to make it happen other than standing a server up and letting it run.
>
> I suppose I could try a different filesystem and see if that makes a difference (maybe ext3, xfs, etc).

You could format an xvdX disk, run the same test against it, and check if anything goes wrong. To be honest, I don't think it's a problem on the ext4 side either, but it's better to try other filesystems. If xvdX is a proprietary driver, I highly recommend checking with a single such disk first.

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
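[Editor's note: the inflight capture quoted above can be scripted for repeated polling. This is a minimal sketch, not from the thread; the `SYSFS` override and the device names in the usage example are assumptions for testing without real block devices. On a live system it reads `/sys/block/<dev>/inflight`, whose two columns are in-flight reads and writes.]

```shell
# Sketch: report per-device inflight request counts from sysfs.
# SYSFS is overridable so the function can be exercised without real devices.
SYSFS="${SYSFS:-/sys/block}"

report_inflight() {
    for dev in "$@"; do
        f="$SYSFS/$dev/inflight"
        if [ ! -r "$f" ]; then
            echo "$dev: no inflight file"
            continue
        fi
        # the file holds two counters on one line: reads writes
        read -r reads writes < "$f"
        echo "$dev: reads=$reads writes=$writes"
    done
}
```

Run as e.g. `report_inflight xvda xvdb xvdc xvdd xvde`; all-zero counters with hung tasks (as in the report above) point away from requests stuck in the device queue.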
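[Editor's note: a hedged sketch of the single-disk test suggested above, not from the thread. It writes through a filesystem with `dd conv=fsync`, which fsyncs the file before exiting and, on a barrier-enabled ext4 mount, sends a flush request to the underlying device. The mount point passed as `$1` is an assumption (a filesystem on one xvdX disk, no md underneath); pass counts and sizes are arbitrary.]

```shell
# Sketch: hammer the fsync/flush path against a single disk's filesystem.
# $1 = directory on the filesystem under test (assumed mount point),
# $2 = number of write+fsync passes (default 5).
flush_test() {
    target="$1"
    passes="${2:-5}"
    i=1
    while [ "$i" -le "$passes" ]; do
        # conv=fsync makes dd call fsync() on the output file before exiting
        dd if=/dev/zero of="$target/flushtest.$i" bs=4k count=256 conv=fsync 2>/dev/null || return 1
        i=$((i + 1))
    done
    echo "completed $passes fsync passes"
}
```

If the single-disk filesystem survives many passes while the raid0 array hangs, that strengthens the case that the stall is in the md flush handling rather than the xvdX driver.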