On Thu, Mar 18, 2021 at 02:16:13PM -0400, Eric Whitney wrote:
> As mentioned in today's ext4 concall, I've seen generic/418 fail from time to
> time when run on 5.12-rc3 and 5.12-rc1 kernels. This first occurred when
> running the 1k test case using kvm-xfstests. I was then able to bisect the
> failure to a patch landed in the -rc1 merge window:
>
> (bd8a1f3655a7) mm/filemap: support readpage splitting a page

Thanks for letting me know. This failure is new to me.

I don't understand it; this patch changes the behaviour of buffered reads
from waiting on a page with a refcount held to waiting on a page without
the refcount held, then starting the lookup from scratch once the page is
unlocked. I find it hard to believe this introduces a /new/ failure.
Either it makes an existing failure easier to hit, or there's a subtle
bug in the retry logic that I'm not seeing.

> Typical test output resulting from a failure looks like:
>
>  QA output created by 418
> +cmpbuf: offset 0: Expected: 0x1, got 0x0
> +[6:0] FAIL - comparison failed, offset 3072
> +diotest -w -b 512 -n 8 -i 4 failed at loop 0
>  Silence is golden
>  ...
>
> I've also been able to reproduce the failure on -rc3 in the 4k test case as
> well. The failure frequency there was 10 out of 100 runs. It was anywhere
> from 2 to 8 failures out of 100 runs in the 1k case.
>
> So, the failure isn't dependent upon block size less than page size.

That's a good data point. I'll take a look at g/418 and see if I can
figure out what race we're hitting. Nice that it happens so often.
I suppose I could get you to put some debugging in -- maybe dumping the
page if we hit a contended case, then again if we're retrying?

I presume it doesn't always happen at the same offset or anything
convenient like that.
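
For anyone following along, here is a rough userspace sketch of the behaviour change described above. It is a toy model built only from that description, not the kernel code: every name in it (toy_page, toy_cache, read_wait_with_ref, read_wait_without_ref and so on) is hypothetical, and the "wait" is simulated by running a callback that stands in for whatever the lock holder does before it unlocks the page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model, NOT kernel code. Every name here is hypothetical. */
struct toy_page {
	int refcount;   /* one reference for the cache, plus one per reader */
	bool locked;    /* somebody else holds the page lock */
	bool in_cache;  /* still attached to the cache slot */
};

struct toy_cache {
	struct toy_page *slot;                  /* one-entry "page cache" */
	void (*on_unlock)(struct toy_cache *);  /* lock holder's work before unlock */
};

static struct toy_page page_a, page_b;

/* Look the page up and take a reference, as a reader would. */
static struct toy_page *toy_get(struct toy_cache *c)
{
	struct toy_page *p = c->slot;

	if (p)
		p->refcount++;
	return p;
}

static void toy_put(struct toy_page *p)
{
	p->refcount--;
}

/* Simulated wait: let the lock holder finish, then observe the unlock. */
static void toy_wait_unlocked(struct toy_cache *c, struct toy_page *p)
{
	if (p->locked && c->on_unlock)
		c->on_unlock(c);
	p->locked = false;
}

/* Assumed lock holder: replaces page_a with a fresh page_b in the cache. */
static void toy_replace_page(struct toy_cache *c)
{
	page_a.in_cache = false;
	page_b = (struct toy_page){ .refcount = 1, .locked = false, .in_cache = true };
	c->slot = &page_b;
	c->on_unlock = NULL;
}

/* Start state: page_a is cached and locked by someone else. */
static void toy_setup(struct toy_cache *c)
{
	page_a = (struct toy_page){ .refcount = 1, .locked = true, .in_cache = true };
	page_b = (struct toy_page){ 0 };
	c->slot = &page_a;
	c->on_unlock = toy_replace_page;
}

/* Old style: hold the reference across the wait and return the same page. */
static struct toy_page *read_wait_with_ref(struct toy_cache *c)
{
	struct toy_page *p = toy_get(c);

	if (p)
		toy_wait_unlocked(c, p);
	return p;  /* caller must still check p->in_cache */
}

/* New style: drop the reference, wait, then redo the lookup from scratch. */
static struct toy_page *read_wait_without_ref(struct toy_cache *c)
{
	for (;;) {
		struct toy_page *p = toy_get(c);

		if (!p || !p->locked)
			return p;
		toy_put(p);
		toy_wait_unlocked(c, p);
	}
}
```

In this model, calling toy_setup() and then read_wait_with_ref() hands back the original page, now detached from the cache, so the caller has to notice that itself. Calling toy_setup() and then read_wait_without_ref() hands back whatever page is in the cache after the retry. Any real race would live in the window between the put and the second lookup, which a single-threaded toy cannot reproduce; the sketch only shows the shape of the two paths.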