On Thu, Mar 18, 2021 at 05:38:08PM -0400, Eric Whitney wrote:
> * Matthew Wilcox <willy@xxxxxxxxxxxxx>:
> > On Thu, Mar 18, 2021 at 02:16:13PM -0400, Eric Whitney wrote:
> > > As mentioned in today's ext4 concall, I've seen generic/418 fail from time to
> > > time when run on 5.12-rc3 and 5.12-rc1 kernels. This first occurred when
> > > running the 1k test case using kvm-xfstests. I was then able to bisect the
> > > failure to a patch landed in the -rc1 merge window:
> > >
> > > (bd8a1f3655a7) mm/filemap: support readpage splitting a page
> >
> > Thanks for letting me know. This failure is new to me.
>
> Sure - it's useful to know that it's new to you. Ted said he's also going
> to test XFS with a large number of generic/418 trials which would be a
> useful comparison. However, he's had no luck as yet reproducing what I've
> seen on his Google compute engine test setup running ext4.
>
> > I don't understand it; this patch changes the behaviour of buffered reads
> > from waiting on a page with a refcount held to waiting on a page without
> > the refcount held, then starting the lookup from scratch once the page
> > is unlocked. I find it hard to believe this introduces a /new/ failure.
> > Either it makes an existing failure easier to hit, or there's a subtle
> > bug in the retry logic that I'm not seeing.
>
> For keeping Murphy at bay I'm rerunning the bisection from scratch just
> to make sure I come out at the same patch. The initial bisection looked
> clean, but when dealing with a failure that occurs probabilistically it's
> easy enough to get it wrong. Is this patch revertable in -rc1 or -rc3?
> Ordinarily I like to do that for confirmation.

Alas, not easily. I've built a lot on top of it since then. I could
probably come up with a moral reversion (and will have to if we can't
figure out why it's causing a problem!)

> And there's always the chance that a latent ext4 bug is being hit.

That would also be valuable information to find out. If this patch
is exposing a latent bug, I can't think what it might be.

> I'd be very happy to run whatever debugging patches you might want, though
> you might want to wait until I've reproduced the bisection result. The
> offsets vary, unfortunately - I've seen 1024, 2048, and 3072 reported when
> running a file system with 4k blocks.

As I expected, but thank you for being willing to run debug patches.
I'll wait for you to confirm the bisection and then work up something
that'll help figure out what's going on.