On 9/13/24 11:51 AM, Matthew Wilcox wrote:
> On Fri, Sep 13, 2024 at 11:30:41AM -0400, Chris Mason wrote:
>> I've mentioned this in the past to both Willy and Dave Chinner, but so
>> far all of my attempts to reproduce it on purpose have failed. It's
>> awkward because I don't like to send bug reports that I haven't
>> reproduced on a non-facebook kernel, but I'm pretty confident this bug
>> isn't specific to us.
>
> I don't think the bug is specific to you either. It's been hit by
> several people ... but it's really hard to hit ;-(
>
>> I'll double down on repros again during plumbers and hopefully come up
>> with a recipe for explosions. On other important datapoint is that we
>
> I appreciate the effort!
>
>> The issue looked similar to Christian Theune's rcu stalls, but since it
>> was just one CPU spinning away, I was able to perf probe and drgn my way
>> to some details. The xarray for the file had a series of large folios:
>>
>> [ index 0 large folio from the correct file ]
>> [ index 1: large folio from the correct file ]
>> ...
>> [ index N: large folio from a completely different file ]
>> [ index N+1: large folio from the correct file ]
>>
>> I'm being sloppy with index numbers, but the important part is that
>> we've got a large folio from the wrong file in the middle of the bunch.
>
> If you could get the precise index numbers, that would be an important
> clue. It would be interesting to know the index number in the xarray
> where the folio was found rather than folio->index (as I suspect that
> folio->index is completely bogus because folio->mapping is wrong).
> But gathering that info is going to be hard.

This particular debug session was late at night while we were urgently
trying to roll out some NFS features. I didn't really save many of the
details because my plan was to reproduce it and make a full bug report.

Also, I was explaining the details to people in workplace chat, which is
wildly bad at rendering long lines of structured text, especially when
half the people in the chat are on a mobile device.

You're probably wondering why all of that is important...what I'm really
trying to say is that I've attached a screenshot of the debugging output.

It came from an older drgn script, where I'm still clinging to "radix",
and you probably can't trust the string representation of the page flags
because I wasn't yet using Omar's helpers and may have hard-coded them
from an older kernel.

-chris

Attachment:
xarray-debug.png
Description: PNG image