On 9/12/24 6:25 PM, Linus Torvalds wrote:
> On Thu, 12 Sept 2024 at 15:12, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>
>> When I saw Christian's report, I seemed to recall that we ran into this
>> at Meta too. And we did, and hence have been reverting it since our 5.19
>> release (and hence 6.4, 6.9, and 6.11 next). We should not be shipping
>> things that are known broken.
>
> I do think that if we have big sites just reverting it as known broken
> and can't figure out why, we should do so upstream too.

I've mentioned this in the past to both Willy and Dave Chinner, but so
far all of my attempts to reproduce it on purpose have failed. It's
awkward because I don't like to send bug reports that I haven't
reproduced on a non-facebook kernel, but I'm pretty confident this bug
isn't specific to us.

I'll double down on repros again during plumbers and hopefully come up
with a recipe for explosions. One other important datapoint is that we
also enable huge folios via tmpfs mount -o huge=within_size.

That hasn't hit problems, and we've been doing it for years, but of
course the tmpfs usage is pretty different from iomap/xfs.

We have two workloads that have reliably seen large folio bugs in prod.
This is all on bare metal systems, some are two socket, some single,
nothing really exotic.

1) On 5.19 kernels, knfsd reading and writing to XFS. We needed
O(hundreds) of knfsd servers running for about 8 hours to see one hit.

The issue looked similar to Christian Theune's RCU stalls, but since it
was just one CPU spinning away, I was able to perf probe and drgn my way
to some details. The xarray for the file had a series of large folios:

[ index 0: large folio from the correct file ]
[ index 1: large folio from the correct file ]
...
[ index N: large folio from a completely different file ]
[ index N+1: large folio from the correct file ]

I'm being sloppy with index numbers, but the important part is that
we've got a large folio from the wrong file in the middle of the bunch.
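
To make that layout concrete, here is a toy Python model of the kind of
consistency check that would flag the stray entry. It is only a sketch:
it is not kernel code and not an actual drgn script, and the Folio
class, its fields, the file labels, and find_foreign_folios() are all
made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Folio:
    """Toy stand-in for a page cache folio; only the fields this check needs."""
    mapping: str   # hypothetical label for the file (address_space) that owns it
    index: int     # page offset of the folio's first page within that file
    nr_pages: int  # number of pages the folio covers


def find_foreign_folios(xarray_slots, expected_mapping):
    """Return (slot, folio) pairs whose folio belongs to a different file."""
    return [
        (slot, folio)
        for slot, folio in enumerate(xarray_slots)
        if folio.mapping != expected_mapping
    ]


# Model of the per-file xarray described above: every entry should belong
# to "file_A", but one entry in the middle belongs to another file.
slots = [
    Folio("file_A", 0, 4),
    Folio("file_A", 4, 4),
    Folio("file_B", 64, 4),   # large folio from a completely different file
    Folio("file_A", 12, 4),
]

bad = find_foreign_folios(slots, "file_A")
print(bad)
```

In this model the check reports slot 2, the one entry whose owner does
not match the file being walked.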

filemap_read() iterates over batches of folios from the xarray, but if
one of the folios in the batch has folio->offset out of order with the
rest, the whole thing turns into an infinite loop. It's not really a
filemap_read() bug; the batch coming back from the xarray is just
incorrect.

2) On 6.9 kernels, we saw a BUG_ON() during inode eviction because
mapping->nrpages was non-zero. I'm assuming it's really just a
different window into the same bug. Crash dump analysis was less
conclusive because the xarray itself was always empty, but turning off
large folios made the problem go away.

This happened ~5-10 times a day, and the service had a few thousand
machines running 6.9. If I can't make an artificial repro, I'll try to
talk the service owners into setting up a production shadow to hammer
on it with additional debugging.

We also disabled large folios for our 6.4 kernel, but Stefan actually
tracked that bug down:

commit a48d5bdc877b85201e42cef9c2fdf5378164c23a
Author: Stefan Roesch <shr@xxxxxxxxxxxx>
Date:   Mon Nov 6 10:19:18 2023 -0800

    mm: fix for negative counter: nr_file_hugepages

We didn't have time to revalidate with large folios back on afterwards.

-chris