On 10/11/24 5:08 AM, Christian Theune wrote:
>
>> On 11. Oct 2024, at 09:27, Christian Theune <ct@xxxxxxxxxxxxxxx> wrote:
>>
>> I’m going to gather a few more instances during the day and will post them as a batch later.
>
> I’ve received 8 alerts in the last hours and managed to get detailed, repeated walker output from two of them:
>
> - FC-41287.log
> - FC-41289.log

These are really helpful.

If io throttling were the cause, the traces should also have a process that's waiting to submit the IO, but that's not present here.

Another common pattern is hung tasks with a process stuck in the kernel burning CPU, but holding a lock or being somehow responsible for waking the hung task. Your process listings don't have that either.

One part I wanted to mention:

[820710.974122] Future hung task reports are suppressed, see sysctl kernel.hung_task_warnings

By default you only get 10 or so hung task notifications per boot, and after that they are suppressed. So for example, if you're watching a count of hung task messages across a lot of machines and thinking that things are pretty stable because you're not seeing hung task messages anymore...the kernel might have just stopped complaining.

This isn't exactly new kernel behavior, but it can be a surprise.

Anyway, this leaves me with ~3 theories:

- Linus's starvation observation. It doesn't feel like there's enough load to cause this, especially given us sitting in truncate, where it should be pretty unlikely to have multiple procs banging on the page in question.

- Willy's folio->mapping check idea. I _think_ this is also wrong, the reference counts we have in the truncate path check folio->mapping before returning, and we shouldn't be able to reuse the folio in a different mapping while we have the reference held.

  If this is the problem it would mean our original bug is slightly unfixed. But the fact that you're not seeing other problems, and these hung tasks do resolve should mean we're ok. We can add a printk or just run a drgn script to check.

- It's actually taking the IO a long time to finish. We can poke at the pending requests, how does the device look in the VM? (virtio, scsi etc).

-chris
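
For the hung_task_warnings suppression mentioned above, a minimal sketch of how one might check a host before trusting an absence of hung task messages. It only reads the standard procfs file behind the kernel.hung_task_warnings sysctl; the helper name describe_hung_task_warnings is made up for illustration and is not part of any kernel or library API. Per the kernel sysctl documentation, the value counts down by one per report, 0 means no more reports, and -1 means unlimited.

```python
#!/usr/bin/env python3
"""Check whether hung task reports may already be suppressed on this host."""
from pathlib import Path

# Standard procfs location of the kernel.hung_task_warnings sysctl.
SYSCTL_PATH = Path("/proc/sys/kernel/hung_task_warnings")


def describe_hung_task_warnings(raw: str) -> str:
    """Interpret the raw sysctl value (hypothetical helper, not a kernel API)."""
    remaining = int(raw.strip())
    if remaining < 0:
        # -1 asks the kernel to report an unlimited number of hung tasks.
        return "unlimited: reports are never suppressed"
    if remaining == 0:
        # The per-boot budget is used up; the kernel has gone quiet.
        return "exhausted: further hung task reports are suppressed"
    return f"{remaining} report(s) left before suppression"


def main() -> None:
    try:
        raw = SYSCTL_PATH.read_text()
    except OSError as exc:
        # The file only exists when the kernel has CONFIG_DETECT_HUNG_TASK.
        print(f"cannot read {SYSCTL_PATH}: {exc}")
        return
    print(f"kernel.hung_task_warnings: {describe_hung_task_warnings(raw)}")


if __name__ == "__main__":
    main()
```

If a machine reports "exhausted", a missing hung task message there says nothing about stability. Running `sysctl -w kernel.hung_task_warnings=-1` as root lifts the limit until the next boot.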