On Sat, 24 Feb 2024 at 14:58, Chris Mason <clm@xxxxxxxx> wrote:
>
> For teams that really want more control over dirty pages with existing APIs,
> I've suggested using sync_file_range periodically. It seems to work
> pretty well, and they can adjust the sizes and frequency as needed.

Yes. I've written code like that myself.

That said, that is also fairly close to what the write-behind patches I
pointed at did.

One issue (and maybe that was what killed that write-behind patch) is
that there are *other* benchmarks that are actually slightly more
realistic that do things like "untar a tar-file, do something with it,
and then 'rm -rf' it all again". And *those* benchmarks behave best
when the IO is never ever actually done at all.

And unlike the "write a terabyte with random IO" case, those benchmarks
actually approximate a few somewhat real loads (I'm not claiming they
are good, but the "create files, do something, then remove them"
pattern at least _exists_ in real life).

For things like block device writes for a 'mkfs' run, the whole "this
file may be deleted soon, so let's not even start the write in the
first place" behavior doesn't exist, of course. Starting writeback much
more aggressively for those is probably not a bad idea.

> From time to time, our random crud that maintains the system will need a
> lot of memory and kswapd will saturate a core, but this tends to resolve
> itself after 10-20 seconds. Our ultra sensitive workloads would
> complain, but they manage the page cache more explicitly to avoid these
> situations.

You can see these things with slow USB devices with much more obvious
results. Including long spikes of total inactivity if some system piece
ends up doing a "sync" for some reason.

It happens. It's very annoying. My gut feel is that it happens a lot
less these days than it used to, but I suspect that's at least partly
because I don't see the slow USB devices very much any more.

> Ignoring wildly slow devices, the dirty limits seem to work well enough
> on both big and small systems that I haven't needed to investigate
> issues there as often.

One particular problem point used to be backing devices with wildly
different IO throughput, because I think the speed heuristics don't
necessarily always work all that well, at least initially.

And things like that may partly explain your "filesystems work better
than block devices" observation. It doesn't necessarily have to be
about filesystems vs block devices per se; it may instead be about
things like "on a filesystem, the bdi throughput numbers have had time
to stabilize".

In contrast, with a benchmark that uses some other random device that
doesn't look like a regular disk (whether it's really slow like a bad
USB device, or really fast like pmem), you might see more issues. And I
wouldn't be in the least surprised if that is part of the situation
Luis sees.

              Linus
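
As a rough illustration of the periodic sync_file_range pattern
discussed above: a minimal sketch only, not the actual code referred to
in the mail. The chunk size, the function name and the omitted error
handling are all just for the example.

	/*
	 * Sketch of "write-behind by hand": write in fixed-size chunks,
	 * kick off async writeback for the chunk just written, and wait
	 * for the chunk before that, so dirty pages never pile up past
	 * roughly two chunks. Chunk size is an arbitrary example value.
	 */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	#define CHUNK	(8UL << 20)	/* example: 8 MB per chunk */

	static void write_with_writebehind(int fd, const char *buf, size_t len)
	{
		off_t off = 0;

		while (len > 0) {
			size_t n = len < CHUNK ? len : CHUNK;

			if (write(fd, buf + off, n) != (ssize_t)n)
				return;		/* error handling elided */

			/* start async writeback of the chunk we just wrote */
			sync_file_range(fd, off, n, SYNC_FILE_RANGE_WRITE);

			/* wait for the previous chunk, staying ~one chunk ahead */
			if (off >= (off_t)CHUNK)
				sync_file_range(fd, off - CHUNK, CHUNK,
						SYNC_FILE_RANGE_WAIT_BEFORE |
						SYNC_FILE_RANGE_WRITE |
						SYNC_FILE_RANGE_WAIT_AFTER);

			off += n;
			len -= n;
		}
	}

The "sizes and frequency" knobs mentioned in the quoted mail correspond
to the chunk size and how far behind the waiting call trails the
writes; some variants also add posix_fadvise(POSIX_FADV_DONTNEED) on
completed ranges to drop the now-clean pages.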