On Sat, 24 Feb 2024 at 14:58, Chris Mason <clm@xxxxxxxx> wrote:
>
> For teams that really want more control over dirty pages with existing APIs,
> I've suggested using sync_file_range periodically. It seems to work
> pretty well, and they can adjust the sizes and frequency as needed.

Yes. I've written code like that myself.

That said, that is also fairly close to what the write-behind patches I
pointed at did.

One issue (and maybe that was what killed that write-behind patch) is
that there are *other* benchmarks that are actually slightly more
realistic that do things like "untar a tar-file, do something with it,
and then 'rm -rf' it all again". And *those* benchmarks behave best
when the IO is never ever actually done at all.

And unlike the "write a terabyte with random IO" case, those benchmarks
actually approximate a few somewhat real loads (I'm not claiming they
are good, but the "create files, do something, then remove them"
pattern at least _exists_ in real life).

For things like block device writes for a 'mkfs' run, the whole "this
file may be deleted soon, so let's not even start the write in the
first place" behavior doesn't exist, of course. Starting writeback much
more aggressively for those is probably not a bad idea.

> From time to time, our random crud that maintains the system will need a
> lot of memory and kswapd will saturate a core, but this tends to resolve
> itself after 10-20 seconds. Our ultra sensitive workloads would
> complain, but they manage the page cache more explicitly to avoid these
> situations.

You can see these things with slow USB devices with much more obvious
results. Including long spikes of total inactivity if some system piece
ends up doing a "sync" for some reason.

It happens. It's very annoying. My gut feel is that it happens a lot
less these days than it used to, but I suspect that's at least partly
because I don't see the slow USB devices very much any more.

> Ignoring wildly slow devices, the dirty limits seem to work well enough
> on both big and small systems that I haven't needed to investigate
> issues there as often.

One particular problem point used to be backing devices with wildly
different IO throughput, because I think the speed heuristics don't
necessarily always work all that well, at least initially.

And things like that may partly explain your "filesystems work better
than block devices" observation. It doesn't necessarily have to be
about filesystems vs block devices per se; it may instead be about
things like "on a filesystem, the bdi throughput numbers have had time
to stabilize".

In contrast, with a benchmark that uses some other random device that
doesn't look like a regular disk (whether it's really slow like a bad
USB device, or really fast like pmem), you might see more issues. And I
wouldn't be in the least surprised if that is part of the situation
Luis sees.

              Linus
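
As a rough illustration of the periodic sync_file_range pattern
discussed above: a minimal sketch only, not the actual code referred to
in the mail. The chunk size, the function name and the omitted error
handling are all just for the example.

	/*
	 * Sketch of "write-behind by hand": write in fixed-size chunks,
	 * kick off async writeback for the chunk just written, and wait
	 * for the chunk before that, so dirty pages never pile up past
	 * roughly two chunks. Chunk size is an arbitrary example value.
	 */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <unistd.h>

	#define CHUNK	(8UL << 20)	/* example: 8 MB per chunk */

	static void write_with_writebehind(int fd, const char *buf, size_t len)
	{
		off_t off = 0;

		while (len > 0) {
			size_t n = len < CHUNK ? len : CHUNK;

			if (write(fd, buf + off, n) != (ssize_t)n)
				return;		/* error handling elided */

			/* start async writeback of the chunk we just wrote */
			sync_file_range(fd, off, n, SYNC_FILE_RANGE_WRITE);

			/* wait for the previous chunk, staying ~one chunk ahead */
			if (off >= (off_t)CHUNK)
				sync_file_range(fd, off - CHUNK, CHUNK,
						SYNC_FILE_RANGE_WAIT_BEFORE |
						SYNC_FILE_RANGE_WRITE |
						SYNC_FILE_RANGE_WAIT_AFTER);

			off += n;
			len -= n;
		}
	}

The "sizes and frequency" knobs mentioned in the quoted mail correspond
to the chunk size and how far behind the waiting call trails the
writes; some variants also add posix_fadvise(POSIX_FADV_DONTNEED) on
completed ranges to drop the now-clean pages.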