Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Sat, 24 Feb 2024 10:20:14 -0800

On Sat, 24 Feb 2024 at 09:31, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> And (one) important part here is "nobody sane does that".  So
> benchmarking this is a bit crazy. The code is literally meant for bad
> actors, and what you are benchmarking is the kernel telling you "don't
> do that then".

Side note: one reason why the big hammer approach of "don't do that"
has worked so well is that the few loads that *do* want to do this and
have a valid reason to write large amounts of data in one go are
generally trivially translated to O_DIRECT.

For example, if you actually do things like write disk images etc,
O_DIRECT is lovely and easy - even trivial - to use. You don't even
have to write code for it, you can (and people do) just use 'dd' with
'oflag=direct'. So even trivial shell scripting has access to the
"don't do that then" flag.

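For cases where 'dd' is not enough, a minimal C sketch of the same idea
might look like the following. The helper name write_image_direct(), the
4096-byte alignment and the 1 MiB chunk size are assumptions made for
illustration and do not come from this thread. O_DIRECT needs the buffer,
the length and the file offset to be suitably aligned, and the correct
alignment depends on the device and filesystem.

```c
#define _GNU_SOURCE /* exposes O_DIRECT on glibc */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Assumed values: real code should query the device/filesystem. */
#define DIO_ALIGN ((size_t)4096)
#define DIO_CHUNK ((size_t)1 << 20)

/*
 * Write 'total' bytes of filler data to 'path', bypassing the page cache.
 * 'total' must be a multiple of DIO_ALIGN. Returns 0 on success, -1 on error.
 */
static int write_image_direct(const char *path, size_t total)
{
	void *buf;
	size_t done = 0;
	int ret = -1;
	int fd;

	if (total % DIO_ALIGN)
		return -1;
	if (posix_memalign(&buf, DIO_ALIGN, DIO_CHUNK))
		return -1;
	memset(buf, 0xab, DIO_CHUNK);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
	if (fd < 0)
		goto out_free;

	while (done < total) {
		size_t left = total - done;
		size_t len = left < DIO_CHUNK ? left : DIO_CHUNK;
		ssize_t n = write(fd, buf, len);

		if (n <= 0)
			goto out_close;
		done += (size_t)n;
	}
	ret = 0;
out_close:
	if (close(fd))
		ret = -1;
out_free:
	free(buf);
	return ret;
}
```

A caller would use something like write_image_direct("disk.img", 64 << 20).
Some filesystems reject O_DIRECT at open() time, in which case this sketch
simply returns -1.
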
In other words, I really think that Luis' benchmark triggers that
kernel "you are doing something actively wrong and stupid" logic. It's
not the kernel trying to optimize writeback. It's the kernel trying to
protect others from stupid loads.

Now, I'm also not saying that you should benchmark this with our
"vm_dirty_bytes" logic disabled. That may indeed help performance on
that benchmark, but you'll just hit other problem spots instead. Once
you fill up lots of memory, other problems become really big and
nasty, so you would then need *other* fixes for those issues.

If somebody really cares about this kind of load, and cannot use
O_DIRECT for some reason ("I actually do want caches 99% of the
time"), I suspect the solution is to have some slightly gentler way to
say "instead of the throttling logic, I want you to start my writeouts
much more synchronously".

IOW, we could have a writer flag that still uses the page cache, but
instead of that

                balance_dirty_pages_ratelimited(mapping);

in generic_perform_write(), it would actually synchronously *start*
the write. That might work a whole lot better for any load that still
wants to do big streaming writes, but also wants to keep the page
cache component.
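
No such writer flag exists in this thread, so the following is only a
userspace approximation of the idea, not the kernel change described
above: it does ordinary buffered writes and then asks the kernel to start
writeout of each chunk with sync_file_range(SYNC_FILE_RANGE_WRITE). The
helper name stream_write_eager() and the 1 MiB chunk size are assumptions
made for illustration.

```c
#define _GNU_SOURCE /* exposes sync_file_range() on glibc */
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define STREAM_CHUNK ((size_t)1 << 20) /* assumed chunk size */

/*
 * Buffered streaming write of 'total' bytes of filler data to 'path'.
 * After each chunk, start writeout of the pages just dirtied, without
 * waiting for it to complete. Returns 0 on success, -1 on error.
 */
static int stream_write_eager(const char *path, size_t total)
{
	static char buf[STREAM_CHUNK];
	off_t off = 0;
	int ret = -1;
	int fd;

	memset(buf, 0xcd, sizeof(buf));
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	while ((size_t)off < total) {
		size_t left = total - (size_t)off;
		size_t len = left < STREAM_CHUNK ? left : STREAM_CHUNK;
		ssize_t n = write(fd, buf, len);

		if (n <= 0)
			goto out;
		/* Initiate writeout for [off, off + n); data stays cached. */
		if (sync_file_range(fd, off, n, SYNC_FILE_RANGE_WRITE))
			goto out;
		off += n;
	}
	ret = 0;
out:
	if (close(fd))
		ret = -1;
	return ret;
}
```

The data stays in the page cache for later reads, but dirty pages get
pushed toward the disk as they are produced rather than piling up until
the throttling logic kicks in. Note that sync_file_range() gives no data
integrity guarantees, so this is about pacing, not durability.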

                Linus