Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

On Fri, 23 Feb 2024 at 20:12, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> >  What are the limits to buffered IO
> > and how do we test that? Who keeps track of it?
>
> TLDR: Why does the pagecache suck?

What? No.

Our page cache is so good that the question is literally "what are the
limits of it", and "how we would measure them".

That's not a sign of suckage.

When you have to construct completely unrealistic loads that nobody
would actually care about in reality just to get a number for the
limit, that's not a sign of problems.

Or rather, the "problem" is the person looking at a stupid load, and
going "we need to improve this because I can write a benchmark for
this".

Here's a clue: a hardware discussion forum I visit was arguing about
memory latencies, and talking about how literally 85% of their
measured DRAM access latency was on the CPU side, not the DRAM side.

Guess what? It's because the CPU in question has quite a bit of L3,
it's spread out, and the CPU doesn't even start the memory access
before it has checked the caches.

And here's a big honking clue: only a complete nincompoop and mentally
deficient rodent would look at that and say "caches suck".

> >  ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64
> >      Vs
> >  ~ 7,000 MiB/s with buffered IO
>
> Profile?  My guess is that you're bottlenecked on the xa_lock between
> memory reclaim removing folios from the page cache and the various
> threads adding folios to the page cache.

I doubt it's the locking.

In fact, for writeout in particular it's probably not even the page
cache at all.

For writeout, we have a very traditional problem: we care about
latency a million times more than we care about throughput, because
nobody ever actually cares all that much about the performance of
huge writes.

Ask yourself when you last *really* sat there waiting for writes,
unless it was some dog-slow USB device that writes at 100kB/s?

The main situation where people care about cached write performance
(ignoring silly benchmarks) tends to be when you create files, where
the directory entry ordering means that the bottleneck is a lot of
small writes, and their *ordering* and their latency.

And then the issue is basically never the page cache, but the
filesystem ordering of the metadata writes against each other and
against the page writeout.

Why? Because on all but a *minuscule* percentage of loads, all the
actual data writes are quite gracefully taken by the page cache
completely asynchronously, and nobody ever cares about the writeout
latencies.

Now, the benchmark that Luis highlighted is a completely different
class of historical problems that has been around forever, namely the
"fill up lots of memory with dirty data".

And there - because the problem is easy to trigger, but nobody tends
to care deeply about throughput because they care much much *MUCH*
more about latency - we have a rather stupid big-hammer approach.

It's called "vm_dirty_bytes".

Well, that's the knob (not the only one). The actual logic around it
is then quite the morass of turning that into the
dirty_throttle_control, and the per-bdi dirty limits that try to take
the throughput of the backing device into account, etc etc.

And then all those heuristics are used to actually LITERALLY PAUSE the
writer. We literally have this code:

                __set_current_state(TASK_KILLABLE);
                bdi->last_bdp_sleep = jiffies;
                io_schedule_timeout(pause);

in balance_dirty_pages(), which is all about saying "I'm putting you
to sleep, because I judge you to have dirtied so much memory that
you're making things worse for others".

And a lot of *that* is then because we haven't wanted everybody to
rush in and start their own synchronous writeback, but instead want
all writeback to be done by somebody else. So now we move from
mm/page-writeback.c to fs/fs-writeback.c, and all the work-queues to
do dirty writeout.

Notice how the io_schedule_timeout() above doesn't even get woken up
by IO completing. Nope. The "you have written too much" logic
literally pauses the writer, and doesn't even want to wake it up when
there is no more dirty data.
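
If somebody actually wants to *see* this from userspace, here's a
hedged little sketch: buffered-write a pile of data and time every
write() call. Once you blow past the dirty limits, some of the calls
stall for many milliseconds in exactly that balance_dirty_pages()
pause. The file path and sizes below are made up for illustration.

    /* Hedged sketch: watch dirty throttling happen. Write buffered
     * data in a loop and time each write(); once the dirty limits
     * are exceeded, some calls stall in balance_dirty_pages() and
     * the per-call latency jumps. Path and sizes are made up. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
            const size_t chunk = 1 << 20;        /* 1 MiB per write() */
            char *buf = malloc(chunk);
            int fd = open("/tmp/dirty-test.dat", /* hypothetical path */
                          O_WRONLY | O_CREAT | O_TRUNC, 0644);
            struct timespec t0, t1;
            int i;

            if (!buf || fd < 0)
                    return 1;
            memset(buf, 0xab, chunk);

            for (i = 0; i < 8192; i++) {         /* ~8 GiB of dirty data */
                    clock_gettime(CLOCK_MONOTONIC, &t0);
                    if (write(fd, buf, chunk) != (ssize_t)chunk)
                            break;
                    clock_gettime(CLOCK_MONOTONIC, &t1);

                    long us = (t1.tv_sec - t0.tv_sec) * 1000000L +
                              (t1.tv_nsec - t0.tv_nsec) / 1000L;
                    if (us > 10000)              /* report stalls > 10 ms */
                            printf("write %d stalled for %ld us\n", i, us);
            }

            close(fd);
            free(buf);
            return 0;
    }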

So the "you went over the dirty limits It's a penalty box, and all of
this comes from "you are doing something that is abnormal and that
disturbs other people, so you get an unconditional penalty". Yes, the
timeout is then obviously tied to how much of a problem the dirtying
is (based on that whole "how fast is the device") but it's purely a
heuristic.

And (one) important part here is "nobody sane does that".  So
benchmarking this is a bit crazy. The code is literally meant for bad
actors, and what you are benchmarking is the kernel telling you "don't
do that then".

And absolutely *NONE* of this all has anything to do with the page cache. NADA.

And yes, there are literally thousands of lines of code all explicitly
designed to "slow down writers" and make it at least somewhat graceful
and gradual.

That's pretty much all mm/page-writeback.c does (yes, that file *also*
does have the "start/end folio writeback" functions, but they are only
a small part of it, even if that's obviously the origin of the file -
the writeback throttling logic has just grown a lot more).

               Linus