On Sun, 25 Feb 2024 at 18:49, Al Viro <viro@xxxxxxxxxx> wrote:
>
> Uh? __generic_file_write_iter() doesn't take inode lock, but
> generic_file_write_iter() does.

Yes, as I replied to Kent later, I misremembered the details
(together with a too-quick grep). It's the reading side that just
doesn't care, and which causes us to not actually give the POSIX
guarantees anyway.

> O_APPEND handling aside, there's
> also file_remove_privs() in there and it does need that lock.

Fair enough, but that's such a rare special case that it really
doesn't merit taking the lock on every write.

What at least the ext4 DIO code does is something like "look at the
write position and the inode size, and decide optimistically whether
to take the lock shared or not". Then it re-checks after potentially
taking it shared (in case the inode size has changed), and might
decide to go for the write lock after all.

And I think we could fairly trivially do something similar on the
regular write side. Yes, it would need to check the SUID/SGID bits in
addition to the inode size change, I guess (I don't think the ext4
DIO code does, but my quick grepping might once again be incomplete).

Anyway, DaveC is obviously also right that for the "actually need to
do writeback" case, our writeback is currently intentionally
throttled, which is why the benchmark by Luis shows that "almost two
orders of magnitude" slowdown with buffered writeback. That's
probably mainly an effect of having a backing store with no delays,
but still..

However, the reason I dislike our write-side locking is that it
serializes even the entirely cached case. Now, writing concurrently
to the same inode is obviously strange, but it's a common thing for
databases. And while all the *serious* ones use DIO, I think the
buffered case really should do better.

Willy - tangential side note: I looked closer at the issue that you
reported (indirectly) with the small reads during heavy write
activity.
Our _reading_ side is very optimized and has none of the write-side
oddities that I can see: we just have

    filemap_read ->
      filemap_get_pages ->
        filemap_get_read_batch ->
          folio_try_get_rcu()

and there is no page locking or other locking involved (assuming the
page is cached and marked uptodate etc, of course).

So afaik, it really is just that *one* atomic access (and the
matching page ref decrement afterwards).

We could easily do all of this without getting any ref to the page at
all if we did the page cache release with RCU (and the user copy with
"copy_to_user_atomic()").

Honestly, anything else looks like a complete disaster. For tiny
reads, a temporary buffer sounds ok, but really *only* for tiny reads
where we could have that buffer on the stack.

Are tiny reads (handwaving: 100 bytes or less) really worth
optimizing for to that degree?

In contrast, the RCU-delaying of the page cache might be a good idea
in general. We've had other situations where that would have been
nice. The main worry would be low-memory situations, I suspect.

The "tiny read" optimization smells like a benchmark thing to me.
Even with the cacheline possibly bouncing, the system call overhead
for tiny reads (particularly with all the mitigations) should be
orders of magnitude higher than two atomic accesses.

           Linus