On Sun, Feb 25, 2024 at 12:18:23AM -0500, Kent Overstreet wrote:
> Before large folios, we had people very much bottlenecked by 4k page
> overhead on sequential IO; my customer/sponsor was one of them.
>
> Factor of 2 or 3, IIRC; it was _bad_.  And when you looked at the
> profiles and looked at the filemap.c code it wasn't hard to see why;
> we'd walk a radix tree, do an atomic op (get the page), then do a 4k
> usercopy... hence the work I did to break up
> generic_file_buffered_read() and vectorize it, which was a huge
> improvement.

There's also the small random 64 byte read case that we haven't
optimised for yet.  That also bottlenecks on the page refcount atomic
op.  The proposed solution to that was double-copy: look up the page
without bumping its refcount, copy to a buffer, look up the page again
to be sure it's still there, then copy from the buffer to userspace.

Except that can go wrong under really unlikely circumstances.  Look up
the page, page gets freed, page gets reallocated to slab, we copy
sensitive data from it, page gets freed again, page gets reallocated to
the same spot in the file (!), lookup says "yup, the same page is
there".  We'd need a seqcount or something to be sure the page hasn't
moved.