Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

Chris Mason <clm@xxxxxxxx> · Sat, 24 Feb 2024 17:57:43 -0500

On 2/24/24 2:11 PM, Linus Torvalds wrote:
> On Sat, 24 Feb 2024 at 10:20, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> If somebody really cares about this kind of load, and cannot use
>> O_DIRECT for some reason ("I actually do want caches 99% of the
>> time"), I suspect the solution is to have some slightly gentler way to
>> say "instead of the throttling logic, I want you to start my writeouts
>> much more synchronously".
>>
>> IOW, we could have a writer flag that still uses the page cache, but
>> that instead of that
>>
>>                 balance_dirty_pages_ratelimited(mapping);
> 
> I was *sure* we had had some work in this area, and yup, there's a
> series from 2019 by Konstantin Khlebnikov to implement write-behind.
> 
> Some digging in the lore archives found this
> 
>     https://lore.kernel.org/lkml/156896493723.4334.13340481207144634918.stgit@buzz/
> 
> but I don't remember what then happened to it.  It clearly never went
> anywhere, although I think something _like_ that is quite possibly the
> right thing to do (and I was fairly positive about the patch at the
> time).
> 
> I have this feeling that there's been other attempts of write-behind
> in this area, but that thread was the only one I found from my quick
> search.
> 
> I'm not saying Konstantin's patch is the thing to do, and I suspect we
> might want to actually have some way for people to say at open-time
> that "I want write-behind", but it looks like at least a starting
> point.
> 
> But it is possible that this work never went anywhere exactly because
> this is such a rare case. That kind of "write so much that you want to
> do something special" is often such a special thing that using
> O_DIRECT is generally the trivial solution.

For teams that really want more control over dirty pages with existing APIs,
I've suggested using sync_file_range periodically.  It seems to work
pretty well, and they can adjust the sizes and frequency as needed.
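
As a rough illustration of the periodic sync_file_range approach, here is a
minimal sketch, not code from any of those teams.  The helper name
write_behind() and the window parameter are hypothetical; the sketch assumes
a regular file written sequentially from offset 0.  After each full window it
starts async writeback of that window and then waits for the previous one,
which bounds the amount of dirty data the writer has outstanding.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes of buf to fd starting at offset 0,
 * pushing data out with sync_file_range() every `window` bytes.
 * Returns 0 on success, -1 on error (errno set by the failing call).
 */
int write_behind(int fd, const char *buf, size_t len, size_t window)
{
    off_t pos = 0;         /* next file offset to write */
    off_t win_start = 0;   /* start of the window being filled */
    off_t prev_start = 0;  /* previous window, already submitted */
    off_t prev_len = 0;

    assert(window > 0);

    while ((size_t)pos < len) {
        size_t want = len - (size_t)pos;
        size_t room = window - (size_t)(pos - win_start);
        if (want > room)
            want = room;

        ssize_t n = pwrite(fd, buf + pos, want, pos);
        if (n <= 0)
            return -1;
        pos += n;

        if ((size_t)(pos - win_start) < window)
            continue;      /* window not full yet */

        /* Start async writeback of the window we just filled. */
        if (sync_file_range(fd, win_start, pos - win_start,
                            SYNC_FILE_RANGE_WRITE) < 0)
            return -1;

        /* Wait for the previous window so dirty data stays bounded. */
        if (prev_len > 0 &&
            sync_file_range(fd, prev_start, prev_len,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER) < 0)
            return -1;

        prev_start = win_start;
        prev_len = pos - win_start;
        win_start = pos;
    }

    /* Any final partial window is left to normal background writeback. */
    return 0;
}
```

The window size is the knob for "sizes and frequency".  Note that
sync_file_range() is not a durability guarantee (it does not flush metadata),
so callers that need data integrity still want fsync() or fdatasync().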

Managing clean pages has been a problem with workloads that really care
about p99 allocation latency.  We've had issues where kswapd saturates a
core throwing away all the clean pages from either streaming readers or
writers.

To reproduce on 6.8-rc5, I did buffered IO onto a 6 drive raid0 via MD.
Max possible tput seems to be 8GB/s writes, and the box has 256GB of ram
across two sockets.  For buffered IO onto md0, we're hitting about
1.2GB/s, and have a core saturated by a kworker doing writepages.

From time to time, our random crud that maintains the system will need a
lot of memory and kswapd will saturate a core, but this tends to resolve
itself after 10-20 seconds.  Our ultra sensitive workloads would
complain, but they manage the page cache more explicitly to avoid these
situations.
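
This email does not say how those workloads manage the page cache.  One
common way, assumed here purely for illustration, is to flush a file and then
ask the kernel to drop its clean pages with posix_fadvise(POSIX_FADV_DONTNEED),
so kswapd has less to reclaim later.  The helper name drop_cached_range() is
hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush fd, then hint that the cached pages in
 * [offset, offset + len) are no longer needed.  len == 0 means "to EOF".
 * Returns 0 on success, -1 if the flush failed, or a positive errno
 * value from posix_fadvise().
 */
int drop_cached_range(int fd, off_t offset, off_t len)
{
    /*
     * POSIX_FADV_DONTNEED leaves pages that are not yet written out
     * untouched, so write the data back first.
     */
    if (fdatasync(fd) < 0)
        return -1;

    /* posix_fadvise() returns the error number directly, not -1. */
    return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}
```

This is only a hint: pages that are mapped or otherwise in use can stay
cached, and flushing the whole file each time is a blunt instrument for a
streaming workload.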

The raid0 is fast enough that we never hit the synchronous dirty page
limit.  fio is just 100% CPU bound, and when kswapd saturates a core,
it's just freeing clean pages.

With filesystems in use, kswapd and the writepages kworkers are better
behaved, which just makes me think writepages on blockdevices have seen
less optimization, not really a huge surprise.  Filesystems can push the
full 8GB/s tput either buffered or O_DIRECT.

With streaming writes to a small number of large files, total free
memory might get down to 1.5GB on the 256GB machine, with most of the
rest being clean page cache.

If I instead write to millions of 1MB files, free memory refuses to go
below 12GB, and kswapd doesn't misbehave at all.  We're still pushing
7GB/s writes.

Not a lot of conclusions, other than it's not that hard to use clean
page cache to make the system slower than some workloads are willing to
tolerate.

Ignoring wildly slow devices, the dirty limits seem to work well enough
on both big and small systems that I haven't needed to investigate
issues there as often.

Going back to Luis's original email, I'd echo Willy's suggestion for
profiles.  Unless we're saturating memory bandwidth, buffered should be
able to get much closer to O_DIRECT, just at a much higher overall cost.

-chris