Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

On Sat, Feb 24, 2024 at 09:31:44AM -0800, Linus Torvalds wrote:
> On Fri, 23 Feb 2024 at 20:12, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > On Fri, Feb 23, 2024 at 03:59:58PM -0800, Luis Chamberlain wrote:
> > >  What are the limits to buffered IO
> > > and how do we test that? Who keeps track of it?
> >
> > TLDR: Why does the pagecache suck?
> 
> What? No.
> 
> Our page cache is so good that the question is literally "what are the
> limits of it", and "how we would measure them".
> 
> That's not a sign of suckage.
> 
> When you have to have completely unrealistic loads that nobody would
> actually care about in reality just to get a number for the limit,
> it's not a sign of problems.
> 
> Or rather, the "problem" is the person looking at a stupid load, and
> going "we need to improve this because I can write a benchmark for
> this".
> 
> Here's a clue: a hardware discussion forum I visit was arguing about
> memory latencies, and talking about how their measured overhead of
> DRAM latency was literally 85% on the CPU side, not the DRAM side.
> 
> Guess what? It's because the CPU in question had quite a bit of L3,
> and it was spread out, and the CPU doesn't even start the memory
> access before it has checked caches.
> 
> And here's a big honking clue: only a complete nincompoop and mentally
> deficient rodent would look at that and say "caches suck".
> 
> > >  ~86 GiB/s on pmem DIO on xfs with 64k block size, 1024 XFS agcount on x86_64
> > >      Vs
> > >  ~ 7,000 MiB/s with buffered IO
> >
> > Profile?  My guess is that you're bottlenecked on the xa_lock between
> > memory reclaim removing folios from the page cache and the various
> > threads adding folios to the page cache.
> 
> I doubt it's the locking.
> 
> In fact, for writeout in particular it's probably not even the page
> cache at all.
> 
> For writeout, we have a very traditional problem: we care a million
> times more about latency than we care about throughput,
> because nobody ever actually cares all that much about performance of
> huge writes.

Before large folios, we had people very much bottlenecked by 4k page
overhead on sequential IO; my customer/sponsor was one of them.

Factor of 2 or 3, IIRC; it was _bad_. And when you looked at the
profiles and at the filemap.c code it wasn't hard to see why: we'd walk
a radix tree, do an atomic op (get the page), then do a 4k usercopy...
hence the work I did to break up generic_file_buffered_read() and
vectorize it, which was a huge improvement.
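
To make that concrete, the shape of the change was roughly this - just
a sketch from memory, not the actual filemap.c code; lookup_one_folio()
and lookup_folio_batch() below are stand-ins for the real helpers:

	/*
	 * Old shape: pay the tree walk, the refcount and a 4k usercopy
	 * for every single page:
	 */
	while (iov_iter_count(iter)) {
		folio = lookup_one_folio(mapping, index);	/* tree walk + get ref */
		copy_folio_to_iter(folio, 0, PAGE_SIZE, iter);	/* 4k usercopy */
		folio_put(folio);
		index++;
	}

	/*
	 * Vectorized shape: one tree walk fills a batch, then the
	 * usercopies run back to back:
	 */
	while (iov_iter_count(iter)) {
		nr = lookup_folio_batch(mapping, index, &fbatch);
		for (i = 0; i < nr; i++)
			copy_folio_to_iter(fbatch.folios[i], 0,
					   folio_size(fbatch.folios[i]), iter);
		folio_batch_release(&fbatch);
		index += nr;
	}

The per-page fixed costs get amortized over the whole batch instead of
being paid once per 4k.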

It's definitely less of a factor now that we have large folios, and when
we're talking about workloads that don't fit in cache, but I always
wanted to do a generic version of the vectorized write path that btrfs
and bcachefs have.
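
Roughly what I mean by a vectorized write path, again just sketching
the shape rather than the btrfs or bcachefs code (grab_folios_for_write()
is a made-up name, and the other helper names are from memory):

	/*
	 * Instead of generic_perform_write()'s prepare/copy/commit loop
	 * per page, pin the whole range up front, copy in one pass, then
	 * dirty and release the batch:
	 */
	nr = grab_folios_for_write(mapping, pos, len, &fbatch);	/* lock + reserve */

	for (i = 0; i < nr; i++) {
		struct folio *folio = fbatch.folios[i];
		size_t bytes = min(len, folio_size(folio) - offset);

		copied = copy_folio_from_iter_atomic(folio, offset, bytes, iter);
		len -= copied;
		offset = 0;
	}

	for (i = 0; i < nr; i++) {
		folio_mark_dirty(fbatch.folios[i]);
		folio_unlock(fbatch.folios[i]);
		folio_put(fbatch.folios[i]);
	}

Same idea as on the read side: the locking and dirtying overhead is
paid once per write instead of once per page.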



