Re: performance regression between 6.1.x and 5.15.x

On Wed, May 17, 2023 at 09:07:41PM +0800, Wang Yugui wrote:
> > This indicates that 35% of writeback submission CPU is in
> > __folio_start_writeback(), 13% is in folio_clear_dirty_for_io(), 8%
> > is in filemap_get_folios_tag() and only ~8% of CPU time is in the
> > rest of the iomap/XFS code building and submitting bios from the
> > folios passed to it.  i.e.  it looks a lot like writeback is
> > contending with the incoming write(), IO completion and memory
> > reclaim contexts for access to the page cache mapping and mm
> > accounting structures.
> > 
> > Unfortunately, I don't have access to hardware that I can use to
> > confirm this is the cause, but it doesn't look like it's directly an
> > XFS/iomap issue at this point. The larger batch sizes reduce both
> > memory reclaim and IO completion competition with submission, so it
> > kinda points in this direction.
> > 
> > I suspect we need to start using high order folios in the write path
> > where we have large user IOs for streaming writes, but I also wonder
> > if there isn't some sort of batched accounting/mapping tree updates
> > we could do for all the adjacent folios in a single bio....
> 
> 
> Is there any comment from Matthew Wilcox,
> since it seems to be a folio problem?

None of these are new "folio problems" - we've known about these
scalability limitations of page-based writeback caching for over 15
years. e.g. from 2006:

https://www.kernel.org/doc/ols/2006/ols2006v1-pages-177-192.pdf

The fundamental problem is the huge number of page cache objects
that buffered IO must handle when moving multiple GB/s to/from
storage devices. Folios offer a way to mitigate that by using large
folios in the write() path to reduce the number of page cache
objects, but we have not enabled that functionality yet.
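
To put rough numbers on it: streaming 10GB/s through 4kB pages means
churning through ~2.5 million page cache objects every second, each
with its own mapping tree and accounting updates. If the write()
path used 64kB folios for the same IO, that drops to roughly 150,000
objects per second, and 2MB folios would bring it under 5,000.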

If you want to look at making the iomap path and filemap_get_folio()
paths allocate high order folios, then that will largely mitigate
the worst of the performance degradation.
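
As a rough sketch of what that could look like; note that the
FGP_ORDER() flag and the helper itself are illustrative assumptions,
as mainline currently has no way to pass an order hint to
__filemap_get_folio(), and the mapping must have had
mapping_set_large_folios() called on it first:

/*
 * Sketch only: allocate a folio big enough to cover the current
 * write chunk instead of a single page.  FGP_ORDER() is
 * hypothetical; the other flags match what the iomap write path
 * already uses.  Assumes mapping_set_large_folios() has been
 * called for this inode.
 */
static struct folio *
iomap_get_write_folio(struct address_space *mapping, loff_t pos,
		size_t len)
{
	unsigned int order = 0;
	unsigned int fgp = FGP_LOCK | FGP_WRITE | FGP_CREAT |
			   FGP_STABLE | FGP_NOFS;

	if (len >= PAGE_SIZE)
		order = min_t(unsigned int, ilog2(len >> PAGE_SHIFT),
			      MAX_PAGECACHE_ORDER);
	fgp |= FGP_ORDER(order);	/* hypothetical order hint */

	return __filemap_get_folio(mapping, pos >> PAGE_SHIFT, fgp,
				   mapping_gfp_mask(mapping));
}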

Another possible avenue is to batch all the folio updates in the IO
completion path. We currently do them one folio at a time, so a
typical IO might be doing several dozen (or more) page cache updates
that could largely be done as a single update per IO. Worse, these
individual updates are typically done under exclusive locking, so
the locks are not only taken more frequently than they need to be,
they are also held longer than they need to be.
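
For reference, the completion side currently looks something like
the simplified loop below (the helper name is made up); every
folio_end_writeback() call takes the mapping and accounting locks
for a single folio, and that is the spot where a batched interface,
which does not exist today, would slot in:

/*
 * Simplified form of the iomap ioend completion walk.  Writeback
 * state is cleared one folio at a time, so the mapping tree and
 * writeback accounting are locked and unlocked once per folio.  A
 * batched variant would gather all the folios of the bio and clear
 * their writeback state under one round of locking (hypothetical).
 */
static void iomap_finish_bio_writeback(struct bio *bio, int error)
{
	struct folio_iter fi;

	bio_for_each_folio_all(fi, bio) {
		if (error)
			folio_set_error(fi.folio);
		folio_end_writeback(fi.folio);	/* per-folio locking */
	}
}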

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


