Re: [PATCH v4 0/9] Create large folios in iomap buffered write path

On Mon, Jul 10, 2023 at 03:55:17PM -0700, Luis Chamberlain wrote:
> On Mon, Jul 10, 2023 at 02:02:44PM +0100, Matthew Wilcox (Oracle) wrote:
> > Commit ebb7fb1557b1 limited the length of ioend chains to 4096 entries
> > to improve worst-case latency.  Unfortunately, this had the effect of
> > limiting the performance of:
> > 
> > fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 \
> >         -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 \
> >         -numjobs=4 -directory=/mnt/test
> 
> When you say performance, do you mean overall throughput / IOPS /
> latency or all?
> 
> And who noticed it / reported it?

https://lore.kernel.org/linux-xfs/20230508172406.1CF3.409509F4@xxxxxxxxxxxx/

> The above incantation seems pretty
> specific, so I'm curious who runs that test and what sort of
> workflow it is trying to replicate.

Not specific at all. It's just a basic concurrent sequential
buffered write performance test. It needs multiple jobs to max out
typical cheap PCIe 4.0 NVMe SSD storage (i.e. 6-8GB/s) because
sequential non-async buffered writes are CPU bound at (typically)
2-3GB/s per file write.

> > The problem ends up being lock contention on the i_pages spinlock as we
> > clear the writeback bit on each folio (and propagate that up through
> > the tree).  By using larger folios, we decrease the number of folios
> > to be processed by a factor of 256 for this benchmark, eliminating the
> > lock contention.
> 
> Implied here seems to be that the associated cost of searching for a
> larger folio is pretty negligible compared to the gains of finding
> one. That seems nice, but it gets me wondering if there are other
> benchmarks under which there are any penalties instead.
> 
> Ie, is the above a microbenchmark where this yields good results?

No, the workload gains are general - they avoid the lock contention
problems involved with cycling, accounting and state changes for
millions of objects (order-0 folios) a second through a single
exclusive lock (mapping tree lock) by reducing the mapping tree lock
cycling by a couple of orders of magnitude.
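
As a rough back-of-the-envelope illustration (my numbers, not anything
measured in the series): the 32GiB fio job above pushes ~8.4 million
order-0 folios through the mapping tree lock on writeback completion
alone, versus ~33 thousand folios at 1MiB (order-8) - the factor-of-256
reduction the cover letter refers to. A trivial userspace sketch of
that arithmetic:

/*
 * Illustrative only: count how many page cache folios - and hence
 * mapping tree lock round trips on writeback completion - a 32GiB
 * sequential write generates at different folio orders.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long long write_bytes = 32ULL << 30;	/* 32GiB, as in the fio job */
	const unsigned long long base_page = 4096;		/* 4kB base page */
	const int orders[] = { 0, 8 };				/* 4kB vs 1MiB folios */

	for (int i = 0; i < 2; i++) {
		unsigned long long folio_bytes = base_page << orders[i];
		unsigned long long nr_folios = write_bytes / folio_bytes;

		printf("order-%d (%llukB folios): %llu folios per 32GiB write\n",
		       orders[i], folio_bytes >> 10, nr_folios);
	}
	return 0;
}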

> > It's also the right thing to do.  This is a project that has been on
> > the back burner for years, it just hasn't been important enough to do
> > before now.
> 
> Commit ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths
> in writeback") dates back just one year, and so it gets me wondering
> how a project on the back burner for years now finds motivation in
> just a one-year-old regression.

The change in ebb7fb1557b1 is just the straw that broke the camel's
back. It got rid of the massive IO batch processing we used to
minimise the inherent cross-process mapping tree contention in
buffered writes. i.e. the process doing write() calls, multiple
kswapds doing memory reclaim, writeback submission and writeback
completion all contend at the same time for the mapping tree lock.

We largely removed the IO submission and completion from the picture
with huge batch processing, but that then started causing latency
problems with IO completion processing. So we went back to smaller
chunks of IO submission and completion, and that means we went from
3 threads contending on the mapping tree lock to 5 threads. And that
drove the system into catastrophic lock breakdown on the mapping
tree lock.

And so -everything- then went really slow because each write() task
burns down at least 5 CPUs on the mapping tree lock....

This is not an XFS issue to solve - this is a general page cache
problem and we've always wanted to fix it in the page cache, either
by large folio support or by large block size support that required
aligned high-order pages in the page cache. Same solution - fewer
pages to iterate - but different methods...
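
To make the "fewer pages to iterate" point concrete, here is a purely
conceptual sketch using generic page cache helpers - it is not the
code from this patch series - showing that one order-N folio replaces
2^N order-0 folios in the mapping tree, so each dirty/writeback state
change and each i_pages lock acquisition happens once per large folio
rather than once per base page:

/*
 * Conceptual sketch only, not from this patch series: add a single
 * order-N folio to the page cache in place of 2^N order-0 folios.
 * Error handling for an already-populated range etc. is elided.
 */
#include <linux/pagemap.h>
#include <linux/gfp.h>
#include <linux/err.h>

static struct folio *add_large_folio(struct address_space *mapping,
				     pgoff_t index, unsigned int order)
{
	struct folio *folio;
	int err;

	/* the folio must be naturally aligned within the file */
	index = round_down(index, 1UL << order);

	folio = folio_alloc(GFP_KERNEL, order);
	if (!folio)
		return ERR_PTR(-ENOMEM);

	err = filemap_add_folio(mapping, folio, index, GFP_KERNEL);
	if (err) {
		folio_put(folio);
		return ERR_PTR(err);
	}
	return folio;
}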

> What was the original motivation of the older project dating this
> effort back to its inception?

https://www.kernel.org/doc/ols/2006/ols2006v1-pages-177-192.pdf

That was run on a 64kB page size machine (itanic), but there were
signs that the mapping tree lock would be an issue in future.
Indeed, when these large SSI supercomputers moved to x86-64 (down to
4kB page size) a couple of years later, the mapping tree lock popped
up to the top of the list of buffered write throughput limiting
factors.

i.e. the larger the NUMA distances between the workload doing the
write and the node the mapping tree lock is located on, the slower
buffered writes go and the more CPU we burnt on the mapping tree
lock. We carefully worked around that with cpusets and locality
control, and the same was then done in HPC environments on these
types of machines and hence it wasn't an immediate limiting
factor...

But we're talking about multi-million dollar supercomputers here,
and in most cases people just rewrote the apps to use direct IO and
so the problem just went away and the HPC apps could easily use all
the performance the storage provided....

IOWs, we've known about this problem for over 15 years, but the
difference is now that consumer hardware is capable of >10GB/s write
speeds (current PCIe 5.0 nvme ssds) for just a couple of hundred
dollars, rather than multi-million dollar machines in carefully
optimised environments that we first saw it on back in 2006...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


