Topic: Data writeback scalability

Scope: Performance
       Generic writeback infrastructure limitations
       Buffered IO sucks

Proposal:

This is largely a "how do we solve this" discussion topic.

In running small file writeback tests recently, I've realised that the single biggest performance limitation is the single threaded BDI flush worker that builds data IOs from the page cache of dirty inodes. XFS does delayed allocation, so writeback overhead is largely determined by how many physical allocations need to be done to write back the dirty data. It turns out that it's not that many.

The BDI flusher becomes completely CPU bound at about 80-90k allocations per second, which means we are limited to writing back data to that many separate regions per second. If we are writing back small files (say 4kB each), then the generic data writeback infrastructure cannot clean more than 100k files/s. I've got a device that can do 1.6M random 4kB write IOPS with aio+dio, yet I can't use more than about 5% of that capacity through the page cache....

So I gamed the system: I set an extent size hint of 4kB on the root directory so that delayed allocation is never used. That got me to 160k files/s at about 40k IOPS, with the flusher thread about 70% busy. Everything was still blocking in balance_dirty_pages_ratelimited(), so there's still a huge amount of IO performance being left on the table because we just can't flush dirty pages fast enough to keep modern SSDs busy from a single thread.

IOWs, this is a discussion topic for how we might work towards using multiple data flushing threads efficiently for XFS. Most efficient would be a flusher thread per AG, but that is unrealistic for high AG count filesystems. Similarly, per-CPU flushers do no good if we've only got 4 AGs in the filesystem.

This is made more difficult because the high speed collision that occurred years ago between the BDI infrastructure, dirty inode tracking, dirty inode writeback and cgroups has left this code a complex, fragile tangle of esoteric, untestable code. There are enough subtle race conditions between a single BDI flusher thread, writeback, mounts and the block device life cycle that everything is likely to break if we try to add concurrency into this code.

So there's a big architectural question here: do we start again and try to engineer something for XFS that does everything we need and then push that towards being a generic solution (like we did with iomap to replace bufferheads), or do we pull the loose string on the existing code and try to add IO concurrency into it without making the mess worse?

What other options do we have? What other approaches to the problem are there? Does this interact with SSD specific allocation policies in some way? Is delayed allocation even relevant anymore with SSDs that can do millions of IOPS?

Food for thought.

--
Dave Chinner
david@xxxxxxxxxxxxx
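
[For anyone wanting to reproduce the extent size hint trick above: a minimal, illustrative sketch of one way to set an inheritable 4kB extent size hint on a directory, using the generic FS_IOC_FSSETXATTR ioctl. This is roughly what "xfs_io -c 'extsize 4k' <dir>" does; the exact command/tool used for the numbers above isn't specified in the mail, so treat this purely as an example.]

/* set_extsize.c: set a 4kB inheritable extent size hint on a directory */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	struct fsxattr fsx;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <directory>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY | O_DIRECTORY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* read current attributes so we only modify the hint fields */
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSGETXATTR");
		return 1;
	}

	/*
	 * 4kB extent size hint in bytes; EXTSZINHERIT makes new files
	 * created in this directory inherit the hint, so their writes
	 * bypass delayed allocation.
	 */
	fsx.fsx_extsize = 4096;
	fsx.fsx_xflags |= FS_XFLAG_EXTSZINHERIT;

	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSSETXATTR");
		return 1;
	}
	return 0;
}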