Topic: Data writeback scalability

Scope: Performance
       Generic writeback infrastructure limitations
       Buffered IO sucks

Proposal:

This is largely a "how do we solve this" discussion topic.

In running small file writeback tests recently, I've realised that the single biggest performance limitation is the single threaded BDI flush worker that builds data IOs from the page cache of dirty inodes. XFS does delayed allocation, so writeback overhead is largely determined by how many physical allocations need to be done to write back the dirty data. It turns out that it's not that many.

The BDI flusher becomes completely CPU bound at about 80-90k allocations per second, which means we are limited to writing back data to that many separate regions per second. If we are writing back small files (say 4kB each), then the generic data writeback infrastructure cannot clean more than 100k files/s. I've got a device that can do 1.6M random 4kB write IOPS with aio+dio, yet I can't use more than about 5% of that capacity through the page cache....

So I gamed the system: I set an extent size hint of 4kB on the root directory so that delayed allocation is never used. That got me to 160k files/s at about 40k IOPS, with the flusher thread about 70% busy. Everything was still blocking in balance_dirty_pages_ratelimited(), so there's still a huge amount of IO performance being left on the table because we just can't flush dirty pages fast enough to keep modern SSDs busy from a single thread.

IOWs, this is a discussion topic for how we might work towards using multiple data flushing threads efficiently for XFS. Most efficient would be a flusher thread per AG, but that is unrealistic for high AG count filesystems. Similarly, per-CPU flushers do no good if we've only got 4 AGs in the filesystem.

This is made more difficult because the high speed collision that occurred years ago between the BDI infrastructure, dirty inode tracking, dirty inode writeback and cgroups has left this code a complex, fragile tangle of esoteric, untestable code. There are enough subtle race conditions between a single BDI flusher thread, writeback, mounts and the block device life cycle that everything is likely to break if we try to add concurrency into this code.

So there's a big architectural question here: do we start again and try to engineer something for XFS that does everything we need and then push that towards being a generic solution (like we did with iomap to replace bufferheads), or do we pull the loose string on the existing code and try to add IO concurrency into it without making the mess worse?

What other options do we have? What other approaches to the problem are there? Does this interact with SSD specific allocation policies in some way? Is delayed allocation even relevant anymore with SSDs that can do millions of IOPS?

Food for thought.

--
Dave Chinner
david@xxxxxxxxxxxxx
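
[For anyone wanting to reproduce the extent size hint trick above: a minimal, illustrative sketch of one way to set an inheritable 4kB extent size hint on a directory, using the generic FS_IOC_FSSETXATTR ioctl. This is roughly what "xfs_io -c 'extsize 4k' <dir>" does; the exact command/tool used for the numbers above isn't specified in the mail, so treat this purely as an example.]

/* set_extsize.c: set a 4kB inheritable extent size hint on a directory */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	struct fsxattr fsx;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <directory>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY | O_DIRECTORY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* read current attributes so we only modify the hint fields */
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSGETXATTR");
		return 1;
	}

	/*
	 * 4kB extent size hint in bytes; EXTSZINHERIT makes new files
	 * created in this directory inherit the hint, so their writes
	 * bypass delayed allocation.
	 */
	fsx.fsx_extsize = 4096;
	fsx.fsx_xflags |= FS_XFLAG_EXTSZINHERIT;

	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSSETXATTR");
		return 1;
	}
	return 0;
}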