On Wed, Mar 30, 2016 at 09:07:48AM -0600, Jens Axboe wrote:
> Hi,
>
> This patchset isn't as much a final solution, as it's demonstration
> of what I believe is a huge issue. Since the dawn of time, our
> background buffered writeback has sucked. When we do background
> buffered writeback, it should have little impact on foreground
> activity. That's the definition of background activity... But for as
> long as I can remember, heavy buffered writers has not behaved like
> that. For instance, if I do something like this:
>
> $ dd if=/dev/zero of=foo bs=1M count=10k
>
> on my laptop, and then try and start chrome, it basically won't start
> before the buffered writeback is done. Or, for server oriented
> workloads, where installation of a big RPM (or similar) adversely
> impacts data base reads or sync writes. When that happens, I get people
> yelling at me.
>
> Last time I posted this, I used flash storage as the example. But
> this works equally well on rotating storage. Let's run a test case
> that writes a lot. This test writes 50 files, each 100M, on XFS on
> a regular hard drive. While this happens, we attempt to read
> another file with fio.
>
> Writers:
>
> $ time (./write-files ; sync)
> real 1m6.304s
> user 0m0.020s
> sys 0m12.210s

Great. So a basic IO test looks good - let's throw something more
complex at it. Say, a benchmark I've been using for years to stress
the IO subsystem, the filesystem and memory reclaim all at the same
time: a concurrent fsmark inode creation test.
(first google hit https://lkml.org/lkml/2013/9/10/46)

This generates thousands of REQ_WRITE metadata IOs every second, so
if I understand how the throttle works correctly, these would be
classified as background writeback by the block layer throttle.
And....

FSUse%        Count         Size    Files/sec     App Overhead
     0      1600000            0     255845.0         10796891
     0      3200000            0     261348.8         10842349
     0      4800000            0     249172.3         14121232
     0      6400000            0     245172.8         12453759
     0      8000000            0     201249.5         14293100
     0      9600000            0     200417.5         29496551
>>>> 0     11200000            0      90399.6         40665397
     0     12800000            0     212265.6         21839031
     0     14400000            0     206398.8         32598378
     0     16000000            0     197589.7         26266552
     0     17600000            0     206405.2         16447795
>>>> 0     19200000            0      99189.6         87650540
     0     20800000            0     249720.8         12294862
     0     22400000            0     138523.8         47330007
>>>> 0     24000000            0      85486.2         14271096
     0     25600000            0     157538.1         64430611
     0     27200000            0     109677.8         47835961
     0     28800000            0     207230.5         31301031
     0     30400000            0     188739.6         33750424
     0     32000000            0     174197.9         41402526
     0     33600000            0     139152.0        100838085
     0     35200000            0     203729.7         34833764
     0     36800000            0     228277.4         12459062
>>>> 0     38400000            0      94962.0         30189182
     0     40000000            0     166221.9         40564922
>>>> 0     41600000            0      62902.5         80098461
     0     43200000            0     217932.6         22539354
     0     44800000            0     189594.6         24692209
     0     46400000            0     137834.1         39822038
     0     48000000            0     240043.8         12779453
     0     49600000            0     176830.8         16604133
     0     51200000            0     180771.8         32860221

real    5m35.967s
user    3m57.054s
sys     48m53.332s

In those highlighted report points, the performance has dropped
significantly. The typical range I expect to see once memory has
filled (a bit over 8m inodes) is 180k-220k files/sec. Runtime on a
vanilla kernel was 4m40s and there were no performance drops, so this
workload runs almost a minute slower with the block layer throttling
code.

What I see in these performance dips is the XFS transaction subsystem
stalling *completely* - instead of running at a steady state of
around 350,000 transactions/s, there are *zero* transactions running
for periods of up to ten seconds. This coincides with the CPU usage
falling to almost zero as well. AFAICT, the only thing that is
running when the filesystem stalls like this is memory reclaim.

Without the block throttling patches, the workload quickly finds a
steady state of around 7.5-8.5 million cached inodes, and it doesn't
vary much outside those bounds.

With the block throttling patches, on every transaction subsystem
stall that occurs, the inode cache gets 3-4 million inodes trimmed
out of it (i.e. half the cache), and in a couple of cases I saw it
trim 6+ million inodes from the cache before the transactions
started up and the cache started growing again.

> The above was run without scsi-mq, and with using the deadline scheduler,
> results with CFQ are similary depressing for this test. So IO scheduling
> is in place for this test, it's not pure blk-mq without scheduling.

virtio in guest, XFS direct IO -> no-op -> scsi in host.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
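
For anyone wanting to re-check the numbers in the fs_mark report
above, here is a minimal Python sketch. It is only an illustration,
not part of the test setup described in this mail. The (count,
files/sec) pairs are copied from the table; the 100k files/sec cutoff
is an assumption, chosen because it reproduces the rows marked ">>>>"
(the expected steady-state range quoted above is 180k-220k). The
names RESULTS, DIP_THRESHOLD, find_dips and to_seconds are
hypothetical.

```python
# (inode count, files/sec) pairs copied from the fs_mark report above.
RESULTS = [
    (1600000, 255845.0), (3200000, 261348.8),
    (4800000, 249172.3), (6400000, 245172.8),
    (8000000, 201249.5), (9600000, 200417.5),
    (11200000, 90399.6), (12800000, 212265.6),
    (14400000, 206398.8), (16000000, 197589.7),
    (17600000, 206405.2), (19200000, 99189.6),
    (20800000, 249720.8), (22400000, 138523.8),
    (24000000, 85486.2), (25600000, 157538.1),
    (27200000, 109677.8), (28800000, 207230.5),
    (30400000, 188739.6), (32000000, 174197.9),
    (33600000, 139152.0), (35200000, 203729.7),
    (36800000, 228277.4), (38400000, 94962.0),
    (40000000, 166221.9), (41600000, 62902.5),
    (43200000, 217932.6), (44800000, 189594.6),
    (46400000, 137834.1), (48000000, 240043.8),
    (49600000, 176830.8), (51200000, 180771.8),
]

# Assumed cutoff (not stated in the mail): roughly half of the
# expected 180k-220k files/sec steady-state range.
DIP_THRESHOLD = 100_000.0


def find_dips(results, threshold):
    """Return the inode counts of report points below the threshold."""
    return [count for count, rate in results if rate < threshold]


def to_seconds(minutes, seconds):
    """Convert a 'XmY.Zs' wall-clock time into seconds."""
    return minutes * 60 + seconds


dips = find_dips(RESULTS, DIP_THRESHOLD)

# Wall-clock runtimes quoted in the mail: 5m35.967s with the
# throttling patches, 4m40s on a vanilla kernel.
throttled_s = to_seconds(5, 35.967)
vanilla_s = to_seconds(4, 40.0)
slowdown_s = throttled_s - vanilla_s
slowdown_pct = 100.0 * slowdown_s / vanilla_s

print("report points below threshold:", dips)
print("slowdown: %.1fs (%.0f%%)" % (slowdown_s, slowdown_pct))
```

With this cutoff the flagged points are exactly the five highlighted
rows, and the runtime difference works out to just under 56 seconds,
i.e. the "almost a minute slower" figure, or about 20% of the vanilla
runtime.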