On Thu, May 15, 2014 at 05:41:53PM +0200, Jan Kara wrote:
> Hello,
>
>   so I was recently thinking about how writeback code shuffles inodes between
> lists and also how redirty_tail() clobbers dirtied_when timestamp (which broke
> my sync(2) optimization). This patch series came out of that. Patch 1 is a
> clear win and just needs an independent review that I didn't forget about
> something. Patch 3 changes writeback list handling - IMHO it makes the logic
> somewhat more straightforward as we don't have to bother shuffling inodes
> between lists and we also don't need to clobber dirtied_when timestamp.
> But opinions may differ...
>
> Patches passed xfstests run and I did some basic writeback tests using tiobench
> and some artificial sync livelock tests to verify nothing regressed. So I'd
> be happy if people could have a look.

Performance regresses significantly. The test is on a 16p/16GB VM with a
sparse 100TB XFS filesystem backed by a pair of SSDs in RAID0:

./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7

That creates 10 million 4k files with 16 threads and 10000 files per
directory. No sync/fsync is done, so it's a pure background writeback
workload. For 0-400,000 files it runs in memory, at 400-800k files
background writeback is occurring, and at >800k files foreground
throttling is occurring.

The file create rates and write IOPS/bandwidth are:

                      vanilla                   patched
load point      files  iops  bw           files  iops  bw
< bg thres       120k     0  0             110k     0  0
< fg thres       120k   37k  210MB/s        60k   20k  130MB/s
sustained         36k   37k  210MB/s        25k   28k  150MB/s

The average create rate is 40k (vanilla) vs 28k (patched). Wall times:

            vanilla      patched
real      4m27.475s    6m29.364s
user       1m7.072s     1m3.590s
sys       10m0.836s   22m34.362s

The new code burns more than twice the system CPU whilst going
significantly slower. I haven't done any further investigation to
determine which patch causes the regression, but it's large enough that
you should be able to reproduce it.

BTW, while touching this code, we should also add plugging at the upper
inode writeback level - it provides a 20% performance boost to this
workload. The numbers in the patch description below are old, but I just
verified that 3.15-rc5 gives the same scale of improvement. e.g. it
almost completely negates the throughput and wall time regressions that
this patchset introduces:

                      vanilla               patched              patched+plug
load point      files  iops  bw       files  iops  bw        files  iops  bw
< bg thres       120k     0  0         110k     0  0          120k     0  0
< fg thres       120k   37k  210MB/s    60k   20k  130MB/s     80k    1k  180MB/s
sustained         36k   37k  210MB/s    25k   28k  150MB/s     33k  1.5k  200MB/s

            vanilla      patched     patched+plug
real      4m27.475s    6m29.364s       4m40.524s
user       1m7.072s     1m3.590s       0m55.819s
sys       10m0.836s   22m34.362s       18m8.130s

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

writeback: plug writeback at a high level

From: Dave Chinner <dchinner@xxxxxxxxxx>

Doing writeback on lots of little files causes terrible IOPS storms
because of the per-mapping writeback plugging we do. This essentially
causes immediate dispatch of IO for each mapping, regardless of the
context in which writeback is occurring.

IOWs, running a concurrent write-lots-of-small-4k-files workload using
fsmark on XFS results in a huge number of IOPS being issued for data
writes. Metadata writes are sorted and plugged at a high level by XFS,
so they aggregate nicely into large IOs. However, data writeback IOs
are dispatched in individual 4k IOs, even when the blocks of two
consecutively written files are adjacent.
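For reference, the per-mapping plugging in question is the plug taken
around each individual mapping's writeback in generic_writepages();
roughly (modulo kernel version) it looks like the below, so the plug is
started, finished and the IO dispatched once per inode rather than once
per writeback pass:

/* mm/page-writeback.c, approximate shape in kernels of this era */
int generic_writepages(struct address_space *mapping,
		       struct writeback_control *wbc)
{
	struct blk_plug plug;
	int ret;

	/* deal with chardevs and other special files */
	if (!mapping->a_ops->writepage)
		return 0;

	/* the plug only covers this one mapping's dirty pages... */
	blk_start_plug(&plug);
	ret = write_cache_pages(mapping, wbc, __writepage, mapping);
	/* ...so the queue unplugs and IO is dispatched for every inode */
	blk_finish_plug(&plug);
	return ret;
}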
Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
	 metadata CRCs enabled.

Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)

Test:

$ ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7

Result:

                wall     sys      create rate     Physical write IO
                time     CPU      (avg files/s)    IOPS   Bandwidth
                -----    -----    -------------   ------  ---------
unpatched       6m56s    15m47s   24,000+/-500    26,000  130MB/s
patched         5m06s    13m28s   32,800+/-600     1,500  180MB/s
improvement    -26.44%  -14.68%      +36.67%      -94.23%  +38.46%

If I use zero length files, this workload runs at about 500 IOPS, so
plugging drops the data IOs from roughly 25,500/s to 1000/s.

3 lines of code, 35% better throughput for 15% less CPU.

The benefits of plugging at this layer are likely to be higher for
spinning media as the IO patterns for this workload are going to make a
much bigger difference on high IO latency devices.....

Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
 fs/fs-writeback.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 426ff81..7cd2b3a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -505,6 +505,9 @@ static long writeback_inodes(struct bdi_writeback *wb,
 	long write_chunk;
 	long wrote = 0;		/* count both pages and inodes */
 	struct inode *inode, *next;
+	struct blk_plug plug;
+
+	blk_start_plug(&plug);
 
 restart:
 	/* We use list_safe_reset_next() to make the list iteration safe */
@@ -603,6 +606,7 @@ restart:
 			break;
 		}
 	}
 
+	blk_finish_plug(&plug);
 	return wrote;
 }
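As a footnote on why the high-level plug subsumes the per-mapping plugs
rather than fighting them: blk_start_plug() only installs a plug on the
task if none is already active, so with the plug above held in
writeback_inodes(), the plug taken in generic_writepages() is never
installed and the bios queue on the outer plug until blk_finish_plug()
sorts, merges and dispatches them. A minimal sketch of the pattern -
writeback_many_mappings() and its arguments are made up for
illustration, only blk_start_plug()/blk_finish_plug() and
do_writepages() are real interfaces:

#include <linux/blkdev.h>	/* struct blk_plug, blk_start_plug(), blk_finish_plug() */
#include <linux/fs.h>		/* struct address_space */
#include <linux/writeback.h>	/* struct writeback_control, do_writepages() */

/* Illustrative only: one plug held across writeback of many mappings. */
static void writeback_many_mappings(struct address_space **mappings, int nr,
				    struct writeback_control *wbc)
{
	struct blk_plug plug;
	int i;

	blk_start_plug(&plug);		/* installed as current->plug */
	for (i = 0; i < nr; i++) {
		/*
		 * do_writepages() ends up taking its own plug further down
		 * (e.g. in generic_writepages()), but with the outer plug
		 * active that inner plug is a no-op, so the bios accumulate
		 * here instead of being dispatched per mapping.
		 * (Error handling omitted for brevity.)
		 */
		do_writepages(mappings[i], wbc);
	}
	blk_finish_plug(&plug);		/* dispatch all queued IO, merged across mappings */
}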