On Thu, May 15, 2014 at 05:41:53PM +0200, Jan Kara wrote:
> Hello,
>
>   so I was recently thinking about how writeback code shuffles inodes between
> lists and also how redirty_tail() clobbers dirtied_when timestamp (which broke
> my sync(2) optimization). This patch series came out of that. Patch 1 is a
> clear win and just needs an independent review that I didn't forget about
> something. Patch 3 changes writeback list handling - IMHO it makes the logic
> somewhat more straightforward as we don't have to bother shuffling inodes
> between lists and we also don't need to clobber dirtied_when timestamp.
> But opinions may differ...
>
> Patches passed xfstests run and I did some basic writeback tests using tiobench
> and some artificial sync livelock tests to verify nothing regressed. So I'd
> be happy if people could have a look.

Performance regresses significantly. The test is on a 16p/16GB VM with a
sparse 100TB XFS filesystem backed by a pair of SSDs in RAID0:

./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7

That creates 10 million 4k files with 16 threads and 10000 files per
directory. No sync/fsync is done, so it's a pure background writeback
workload. For 0-400,000 files it runs in memory, at 400-800k files
background writeback is occurring, and at >800k files foreground
throttling is occurring.

The file create rates and write IOPS/bandwidth are:

                      vanilla                   patched
load point      files  iops  bw           files  iops  bw
< bg thres       120k     0  0             110k     0  0
< fg thres       120k   37k  210MB/s        60k   20k  130MB/s
sustained         36k   37k  210MB/s        25k   28k  150MB/s

The average create rate is 40k (vanilla) vs 28k (patched). Wall times:

            vanilla      patched
real      4m27.475s    6m29.364s
user       1m7.072s     1m3.590s
sys       10m0.836s   22m34.362s

The new code burns more than twice the system CPU whilst going
significantly slower. I haven't done any further investigation to
determine which patch causes the regression, but it's large enough that
you should be able to reproduce it.

BTW, while touching this code, we should also add plugging at the upper
inode writeback level - it provides a 20% performance boost to this
workload. The numbers in the patch description below are old, but I just
verified that 3.15-rc5 gives the same scale of improvement. e.g. it
almost completely negates the throughput and wall time regressions that
this patchset introduces:

                      vanilla               patched              patched+plug
load point      files  iops  bw       files  iops  bw        files  iops  bw
< bg thres       120k     0  0         110k     0  0          120k     0  0
< fg thres       120k   37k  210MB/s    60k   20k  130MB/s     80k    1k  180MB/s
sustained         36k   37k  210MB/s    25k   28k  150MB/s     33k  1.5k  200MB/s

            vanilla      patched     patched+plug
real      4m27.475s    6m29.364s       4m40.524s
user       1m7.072s     1m3.590s       0m55.819s
sys       10m0.836s   22m34.362s       18m8.130s

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

writeback: plug writeback at a high level

From: Dave Chinner <dchinner@xxxxxxxxxx>

Doing writeback on lots of little files causes terrible IOPS storms
because of the per-mapping writeback plugging we do. This essentially
causes immediate dispatch of IO for each mapping, regardless of the
context in which writeback is occurring.

IOWs, running a concurrent write-lots-of-small-4k-files workload using
fsmark on XFS results in a huge number of IOPS being issued for data
writes. Metadata writes are sorted and plugged at a high level by XFS,
so they aggregate nicely into large IOs. However, data writeback IOs
are dispatched in individual 4k IOs, even when the blocks of two
consecutively written files are adjacent.
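For reference, the per-mapping plugging in question is the plug taken
around each individual mapping's writeback in generic_writepages();
roughly (modulo kernel version) it looks like the below, so the plug is
started, finished and the IO dispatched once per inode rather than once
per writeback pass:

/* mm/page-writeback.c, approximate shape in kernels of this era */
int generic_writepages(struct address_space *mapping,
		       struct writeback_control *wbc)
{
	struct blk_plug plug;
	int ret;

	/* deal with chardevs and other special files */
	if (!mapping->a_ops->writepage)
		return 0;

	/* the plug only covers this one mapping's dirty pages... */
	blk_start_plug(&plug);
	ret = write_cache_pages(mapping, wbc, __writepage, mapping);
	/* ...so the queue unplugs and IO is dispatched for every inode */
	blk_finish_plug(&plug);
	return ret;
}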
Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
	 metadata CRCs enabled.

Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)

Test:

$ ./fs_mark -D 10000 -S0 -n 10000 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7

Result:

                wall     sys      create rate     Physical write IO
                time     CPU      (avg files/s)    IOPS   Bandwidth
                -----    -----    -------------   ------  ---------
unpatched       6m56s    15m47s   24,000+/-500    26,000  130MB/s
patched         5m06s    13m28s   32,800+/-600     1,500  180MB/s
improvement    -26.44%  -14.68%      +36.67%      -94.23%  +38.46%

If I use zero length files, this workload runs at about 500 IOPS, so
plugging drops the data IOs from roughly 25,500/s to 1000/s.

3 lines of code, 35% better throughput for 15% less CPU.

The benefits of plugging at this layer are likely to be higher for
spinning media as the IO patterns for this workload are going to make a
much bigger difference on high IO latency devices.....

Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
 fs/fs-writeback.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 426ff81..7cd2b3a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -505,6 +505,9 @@ static long writeback_inodes(struct bdi_writeback *wb,
 	long write_chunk;
 	long wrote = 0;		/* count both pages and inodes */
 	struct inode *inode, *next;
+	struct blk_plug plug;
+
+	blk_start_plug(&plug);
 
 restart:
 	/* We use list_safe_reset_next() to make the list iteration safe */
@@ -603,6 +606,7 @@ restart:
 			break;
 		}
 	}
 
+	blk_finish_plug(&plug);
 	return wrote;
 }
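As a footnote on why the high-level plug subsumes the per-mapping plugs
rather than fighting them: blk_start_plug() only installs a plug on the
task if none is already active, so with the plug above held in
writeback_inodes(), the plug taken in generic_writepages() is never
installed and the bios queue on the outer plug until blk_finish_plug()
sorts, merges and dispatches them. A minimal sketch of the pattern -
writeback_many_mappings() and its arguments are made up for
illustration, only blk_start_plug()/blk_finish_plug() and
do_writepages() are real interfaces:

#include <linux/blkdev.h>	/* struct blk_plug, blk_start_plug(), blk_finish_plug() */
#include <linux/fs.h>		/* struct address_space */
#include <linux/writeback.h>	/* struct writeback_control, do_writepages() */

/* Illustrative only: one plug held across writeback of many mappings. */
static void writeback_many_mappings(struct address_space **mappings, int nr,
				    struct writeback_control *wbc)
{
	struct blk_plug plug;
	int i;

	blk_start_plug(&plug);		/* installed as current->plug */
	for (i = 0; i < nr; i++) {
		/*
		 * do_writepages() ends up taking its own plug further down
		 * (e.g. in generic_writepages()), but with the outer plug
		 * active that inner plug is a no-op, so the bios accumulate
		 * here instead of being dispatched per mapping.
		 * (Error handling omitted for brevity.)
		 */
		do_writepages(mappings[i], wbc);
	}
	blk_finish_plug(&plug);		/* dispatch all queued IO, merged across mappings */
}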