On Mon, Jul 18, 2011 at 09:53:39AM +0200, Jörn Engel wrote:
> On Mon, 18 July 2011 09:32:52 +1000, Dave Chinner wrote:
> > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> > >
> > > Numbers below were created with sysbench, using directIO.  Each block
> > > is a matrix with results for blocksizes from 512B to 16384B and thread
> > > counts from 1 to 128.  Four blocks for reads and writes, both
> > > sequential and random.
> >
> > What's the command line/script used to generate the result matrix?
> > And what kernel are you running on?
>
> Script is attached.  Kernel is git from July 13th (51414d41).

Ok, thanks.

> > > xfs:
> > > ====
> > > seqrd        1      2      4      8     16     32     64    128
> > > 16384     4698   4424   4397   4402   4394   4398   4642   4679
> > >  8192     6234   5827   5797   5801   5795   6114   5793   5812
> > >  4096     9100   8835   8882   8896   8874   8890   8910   8906
> > >  2048    14922  14391  14259  14248  14264  14264  14269  14273
> > >  1024    23853  22690  22329  22362  22338  22277  22240  22301
> > >   512    37353  33990  33292  33332  33306  33296  33224  33271
> >
> > Something is single threading completely there - something is very
> > wrong.  Someone want to send me a nice fast pci-e SSD - my disks
> > don't spin that fast... :/
>
> I wish I could just go down the shop and pick one from the
> manufacturing line. :/

Heh. At this point any old pci-e SSD would be an improvement ;)

> > > rndwr        1       2       4       8      16      32      64     128
> > > 16384    38447   38153   38145   38140   38156   38199   38208   38236
> > >  8192    78001   76965   76908   76945   77023   77174   77166   77106
> > >  4096   160721  156000  157196  157084  157078  157123  156978  157149
> > >  2048   325395  317148  317858  318442  318750  318981  319798  320393
> > >  1024   434084  649814  650176  651820  653928  654223  655650  655818
> > >   512   501067  876555 1290292 1217671 1244399 1267729 1285469 1298522
> >
> > I'm assuming that if the h/w can do 650MB/s then the numbers are in
> > IOPS?  From 4 threads up, all results equate to 650MB/s.
>
> Correct.  Writes are spread automatically across all chips.  They are
> further cached, so until every chip is busy writing, their effective
> latency is pretty much 0.  Makes for a pretty flat graph, I agree.

> > > Sequential reads are pretty horrible.  Sequential writes are hitting a
> > > hot lock again.
> >
> > lockstat output?
>
> Attached for the bottom right case each of seqrd and seqwr.  I hope
> the filenames are descriptive enough.

Looks like you attached the seqrd lockstat twice.

> Lockstat itself hurts performance.  Writes dropped to 32245 IO/s from
> 298013, reads to 22458 IO/s from 33271.  In a way we are measuring
> oranges to figure out why our apples are so small.

Yeah, but at least it points out the lock in question - the iolock.
We grab it exclusively for a very short period of time on each direct
IO read to check the page cache state, then demote it to shared.  I
can see that when IO times are very short, this will, in fact,
serialise multiple readers to a single file.
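For context, this is roughly the shape of the locking in
xfs_file_aio_read() before the patch at the end of this mail - a
paraphrased sketch reconstructed from the hunks the patch removes,
not a verbatim copy of the source, with the cache invalidation
elided:

	if (unlikely(ioflags & IO_ISDIRECT)) {
		/*
		 * Exclusive iolock just to look at the page cache state.
		 * Getting it has to wait for every IO already in flight,
		 * because in-flight IO holds the iolock shared.
		 */
		xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
		if (inode->i_mapping->nrpages) {
			/* ... flush and invalidate cached pages here ... */
		}
		/* demote to shared for the duration of the IO itself */
		xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
	} else
		xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);

So every direct IO read briefly excludes all other readers of the
inode, even when there is nothing in the page cache to invalidate.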
A single thread shows this locking pattern:

  sysbench-3087 [000] 2192558.643146: xfs_ilock:        dev 253:0 ino 0x83 flags IOLOCK_EXCL caller xfs_rw_ilock
  sysbench-3087 [000] 2192558.643147: xfs_ilock_demote: dev 253:0 ino 0x83 flags IOLOCK_EXCL caller T.1428
  sysbench-3087 [000] 2192558.643150: xfs_ilock:        dev 253:0 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_map_shared
  sysbench-3087 [001] 2192558.643877: xfs_ilock:        dev 253:0 ino 0x83 flags IOLOCK_EXCL caller xfs_rw_ilock
  sysbench-3087 [001] 2192558.643879: xfs_ilock_demote: dev 253:0 ino 0x83 flags IOLOCK_EXCL caller T.1428
  sysbench-3087 [007] 2192558.643881: xfs_ilock:        dev 253:0 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_map_shared

Two threads show this:

  sysbench-3096 [005] 2192697.678308: xfs_ilock:        dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller xfs_rw_ilock
  sysbench-3096 [005] 2192697.678314: xfs_ilock_demote: dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller T.1428
  sysbench-3096 [005] 2192697.678335: xfs_ilock:        dev 253:0 ino 0x1c02c2 flags ILOCK_SHARED caller xfs_ilock_map_shared
  sysbench-3097 [006] 2192697.678556: xfs_ilock:        dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller xfs_rw_ilock
  sysbench-3097 [006] 2192697.678556: xfs_ilock_demote: dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller T.1428
  sysbench-3097 [006] 2192697.678577: xfs_ilock:        dev 253:0 ino 0x1c02c2 flags ILOCK_SHARED caller xfs_ilock_map_shared
  sysbench-3096 [007] 2192697.678976: xfs_ilock:        dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller xfs_rw_ilock
  sysbench-3096 [007] 2192697.678978: xfs_ilock_demote: dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller T.1428
  sysbench-3096 [007] 2192697.679000: xfs_ilock:        dev 253:0 ino 0x1c02c2 flags ILOCK_SHARED caller xfs_ilock_map_shared

Which shows the exclusive lock taken by the concurrent IO serialising
against the IO already in progress.  Oops, that's not good.

Ok, here are the numbers from my test setup for a 16k IO size with
and without the patch below:

  seqrd        1      2      4      8     16
  vanilla   3603   2798   2563    not tested...
  patched   3707   5746  10304  12875  11016

So those numbers look a lot healthier.  The patch is below.

> --
> Fancy algorithms are slow when n is small, and n is usually small.
> Fancy algorithms have big constants. Until you know that n is
> frequently going to be big, don't get fancy.
> -- Rob Pike

Heh. XFS always assumes n will be big.  Because where XFS is used, it
just is.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

xfs: don't serialise direct IO reads on page cache checks

From: Dave Chinner <dchinner@xxxxxxxxxx>

There is no need to grab the i_mutex or the IO lock in exclusive mode
if we don't need to invalidate the page cache.  Taking these locks on
every direct IO effectively serialises them, as taking the IO lock in
exclusive mode has to wait for all shared holders to drop the lock.
That only happens when IO is complete, so effectively it prevents
dispatch of concurrent direct IO reads to the same inode.

Fix this by taking the IO lock shared to check the page cache state,
and only then drop it and take the IO lock exclusively if there is
work to be done.  Hence for the normal direct IO case, no exclusive
locking will occur.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
 fs/xfs/linux-2.6/xfs_file.c |   17 ++++++++++++++---
 1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index 1e641e6..16a4bf0 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -321,7 +321,19 @@ xfs_file_aio_read(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
-	if (unlikely(ioflags & IO_ISDIRECT)) {
+	/*
+	 * Locking is a bit tricky here. If we take an exclusive lock
+	 * for direct IO, we effectively serialise all new concurrent
+	 * read IO to this file and block it behind IO that is currently in
+	 * progress because IO in progress holds the IO lock shared. We only
+	 * need to hold the lock exclusive to blow away the page cache, so
+	 * only take lock exclusively if the page cache needs invalidation.
+	 * This allows the normal direct IO case of no page cache pages to
+	 * proceeed concurrently without serialisation.
+	 */
+	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+	if ((ioflags & IO_ISDIRECT) && inode->i_mapping->nrpages) {
+		xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
 		xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
 
 		if (inode->i_mapping->nrpages) {
@@ -334,8 +346,7 @@ xfs_file_aio_read(
 			}
 		}
 		xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
-	} else
-		xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+	}
 
 	trace_xfs_file_read(ip, size, iocb->ki_pos, ioflags);
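[Editor's note: Jörn's sysbench script is an attachment and is not
reproduced here.  The program below is only a hypothetical, minimal
stand-in showing the access pattern under discussion - several
threads issuing O_DIRECT reads against one file - for anyone wanting
a self-contained reproducer; the thread count, block size and
file-name handling are made up for illustration.  Build with
"gcc -pthread" and run it against a preallocated file on the
filesystem under test.]

/*
 * Hypothetical stand-in for the attached sysbench script: NTHREADS
 * threads doing interleaved O_DIRECT reads from a single file.
 */
#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS	8
#define BLKSZ		16384		/* matches the 16k row above */
#define NREADS		100000

static int fd;

static void *reader(void *arg)
{
	long id = (long)arg;
	off_t off = id * BLKSZ;
	void *buf;
	long i;

	/* O_DIRECT needs an aligned buffer, offset and length */
	if (posix_memalign(&buf, 4096, BLKSZ))
		return NULL;
	for (i = 0; i < NREADS; i++) {
		if (pread(fd, buf, BLKSZ, off) <= 0)
			break;			/* error or EOF */
		off += NTHREADS * BLKSZ;	/* interleaved sequential reads */
	}
	free(buf);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[NTHREADS];
	long i;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, reader, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	close(fd);
	return 0;
}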