On Mon, Jul 18, 2011 at 09:53:39AM +0200, Jörn Engel wrote:
> On Mon, 18 July 2011 09:32:52 +1000, Dave Chinner wrote:
> > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> > >
> > > Numbers below were created with sysbench, using directIO.  Each block
> > > is a matrix with results for blocksizes from 512B to 16384B and thread
> > > counts from 1 to 128.  Four blocks for reads and writes, both
> > > sequential and random.
> >
> > What's the command line/script used to generate the result matrix?
> > And what kernel are you running on?
>
> Script is attached.  Kernel is git from July 13th (51414d41).

Ok, thanks.

> > > xfs:
> > > ====
> > > seqrd        1      2      4      8     16     32     64    128
> > > 16384     4698   4424   4397   4402   4394   4398   4642   4679
> > >  8192     6234   5827   5797   5801   5795   6114   5793   5812
> > >  4096     9100   8835   8882   8896   8874   8890   8910   8906
> > >  2048    14922  14391  14259  14248  14264  14264  14269  14273
> > >  1024    23853  22690  22329  22362  22338  22277  22240  22301
> > >   512    37353  33990  33292  33332  33306  33296  33224  33271
> >
> > Something is single threading completely there - something is very
> > wrong.  Someone want to send me a nice fast pci-e SSD - my disks
> > don't spin that fast... :/
>
> I wish I could just go down the shop and pick one from the
> manufacturing line. :/

Heh. At this point any old pci-e SSD would be an improvement ;)

> > > rndwr        1       2       4       8      16      32      64     128
> > > 16384    38447   38153   38145   38140   38156   38199   38208   38236
> > >  8192    78001   76965   76908   76945   77023   77174   77166   77106
> > >  4096   160721  156000  157196  157084  157078  157123  156978  157149
> > >  2048   325395  317148  317858  318442  318750  318981  319798  320393
> > >  1024   434084  649814  650176  651820  653928  654223  655650  655818
> > >   512   501067  876555 1290292 1217671 1244399 1267729 1285469 1298522
> >
> > I'm assuming that if the h/w can do 650MB/s then the numbers are in
> > IOPS?  From 4 threads up, all results equate to 650MB/s.
>
> Correct.  Writes are spread automatically across all chips.  They are
> further cached, so until every chip is busy writing, their effective
> latency is pretty much 0.  Makes for a pretty flat graph, I agree.

> > > Sequential reads are pretty horrible.  Sequential writes are hitting a
> > > hot lock again.
> >
> > lockstat output?
>
> Attached for the bottom right case each of seqrd and seqwr.  I hope
> the filenames are descriptive enough.

Looks like you attached the seqrd lockstat twice.

> Lockstat itself hurts performance.  Writes dropped to 32245 IO/s from
> 298013, reads to 22458 IO/s from 33271.  In a way we are measuring
> oranges to figure out why our apples are so small.

Yeah, but at least it points out the lock in question - the iolock.
We grab it exclusively for a very short period of time on each direct
IO read to check the page cache state, then demote it to shared.  I
can see that when IO times are very short, this will, in fact,
serialise multiple readers to a single file.
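For context, this is roughly the shape of the locking in
xfs_file_aio_read() before the patch at the end of this mail - a
paraphrased sketch reconstructed from the hunks the patch removes,
not a verbatim copy of the source, with the cache invalidation
elided:

	if (unlikely(ioflags & IO_ISDIRECT)) {
		/*
		 * Exclusive iolock just to look at the page cache state.
		 * Getting it has to wait for every IO already in flight,
		 * because in-flight IO holds the iolock shared.
		 */
		xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
		if (inode->i_mapping->nrpages) {
			/* ... flush and invalidate cached pages here ... */
		}
		/* demote to shared for the duration of the IO itself */
		xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
	} else
		xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);

So every direct IO read briefly excludes all other readers of the
inode, even when there is nothing in the page cache to invalidate.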
A single thread shows this locking pattern:

  sysbench-3087 [000] 2192558.643146: xfs_ilock:        dev 253:0 ino 0x83 flags IOLOCK_EXCL caller xfs_rw_ilock
  sysbench-3087 [000] 2192558.643147: xfs_ilock_demote: dev 253:0 ino 0x83 flags IOLOCK_EXCL caller T.1428
  sysbench-3087 [000] 2192558.643150: xfs_ilock:        dev 253:0 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_map_shared
  sysbench-3087 [001] 2192558.643877: xfs_ilock:        dev 253:0 ino 0x83 flags IOLOCK_EXCL caller xfs_rw_ilock
  sysbench-3087 [001] 2192558.643879: xfs_ilock_demote: dev 253:0 ino 0x83 flags IOLOCK_EXCL caller T.1428
  sysbench-3087 [007] 2192558.643881: xfs_ilock:        dev 253:0 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_map_shared

Two threads show this:

  sysbench-3096 [005] 2192697.678308: xfs_ilock:        dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller xfs_rw_ilock
  sysbench-3096 [005] 2192697.678314: xfs_ilock_demote: dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller T.1428
  sysbench-3096 [005] 2192697.678335: xfs_ilock:        dev 253:0 ino 0x1c02c2 flags ILOCK_SHARED caller xfs_ilock_map_shared
  sysbench-3097 [006] 2192697.678556: xfs_ilock:        dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller xfs_rw_ilock
  sysbench-3097 [006] 2192697.678556: xfs_ilock_demote: dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller T.1428
  sysbench-3097 [006] 2192697.678577: xfs_ilock:        dev 253:0 ino 0x1c02c2 flags ILOCK_SHARED caller xfs_ilock_map_shared
  sysbench-3096 [007] 2192697.678976: xfs_ilock:        dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller xfs_rw_ilock
  sysbench-3096 [007] 2192697.678978: xfs_ilock_demote: dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller T.1428
  sysbench-3096 [007] 2192697.679000: xfs_ilock:        dev 253:0 ino 0x1c02c2 flags ILOCK_SHARED caller xfs_ilock_map_shared

Which shows the exclusive lock taken by the concurrent IO serialising
against the IO already in progress.  Oops, that's not good.

Ok, here are the numbers from my test setup for a 16k IO size with
and without the patch below:

  seqrd        1      2      4      8     16
  vanilla   3603   2798   2563    not tested...
  patched   3707   5746  10304  12875  11016

So those numbers look a lot healthier.  The patch is below.

> --
> Fancy algorithms are slow when n is small, and n is usually small.
> Fancy algorithms have big constants. Until you know that n is
> frequently going to be big, don't get fancy.
> -- Rob Pike

Heh. XFS always assumes n will be big.  Because where XFS is used, it
just is.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

xfs: don't serialise direct IO reads on page cache checks

From: Dave Chinner <dchinner@xxxxxxxxxx>

There is no need to grab the i_mutex or the IO lock in exclusive mode
if we don't need to invalidate the page cache.  Taking these locks on
every direct IO effectively serialises them, as taking the IO lock in
exclusive mode has to wait for all shared holders to drop the lock.
That only happens when IO is complete, so effectively it prevents
dispatch of concurrent direct IO reads to the same inode.

Fix this by taking the IO lock shared to check the page cache state,
and only then drop it and take the IO lock exclusively if there is
work to be done.  Hence for the normal direct IO case, no exclusive
locking will occur.
Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
---
 fs/xfs/linux-2.6/xfs_file.c |   17 ++++++++++++++---
 1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index 1e641e6..16a4bf0 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -321,7 +321,19 @@ xfs_file_aio_read(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
-	if (unlikely(ioflags & IO_ISDIRECT)) {
+	/*
+	 * Locking is a bit tricky here. If we take an exclusive lock
+	 * for direct IO, we effectively serialise all new concurrent
+	 * read IO to this file and block it behind IO that is currently in
+	 * progress because IO in progress holds the IO lock shared. We only
+	 * need to hold the lock exclusive to blow away the page cache, so
+	 * only take lock exclusively if the page cache needs invalidation.
+	 * This allows the normal direct IO case of no page cache pages to
+	 * proceeed concurrently without serialisation.
+	 */
+	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+	if ((ioflags & IO_ISDIRECT) && inode->i_mapping->nrpages) {
+		xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
 		xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
 
 		if (inode->i_mapping->nrpages) {
@@ -334,8 +346,7 @@ xfs_file_aio_read(
 			}
 		}
 		xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
-	} else
-		xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+	}
 
 	trace_xfs_file_read(ip, size, iocb->ki_pos, ioflags);
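[Editor's note: Jörn's sysbench script is an attachment and is not
reproduced here.  The program below is only a hypothetical, minimal
stand-in showing the access pattern under discussion - several
threads issuing O_DIRECT reads against one file - for anyone wanting
a self-contained reproducer; the thread count, block size and
file-name handling are made up for illustration.  Build with
"gcc -pthread" and run it against a preallocated file on the
filesystem under test.]

/*
 * Hypothetical stand-in for the attached sysbench script: NTHREADS
 * threads doing interleaved O_DIRECT reads from a single file.
 */
#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NTHREADS	8
#define BLKSZ		16384		/* matches the 16k row above */
#define NREADS		100000

static int fd;

static void *reader(void *arg)
{
	long id = (long)arg;
	off_t off = id * BLKSZ;
	void *buf;
	long i;

	/* O_DIRECT needs an aligned buffer, offset and length */
	if (posix_memalign(&buf, 4096, BLKSZ))
		return NULL;
	for (i = 0; i < NREADS; i++) {
		if (pread(fd, buf, BLKSZ, off) <= 0)
			break;			/* error or EOF */
		off += NTHREADS * BLKSZ;	/* interleaved sequential reads */
	}
	free(buf);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[NTHREADS];
	long i;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, reader, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	close(fd);
	return 0;
}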