Dave Chinner <david@xxxxxxxxxxxxx> writes:

> On Tue, Nov 09, 2010 at 04:04:41PM -0500, Jeff Moyer wrote:
>> Dave Chinner <david@xxxxxxxxxxxxx> writes:
>>
>> > On Mon, Nov 08, 2010 at 10:36:06AM -0500, Jeff Moyer wrote:
>> >> Dave Chinner <david@xxxxxxxxxxxxx> writes:
>> >>
>> >> > From: Dave Chinner <dchinner@xxxxxxxxxx>
>> >> >
>> >> > To avoid concerns that a single list and lock tracking the unaligned
>> >> > IOs will not scale appropriately, create multiple lists and locks
>> >> > and chose them by hashing the unaligned block being zeroed.
>> >> >
>> >> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
>> >> > ---
>> >> >  fs/direct-io.c |   49 ++++++++++++++++++++++++++++++++++++-------------
>> >> >  1 files changed, 36 insertions(+), 13 deletions(-)
>> >> >
>> >> > diff --git a/fs/direct-io.c b/fs/direct-io.c
>> >> > index 1a69efd..353ac52 100644
>> >> > --- a/fs/direct-io.c
>> >> > +++ b/fs/direct-io.c
>> >> > @@ -152,8 +152,28 @@ struct dio_zero_block {
>> >> >  	atomic_t	ref;	/* reference count */
>> >> >  };
>> >> >
>> >> > -static DEFINE_SPINLOCK(dio_zero_block_lock);
>> >> > -static LIST_HEAD(dio_zero_block_list);
>> >> > +#define DIO_ZERO_BLOCK_NR	37LL
>> >>
>> >> I'm always curious to know how these numbers are derived.  Why 37?
>> >
>> > It's a prime number large enough to give enough lists to minimise
>> > contention whilst providing decent distribution for 8 byte aligned
>> > addresses with low overhead. XFS uses the same sort of waitqueue
>> > hashing for global IO completion wait queues used by truncation
>> > and inode eviction (see xfs_ioend_wait()).
>> >
>> > Seemed reasonable (and simple!) just to copy that design pattern
>> > for another global IO completion wait queue....
>>
>> OK.  I just had our performance team record some statistics for me on an
>> unmodified kernel during an OLTP-type workload.  I've attached the
>> systemtap script that I had them run.
>> I wanted to see just how common
>> the sub-page-block zeroing was, and I was frightened to find that, in a
>> 10 minute period, over 1.2 million calls were recorded.  If we're
>> lucky, my script is buggy.  Please give it a look-see.
>
> Well, it's just checking how many blocks are candidates for zeroing
> inside the dio_zero_block() function call. i.e. the function gets
> called on every newly allocated block at the start of an IO. Your
> result implies that there were 1.2 million IOs requiring allocation
> in ten minutes, because the next check in the dio_zero_block():

It's still surprising to me that the database log wasn't preallocated.
Perhaps they just use fallocate, now.

>         dio_blocks_per_fs_block = 1 << dio->blkfactor;
>         this_chunk_blocks = dio->block_in_file & (dio_blocks_per_fs_block - 1);
>
>         if (!this_chunk_blocks)
>                 return;
>
> determines if the IO is unaligned and zeroing is really necessary or
> not. Your script needs to take this into account, not just count the
> number of times the function is called with a new buffer.

Yeah, I can't believe I missed that.  FWIW, I was told that the
database log needs to force out commits of various sizes, so it can't
always issue a fixed-size, aligned I/O.  Anyway, I'll have them re-run
the test with the attached script.  Thanks for pointing out this obvious
stupidity.  ;-)

Dave, can you CC me and akpm on your next patch posting?  The dio
changes typically trickle in through Andrew's tree.

Cheers,
Jeff

#! /usr/bin/env stap
#
# This file is free software. You can redistribute it and/or modify it under
# the terms of the GNU General Public License (GPL); either version 2, or (at
# your option) any later version.

global zeroes = 0
global start_time = 0

probe kernel.function("dio_zero_block") {
	# Mask for the BH_New bit in buffer_head->b_state.
	BH_New = 1 << 6;
	dio_blocks_per_fs_block = 1 << $dio->blkfactor;
	this_chunk_blocks = $dio->block_in_file & (dio_blocks_per_fs_block - 1);

	# Count only the calls that get past the early returns in
	# dio_zero_block(): the dio block is smaller than the fs block, the
	# buffer is newly allocated (BH_New set), and the offset is not
	# aligned to a filesystem block.
	if ($dio->blkfactor != 0 &&
	    ($dio->map_bh->b_state & BH_New) &&
	    this_chunk_blocks != 0) {
		zeroes++;
	}
}

probe begin {
	start_time = gettimeofday_s();
}

probe end {
	printf("%d zeroes performed in %d seconds\n",
	       zeroes, gettimeofday_s() - start_time);
}
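
For anyone following along, here is a minimal userspace sketch in C of the alignment check Dave quoted from dio_zero_block(). The names struct dio_sketch and chunk_blocks() are made up for illustration and are not kernel API; only the two fields and the mask arithmetic come from the quoted code.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the two struct dio fields the check reads.
 * This is an illustration only, not the kernel's struct dio.
 */
struct dio_sketch {
	unsigned blkfactor;               /* log2(fs block size / dio block size) */
	unsigned long long block_in_file; /* current offset, in dio blocks */
};

/*
 * Return how many dio blocks into its filesystem block the current
 * offset sits. A result of 0 means the offset is aligned to a
 * filesystem block, so no zeroing would be needed.
 */
static unsigned chunk_blocks(const struct dio_sketch *dio)
{
	unsigned dio_blocks_per_fs_block;

	/* Keep the shift below well defined for a 32-bit unsigned. */
	assert(dio->blkfactor < 32);

	dio_blocks_per_fs_block = 1U << dio->blkfactor;
	return (unsigned)(dio->block_in_file & (dio_blocks_per_fs_block - 1));
}
```

Assuming 4096-byte filesystem blocks over 512-byte dio blocks (blkfactor of 3, so 8 dio blocks per filesystem block), block_in_file = 16 gives 0 (aligned, nothing to zero), while block_in_file = 19 gives 3 (unaligned, a candidate for zeroing).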