On Tue, Nov 09, 2010 at 04:04:41PM -0500, Jeff Moyer wrote:
> Dave Chinner <david@xxxxxxxxxxxxx> writes:
>
> > On Mon, Nov 08, 2010 at 10:36:06AM -0500, Jeff Moyer wrote:
> >> Dave Chinner <david@xxxxxxxxxxxxx> writes:
> >>
> >> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> >> >
> >> > To avoid concerns that a single list and lock tracking the unaligned
> >> > IOs will not scale appropriately, create multiple lists and locks
> >> > and chose them by hashing the unaligned block being zeroed.
> >> >
> >> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> >> > ---
> >> >  fs/direct-io.c |   49 ++++++++++++++++++++++++++++++++++++-------------
> >> >  1 files changed, 36 insertions(+), 13 deletions(-)
> >> >
> >> > diff --git a/fs/direct-io.c b/fs/direct-io.c
> >> > index 1a69efd..353ac52 100644
> >> > --- a/fs/direct-io.c
> >> > +++ b/fs/direct-io.c
> >> > @@ -152,8 +152,28 @@ struct dio_zero_block {
> >> >  	atomic_t	ref;		/* reference count */
> >> >  };
> >> >
> >> > -static DEFINE_SPINLOCK(dio_zero_block_lock);
> >> > -static LIST_HEAD(dio_zero_block_list);
> >> > +#define DIO_ZERO_BLOCK_NR	37LL
> >>
> >> I'm always curious to know how these numbers are derived.  Why 37?
> >
> > It's a prime number large enough to give enough lists to minimise
> > contention whilst providing decent distribution for 8 byte aligned
> > addresses with low overhead. XFS uses the same sort of waitqueue
> > hashing for global IO completion wait queues used by truncation
> > and inode eviction (see xfs_ioend_wait()).
> >
> > Seemed reasonable (and simple!) just to copy that design pattern
> > for another global IO completion wait queue....
>
> OK.  I just had our performance team record some statistics for me on an
> unmodified kernel during an OLTP-type workload.  I've attached the
> systemtap script that I had them run.  I wanted to see just how common
> the sub-page-block zeroing was, and I was frightened to find that, in a
> 10 minute period, over 1.2 million calls were recorded.
> If we're lucky, my script is buggy.  Please give it a look-see.

Well, it's just checking how many blocks are candidates for zeroing
inside the dio_zero_block() function call. i.e. the function gets called
on every newly allocated block at the start of an IO. Your result
implies that there were 1.2 million IOs requiring allocation in ten
minutes, because the next check in dio_zero_block():

	dio_blocks_per_fs_block = 1 << dio->blkfactor;
	this_chunk_blocks = dio->block_in_file & (dio_blocks_per_fs_block - 1);

	if (!this_chunk_blocks)
		return;

determines if the IO is unaligned and zeroing is really necessary or
not. Your script needs to take this into account, not just count the
number of times the function is called with a new buffer.

> I'm all ears for next steps.  We can check to see how deep the hash
> chains get.  We could also ask the folks at Intel to run this through
> their database testing rig to get a quantification of the overhead.
>
> What do you think?

Let's run a fixed script first - if databases are really doing so much
unaligned sub-block IO, then they need to be fixed as a matter of major
priority because they are doing far more IO than they need to be....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html