Re: [PATCH 11/11] DIO: optimize cache misses in the submission path

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Andi Kleen <andi@xxxxxxxxxxxxxx> writes:

> From: Andi Kleen <ak@xxxxxxxxxxxxxxx>
>
> Some investigation of a transaction processing workload showed that
> a major consumer of cycles in __blockdev_direct_IO is the cache miss
> while accessing the block size. This is because it has to walk
> the chain from block_dev to gendisk to queue.
>
> The block size is needed early on to check alignment and sizes.
> It's only done if the check for the inode block size fails.
> But the costly block device state is unconditionally fetched.
>
> - Reorganize the code to only fetch block dev state when actually
> needed.
>
> Then do a prefetch on the block dev early on in the direct IO
> path. This is worth it, because there is substantial code run
> before we actually touch the block dev now.
>
> - I also added some unlikelies to make it clear the compiler
> that block device fetch code is not normally executed.
>
> This gave a small, but measurable improvement on a large database
> benchmark (about 0.3%)
>
> BTW the check code looks somewhat dubious to me: why is the block size
> blk size only checked when the inode size check fails? Can
> someone explain the difference between all these different block
> sizes? Are they cheaper in a dozen?

There are two block sizes, the block size of the file system (typically
PAGE_SHIFT), and the logical block size of the underlying storage.  The
dio blkfactor represents the number of dio blocks in a single fs block.
Alignment to the fs block means that you don't have to do any sub-block
zeroing.  It also means you don't have to do as much math in converting
between dio blocks and fs blocks (big deal, right?).

I bet we could default to using the smaller block size all the time, and
still be able to detect when we don't have to do the sub-block zeroing.
Maybe that would be a good follow-on patch.

> Signed-off-by: Andi Kleen <ak@xxxxxxxxxxxxxxx>
> ---
>  fs/direct-io.c |   47 +++++++++++++++++++++++++++++++++++++----------
>  1 files changed, 37 insertions(+), 10 deletions(-)
>
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 03bcc6f..c424b88 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -1086,8 +1086,8 @@ static inline int drop_refcount(struct dio *dio)
>   * individual fields and will generate much worse code. 
>   * This is important for the whole file.
>   */
> -ssize_t
> -__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
> +static inline ssize_t
> +do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>  	struct block_device *bdev, const struct iovec *iov, loff_t offset, 
>  	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
>  	dio_submit_t submit_io,	int flags)
> @@ -1096,7 +1096,6 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>  	size_t size;
>  	unsigned long addr;
>  	unsigned blkbits = inode->i_blkbits;
> -	unsigned bdev_blkbits = 0;
>  	unsigned blocksize_mask = (1 << blkbits) - 1;
>  	ssize_t retval = -EINVAL;
>  	loff_t end = offset;
> @@ -1109,12 +1108,14 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>  	if (rw & WRITE)
>  		rw = WRITE_ODIRECT;
>  
> -	if (bdev)
> -		bdev_blkbits = blksize_bits(bdev_logical_block_size(bdev));
> +	/* 
> +	 * Avoid references to bdev if not absolutely needed to give
> +	 * the early prefetch in the caller enough time.
> +	 */
>  
> -	if (offset & blocksize_mask) {
> +	if (unlikely(offset & blocksize_mask)) {

You can't make this assumption.  Userspace controls what size/alignment
of blocks to send in.

> @@ -1312,6 +1315,30 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
>  out:
>  	return retval;
>  }
> +
> +ssize_t
> +__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode,
> +	struct block_device *bdev, const struct iovec *iov, loff_t offset, 
> +	unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io,
> +	dio_submit_t submit_io,	int flags)
> +{
> +	/* 
> +	 * The block device state is needed in the end to finally
> +	 * submit everything.  Since it's likely to be cache cold
> +	 * prefetch it here as first thing to hide some of the
> +	 * latency.
> +	 * 
> +	 * Attempt to prefetch the pieces we likely need later.
> +	 */
> +	prefetch(&bdev->bd_disk->part_tbl);
> +	prefetch(bdev->bd_queue);
> +	prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES);
> +
> +	return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset,
> +				     nr_segs, get_block, end_io,
> +				     submit_io, flags);
> +}
> +
>  EXPORT_SYMBOL(__blockdev_direct_IO);

Heh... you broke direct_io_worker out again (kind of).  ;-)

Cheers,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux