Andi Kleen <andi@xxxxxxxxxxxxxx> writes: > From: Andi Kleen <ak@xxxxxxxxxxxxxxx> > > Some investigation of a transaction processing workload showed that > a major consumer of cycles in __blockdev_direct_IO is the cache miss > while accessing the block size. This is because it has to walk > the chain from block_dev to gendisk to queue. > > The block size is needed early on to check alignment and sizes. > It's only done if the check for the inode block size fails. > But the costly block device state is unconditionally fetched. > > - Reorganize the code to only fetch block dev state when actually > needed. > > Then do a prefetch on the block dev early on in the direct IO > path. This is worth it, because there is substantial code run > before we actually touch the block dev now. > > - I also added some unlikelies to make it clear the compiler > that block device fetch code is not normally executed. > > This gave a small, but measurable improvement on a large database > benchmark (about 0.3%) > > BTW the check code looks somewhat dubious to me: why is the block size > blk size only checked when the inode size check fails? Can > someone explain the difference between all these different block > sizes? Are they cheaper in a dozen? There are two block sizes, the block size of the file system (typically PAGE_SHIFT), and the logical block size of the underlying storage. The dio blkfactor represents the number of dio blocks in a single fs block. Alignment to the fs block means that you don't have to do any sub-block zeroing. It also means you don't have to do as much math in converting between dio blocks and fs blocks (big deal, right?). I bet we could default to using the smaller block size all the time, and still be able to detect when we don't have to do the sub-block zeroing. Maybe that would be a good follow-on patch. > Signed-off-by: Andi Kleen <ak@xxxxxxxxxxxxxxx> > --- > fs/direct-io.c | 47 +++++++++++++++++++++++++++++++++++++---------- > 1 files changed, 37 insertions(+), 10 deletions(-) > > diff --git a/fs/direct-io.c b/fs/direct-io.c > index 03bcc6f..c424b88 100644 > --- a/fs/direct-io.c > +++ b/fs/direct-io.c > @@ -1086,8 +1086,8 @@ static inline int drop_refcount(struct dio *dio) > * individual fields and will generate much worse code. > * This is important for the whole file. > */ > -ssize_t > -__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, > +static inline ssize_t > +do_blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, > struct block_device *bdev, const struct iovec *iov, loff_t offset, > unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io, > dio_submit_t submit_io, int flags) > @@ -1096,7 +1096,6 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, > size_t size; > unsigned long addr; > unsigned blkbits = inode->i_blkbits; > - unsigned bdev_blkbits = 0; > unsigned blocksize_mask = (1 << blkbits) - 1; > ssize_t retval = -EINVAL; > loff_t end = offset; > @@ -1109,12 +1108,14 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, > if (rw & WRITE) > rw = WRITE_ODIRECT; > > - if (bdev) > - bdev_blkbits = blksize_bits(bdev_logical_block_size(bdev)); > + /* > + * Avoid references to bdev if not absolutely needed to give > + * the early prefetch in the caller enough time. > + */ > > - if (offset & blocksize_mask) { > + if (unlikely(offset & blocksize_mask)) { You can't make this assumption. Userspace controls what size/alignment of blocks to send in. > @@ -1312,6 +1315,30 @@ __blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, > out: > return retval; > } > + > +ssize_t > +__blockdev_direct_IO(int rw, struct kiocb *iocb, struct inode *inode, > + struct block_device *bdev, const struct iovec *iov, loff_t offset, > + unsigned long nr_segs, get_block_t get_block, dio_iodone_t end_io, > + dio_submit_t submit_io, int flags) > +{ > + /* > + * The block device state is needed in the end to finally > + * submit everything. Since it's likely to be cache cold > + * prefetch it here as first thing to hide some of the > + * latency. > + * > + * Attempt to prefetch the pieces we likely need later. > + */ > + prefetch(&bdev->bd_disk->part_tbl); > + prefetch(bdev->bd_queue); > + prefetch((char *)bdev->bd_queue + SMP_CACHE_BYTES); > + > + return do_blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, > + nr_segs, get_block, end_io, > + submit_io, flags); > +} > + > EXPORT_SYMBOL(__blockdev_direct_IO); Heh... you broke direct_io_worker out again (kind of). ;-) Cheers, Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html