Jens, one of your patches from October 2013 never made it into the
kernel, but would be beneficial for pmem. It helps IOPS by about 15%.

Original patch:
https://lkml.org/lkml/2013/10/24/130

> From Jens Axboe
> Subject [PATCH 05/11] direct-io: only inc/dec inode->i_dio_count for file systems
> Date Thu, 24 Oct 2013 10:25:58 +0100
>
> We don't need truncate protection for block devices, so add a flag
> bypassing this cache line dirtying twice for every IO. This easily
> contributes to 5-10% of the CPU time on high IOPS O_DIRECT testing.

Here are perf top results while running fio to pmem devices using
memcpy with non-temporal load and store instructions:

  20.54%  [pmem]    [k] pmem_do_bvec.isra.6    <the memcpy function>
  10.13%  [kernel]  [k] do_blockdev_direct_IO
   5.93%  [kernel]  [k] inode_dio_done
   4.46%  [kernel]  [k] bio_endio
   3.07%  fio       [.] get_io_u
   2.08%  fio       [.] do_io

Inside do_blockdev_direct_IO (10%), 60% of the time is spent
atomically incrementing i_dio_count:

        │     static inline void atomic_inc(atomic_t *v)
        │     {
        │             asm volatile(LOCK_PREFIX "incl %0"
   0.06 │225:   lock   incl   0x134(%r14)
        │             atomic_inc(&inode->i_dio_count);
        │
        │             retval = 0;
        │             sdio.blkbits = blkbits;
        │             sdio.blkfactor = i_blkbits - blkbits;
        │             sdio.block_in_file = offset >> blkbits;
  60.31 │       mov    -0x1d0(%rbp),%rdx
   0.16 │       mov    %r12d,%ecx
        │              */
        │             atomic_inc(&inode->i_dio_count);
        │
        │             retval = 0;
        │             sdio.blkbits = blkbits;
        │             sdio.blkfactor = i_blkbits - blkbits;
   0.00 │       sub    %r12d,%ebx
        │              * Will be decremented at I/O completion time.
        │              */
        │             atomic_inc(&inode->i_dio_count);

inode_dio_done is taking all of its 5.8% time doing the corresponding
atomic_dec.

So, they're combining for 11.8% of the overall CPU time. The problem
is more atomic contention than cache line dirtying.

Applying your patch (changing the bitmask from 0x04 to 0x08, since
0x04 is taken now) eliminates those instructions from perf top and
improves the high IOPS results by 5 to 15%.
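
For anyone who has not read the original patch, here is a minimal
user-space sketch of the idea as I read the patch description: a flag
lets block-device callers skip the i_dio_count increment and decrement
altogether. This is not the kernel code. The names sketch_inode,
sketch_dio_begin, sketch_dio_end and SKETCH_DIO_SKIP_DIO_COUNT are
hypothetical, and C11 atomics stand in for the kernel's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag bit; 0x08 is used because 0x04 is already taken. */
#define SKETCH_DIO_SKIP_DIO_COUNT 0x08

/* Stand-in for the part of struct inode that matters here. */
struct sketch_inode {
	atomic_int i_dio_count;	/* direct I/Os currently in flight */
};

/* Submission side: take the truncate-protection reference unless told not to. */
static void sketch_dio_begin(struct sketch_inode *inode, int flags)
{
	if (!(flags & SKETCH_DIO_SKIP_DIO_COUNT))
		atomic_fetch_add(&inode->i_dio_count, 1);
}

/* Completion side: drop the reference only if begin took one. */
static void sketch_dio_end(struct sketch_inode *inode, int flags)
{
	if (!(flags & SKETCH_DIO_SKIP_DIO_COUNT))
		atomic_fetch_sub(&inode->i_dio_count, 1);
}
```

A file system would pass flags without the skip bit and keep today's
behaviour. A block device would pass the skip bit, so neither locked
instruction is executed on the I/O path.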

Attr  Copy              Read IOPS  Write IOPS
====  ====              =========  ==========
UC    NT rd,wr          513 K      326 K
      with the patch:   510 K      325 K
WB    NT rd,wr          3.3 M      3.5 M
      with the patch:   3.8 M      3.9 M
WC    NT rd,wr          3.0 M      3.9 M
      with the patch:   3.1 M      4.1 M
WT    NT rd,wr          3.3 M      2.1 M
      with the patch:   3.7 M      3.7 M

(there is some other test environment inconsistency with WT writes -
I don't think this change really helped by 76%)

---
Robert Elliott, HP Server Storage