Hi!

On Tue 12-01-21 19:45:06, Xiaoguang Wang wrote:
> I use io_uring to evaluate ext4 randread performance (direct io), and
> observed obvious overhead in jbd2_transaction_committed():
>
> Samples: 124K of event 'cycles:ppp', Event count (approx.): 80630088951
>   Overhead  Command          Shared Object      Symbol
>     7.02%   io_uring-sq-per  [kernel.kallsyms]  [k] jbd2_transaction_committed

Hum, that's quite a bit. Likely due to cacheline contention on
j_state_lock. Thanks for reporting this!

> The code:
>
>	/*
>	 * Writes that span EOF might trigger an I/O size update on completion,
>	 * so consider them to be dirty for the purpose of O_DSYNC, even if
>	 * there are no other metadata changes being made or pending.
>	 */
>	iomap->flags = 0;
>	if (ext4_inode_datasync_dirty(inode) ||
>	    offset + length > i_size_read(inode))
>		iomap->flags |= IOMAP_F_DIRTY;
>
> ext4_inode_datasync_dirty() calls jbd2_transaction_committed(). Sorry, I
> haven't spent much time learning the iomap code yet, so just a quick
> question here: do we need to call ext4_inode_datasync_dirty() for a read
> operation?

Strictly speaking, we don't need to know the value of IOMAP_F_DIRTY in
that path (or in any read path, for that matter), but I'm somewhat
reluctant to optimize out the setting of this flag because some user
might later start to depend on it being set correctly. (For reference,
a sketch of what that skip would look like is appended below.)

> If we must call ext4_inode_datasync_dirty() for a read operation, can we
> improve jbd2_transaction_committed() a bit? For example, have a quick
> check between inode->i_datasync_tid and j_commit_sequence: if
> inode->i_datasync_tid is less than or equal to j_commit_sequence, we
> don't need to call jbd2_transaction_committed() at all?

Well, the modification would belong directly in
jbd2_transaction_committed(). Using j_commit_sequence is somewhat awkward
since TIDs can wrap around, and so very old TIDs can suddenly start to
give false positives, leading to strange results (this was one of the
motivations for switching jbd2_transaction_committed() to a comparison
for equality). But it could certainly be done; see the fast-path sketch
appended below.

But j_state_lock is a scalability bottleneck in other cases as well. So
what I rather have in mind is that we could change transactions to be
RCU freed, and then the majority of places where we use
read_lock(&journal->j_state_lock) could be switched to plain RCU, which
should significantly reduce contention on the lock (a rough sketch of
this is appended below as well).

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
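
For reference, the read-path skip discussed above would amount to
something like the following (hypothetical and untested; 'flags' is
assumed to be the iomap flags argument passed into ext4's ->iomap_begin,
so IOMAP_WRITE distinguishes write mappings):

	/*
	 * Hypothetical variant of the quoted hunk: only compute
	 * IOMAP_F_DIRTY when the mapping is for a write, where the
	 * flag can actually matter for O_DSYNC handling.  The risk,
	 * as noted above, is that a future user of IOMAP_F_DIRTY on
	 * the read path would then see the flag unset.
	 */
	iomap->flags = 0;
	if ((flags & IOMAP_WRITE) &&
	    (ext4_inode_datasync_dirty(inode) ||
	     offset + length > i_size_read(inode)))
		iomap->flags |= IOMAP_F_DIRTY;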
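
The fast-path idea could be a minimal change like the sketch below
(untested; tid_geq() is the existing wraparound-aware TID comparison
from include/linux/jbd2.h, and memory-ordering details are glossed
over). Note that it inherits the caveat above: once a stored TID falls
more than 2^31 behind j_commit_sequence, tid_geq() compares the wrong
way, so ancient TIDs would still drop into the locked slow path:

	int jbd2_transaction_committed(journal_t *journal, tid_t tid)
	{
		int ret = 1;

		/*
		 * Lockless fast path: j_commit_sequence only moves
		 * forward, so if it has already reached 'tid', that
		 * transaction has committed and we can avoid taking
		 * j_state_lock.
		 */
		if (tid_geq(READ_ONCE(journal->j_commit_sequence), tid))
			return 1;

		/* Slow path: the equality checks under the lock. */
		read_lock(&journal->j_state_lock);
		if (journal->j_running_transaction &&
		    journal->j_running_transaction->t_tid == tid)
			ret = 0;
		if (journal->j_committing_transaction &&
		    journal->j_committing_transaction->t_tid == tid)
			ret = 0;
		read_unlock(&journal->j_state_lock);
		return ret;
	}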
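
And the RCU direction, purely as an illustration of the shape of the
change (nothing here exists in this form today: it assumes transaction_t
is freed via kfree_rcu() and that j_running_transaction and
j_committing_transaction are __rcu annotated, with the writer side using
rcu_assign_pointer() when installing or retiring transactions):

	int jbd2_transaction_committed(journal_t *journal, tid_t tid)
	{
		transaction_t *t;
		int ret = 1;

		/*
		 * An RCU read section writes nothing shared, unlike
		 * read_lock(&journal->j_state_lock) which bounces the
		 * lock cacheline between CPUs, so the contention
		 * reported above would disappear for this reader.
		 */
		rcu_read_lock();
		t = rcu_dereference(journal->j_running_transaction);
		if (t && t->t_tid == tid)
			ret = 0;
		t = rcu_dereference(journal->j_committing_transaction);
		if (t && t->t_tid == tid)
			ret = 0;
		rcu_read_unlock();
		return ret;
	}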