Hi!

On Tue 12-01-21 19:45:06, Xiaoguang Wang wrote:
> I use io_uring to evaluate ext4 randread performance (direct io), and
> observed obvious overhead in jbd2_transaction_committed():
>
> Samples: 124K of event 'cycles:ppp', Event count (approx.): 80630088951
>   Overhead  Command          Shared Object      Symbol
>     7.02%   io_uring-sq-per  [kernel.kallsyms]  [k] jbd2_transaction_committed

Hum, that's quite a bit. Likely due to cacheline contention on
j_state_lock. Thanks for reporting this!

> The code:
>
>	/*
>	 * Writes that span EOF might trigger an I/O size update on completion,
>	 * so consider them to be dirty for the purpose of O_DSYNC, even if
>	 * there are no other metadata changes being made or pending.
>	 */
>	iomap->flags = 0;
>	if (ext4_inode_datasync_dirty(inode) ||
>	    offset + length > i_size_read(inode))
>		iomap->flags |= IOMAP_F_DIRTY;
>
> ext4_inode_datasync_dirty() calls jbd2_transaction_committed(). Sorry, I
> haven't spent much time learning the iomap code yet, so just a quick
> question here: do we need to call ext4_inode_datasync_dirty() for a read
> operation?

Strictly speaking, we don't need to know the value of IOMAP_F_DIRTY in
that path (or in any read path, for that matter), but I'm somewhat
reluctant to optimize out the setting of this flag because some user
might later start to depend on it being set correctly. (For reference,
a sketch of what that skip would look like is appended below.)

> If we must call ext4_inode_datasync_dirty() for a read operation, can we
> improve jbd2_transaction_committed() a bit? For example, have a quick
> check between inode->i_datasync_tid and j_commit_sequence: if
> inode->i_datasync_tid is less than or equal to j_commit_sequence, we
> don't need to call jbd2_transaction_committed() at all?

Well, the modification would belong directly in
jbd2_transaction_committed(). Using j_commit_sequence is somewhat awkward
since TIDs can wrap around, and so very old TIDs can suddenly start to
give false positives, leading to strange results (this was one of the
motivations for switching jbd2_transaction_committed() to a comparison
for equality). But it could certainly be done; see the fast-path sketch
appended below.

But j_state_lock is a scalability bottleneck in other cases as well. So
what I rather have in mind is that we could change transactions to be
RCU freed, and then the majority of places where we use
read_lock(&journal->j_state_lock) could be switched to plain RCU, which
should significantly reduce contention on the lock (a rough sketch of
this is appended below as well).

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
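
For reference, the read-path skip discussed above would amount to
something like the following (hypothetical and untested; 'flags' is
assumed to be the iomap flags argument passed into ext4's ->iomap_begin,
so IOMAP_WRITE distinguishes write mappings):

	/*
	 * Hypothetical variant of the quoted hunk: only compute
	 * IOMAP_F_DIRTY when the mapping is for a write, where the
	 * flag can actually matter for O_DSYNC handling.  The risk,
	 * as noted above, is that a future user of IOMAP_F_DIRTY on
	 * the read path would then see the flag unset.
	 */
	iomap->flags = 0;
	if ((flags & IOMAP_WRITE) &&
	    (ext4_inode_datasync_dirty(inode) ||
	     offset + length > i_size_read(inode)))
		iomap->flags |= IOMAP_F_DIRTY;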
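
The fast-path idea could be a minimal change like the sketch below
(untested; tid_geq() is the existing wraparound-aware TID comparison
from include/linux/jbd2.h, and memory-ordering details are glossed
over). Note that it inherits the caveat above: once a stored TID falls
more than 2^31 behind j_commit_sequence, tid_geq() compares the wrong
way, so ancient TIDs would still drop into the locked slow path:

	int jbd2_transaction_committed(journal_t *journal, tid_t tid)
	{
		int ret = 1;

		/*
		 * Lockless fast path: j_commit_sequence only moves
		 * forward, so if it has already reached 'tid', that
		 * transaction has committed and we can avoid taking
		 * j_state_lock.
		 */
		if (tid_geq(READ_ONCE(journal->j_commit_sequence), tid))
			return 1;

		/* Slow path: the equality checks under the lock. */
		read_lock(&journal->j_state_lock);
		if (journal->j_running_transaction &&
		    journal->j_running_transaction->t_tid == tid)
			ret = 0;
		if (journal->j_committing_transaction &&
		    journal->j_committing_transaction->t_tid == tid)
			ret = 0;
		read_unlock(&journal->j_state_lock);
		return ret;
	}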
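
And the RCU direction, purely as an illustration of the shape of the
change (nothing here exists in this form today: it assumes transaction_t
is freed via kfree_rcu() and that j_running_transaction and
j_committing_transaction are __rcu annotated, with the writer side using
rcu_assign_pointer() when installing or retiring transactions):

	int jbd2_transaction_committed(journal_t *journal, tid_t tid)
	{
		transaction_t *t;
		int ret = 1;

		/*
		 * An RCU read section writes nothing shared, unlike
		 * read_lock(&journal->j_state_lock) which bounces the
		 * lock cacheline between CPUs, so the contention
		 * reported above would disappear for this reader.
		 */
		rcu_read_lock();
		t = rcu_dereference(journal->j_running_transaction);
		if (t && t->t_tid == tid)
			ret = 0;
		t = rcu_dereference(journal->j_committing_transaction);
		if (t && t->t_tid == tid)
			ret = 0;
		rcu_read_unlock();
		return ret;
	}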