Re: [PATCH v2 2/5] ext4: for committing inode, make ext4_fc_track_inode wait

Jan Kara <jack@xxxxxxx> · Mon, 21 Mar 2022 15:06:17 +0100

Sorry for delayed reply, I've got dragged into other things and somewhat
forgot about this...

On Fri 11-03-22 00:25:48, harshad shirwadkar wrote:
> Hmm, after removing ext4_fc_track_inode() from ext4_journal_start(), I
> see one deadlock - there are some places in code where
> ext4_mark_inode_dirty gets called while holding i_data_sem. Commit
> path requires i_data_sem to commit inode data (via ext4_map_blocks).
> So, if an inode is being committed, ext4_mark_inode_dirty may start
> waiting for the inode to commit while holding i_data_sem and the
> commit path may wait on i_data_sem.

Indeed, that is a problem.

> The right way to fix this is to
> call ext4_fc_track_inode in such places before acquiring i_data_sem in
> write mode. But that would mean we sprinkle code with more
> ext4_fc_track_inode() calls which is something that I preferably
> wanted to avoid.

So I agree calling ext4_fc_track_inode() from ext4_reserve_inode_write()
isn't going to fly as is. We need to find a better way of doing things.

> This makes me wonder though, for fast commits, is it a terrible idea
> to extend the meaning of ext4_journal_start() from "start a new
> handle" to "start a new handle with an intention to modify the passed
> inode"? Most of the handles modify only one inode, and for other
> places where we do modify multiple inodes, ext4_reserve_inode_write()
> would ensure that those inodes are tracked as well.

Well but to avoid deadlocks like you've described above you would have to
start tracking inode with explicit ext4_fc_track_inode() calls before
grabbing i_data_sem. ext4_reserve_inode_write() would be only useful for an
assertion check that indeed the inode is already tracked.

> All of the
> existing places where inode gets modified after grabbing i_data_sem,
> i_data_sem is grabbed only after starting the handle. This would take
> care of the deadlock mentioned above and similar deadlocks. Another
> advantage with doing this is that developers wouldn't need to worry
> about adding explicit ext4_fc_track_inode() calls for future changes.

Well, at least for the simple case where only one inode is modified. But I
agree that's a majority.

> If we decide to do this, we would need to do a thorough code review to
> ensure that the above rule is followed everywhere. But note that
> ext4_fc_track_inode() is idempotent, so it doesn't matter if this
> function gets called multiple times in the same handle. So to avoid
> breaking fast commits, we can be super careful and in the first
> version, we can have ext4_fc_track_range() calls in
> ext4_reserve_inode_dirty(), ext4_journal_start(), inline.c and in
> handles where i_data_sem gets acquired in write mode. We can then
> carefully evaluate each code path and remove redundant
> ext4_fc_track_range() calls.

As I wrote above, if we go this path, I'd be for ext4_fc_track_inode()
calls in ext4_journal_start() and then adding explicit calls to
ext4_fc_track_inode() where additionally needed and have only assertion
checks in ext4_reserve_inode_dirty() and other places which modify inode
metadata, to catch places which need explicit calls to
ext4_fc_track_inode(). That way we won't have any hidden deadlocks in the
code waiting to happen.

I have also another proposal for consideration how we could handle this.
Mostly branstorming now: We could also drop the need for fastcommit code to
acquire i_data_sem during commit. We could use just information in extent
status tree to provide block mapping for the fastcommit code (that does not
need i_data_sem). The only thing is we'd need to make sure modified extents
from the status tree are not evicted from memory before the appropriate
transaction commits so that they are available for appropriate fastcommit -
for that we'd probably need to add TID of the transaction that last
modified an extent and add check into the shrinker to avoid shrinking
uncommitted extents. As a bonus, we could now add to fastcommit only
extents with appropriate TID and thus save on extent logging for sparsely
modified inode.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR