Re: [PATCH v2 2/5] ext4: for committing inode, make ext4_fc_track_inode wait

harshad shirwadkar <harshadshirwadkar@xxxxxxxxx> · Mon, 18 Apr 2022 14:02:46 -0700

Sorry for the late reply, I was on vacation and getting to this just now.

On Mon, 21 Mar 2022 at 07:06, Jan Kara <jack@xxxxxxx> wrote:
>
> Sorry for delayed reply, I've got dragged into other things and somewhat
> forgot about this...
>
> On Fri 11-03-22 00:25:48, harshad shirwadkar wrote:
> > Hmm, after removing ext4_fc_track_inode() from ext4_journal_start(), I
> > see one deadlock - there are some places in code where
> > ext4_mark_inode_dirty gets called while holding i_data_sem. Commit
> > path requires i_data_sem to commit inode data (via ext4_map_blocks).
> > So, if an inode is being committed, ext4_mark_inode_dirty may start
> > waiting for the inode to commit while holding i_data_sem and the
> > commit path may wait on i_data_sem.
>
> Indeed, that is a problem.
>
> > The right way to fix this is to
> > call ext4_fc_track_inode in such places before acquiring i_data_sem in
> > write mode. But that would mean we sprinkle code with more
> > ext4_fc_track_inode() calls which is something that I preferably
> > wanted to avoid.
>
> So I agree calling ext4_fc_track_inode() from ext4_reserve_inode_write()
> isn't going to fly as is. We need to find a better way of doing things.
>
> > This makes me wonder though, for fast commits, is it a terrible idea
> > to extend the meaning of ext4_journal_start() from "start a new
> > handle" to "start a new handle with an intention to modify the passed
> > inode"? Most of the handles modify only one inode, and for other
> > places where we do modify multiple inodes, ext4_reserve_inode_write()
> > would ensure that those inodes are tracked as well.
>
> Well but to avoid deadlocks like you've described above you would have to
> start tracking inode with explicit ext4_fc_track_inode() calls before
> grabbing i_data_sem. ext4_reserve_inode_write() would be only useful for an
> assertion check that indeed the inode is already tracked.
>
> > All of the
> > existing places where inode gets modified after grabbing i_data_sem,
> > i_data_sem is grabbed only after starting the handle. This would take
> > care of the deadlock mentioned above and similar deadlocks. Another
> > advantage with doing this is that developers wouldn't need to worry
> > about adding explicit ext4_fc_track_inode() calls for future changes.
>
> Well, at least for the simple case where only one inode is modified. But I
> agree that's a majority.
>
> > If we decide to do this, we would need to do a thorough code review to
> > ensure that the above rule is followed everywhere. But note that
> > ext4_fc_track_inode() is idempotent, so it doesn't matter if this
> > function gets called multiple times in the same handle. So to avoid
> > breaking fast commits, we can be super careful and in the first
> > version, we can have ext4_fc_track_range() calls in
> > ext4_reserve_inode_dirty(), ext4_journal_start(), inline.c and in
> > handles where i_data_sem gets acquired in write mode. We can then
> > carefully evaluate each code path and remove redundant
> > ext4_fc_track_range() calls.
>
> As I wrote above, if we go this path, I'd be for ext4_fc_track_inode()
> calls in ext4_journal_start() and then adding explicit calls to
> ext4_fc_track_inode() where additionally needed and have only assertion
> checks in ext4_reserve_inode_dirty() and other places which modify inode
> metadata, to catch places which need explicit calls to
> ext4_fc_track_inode(). That way we won't have any hidden deadlocks in the
> code waiting to happen.
Okay sounds good, so based on your response, I'll rework the patch
series to make following changes:
1) Add ext4_fc_track_inode() calls in ext4_journal_start()
2) Add ext4_fc_track_inode calls in places where more than one inode
are changed in a handle and / or i_data_sem is being acquired.
3) Add an assertion in ext4_reserve_inode_dirty() to check if the
inode on which write is being reserved is already being tracked.

Does that sound good?
>
> I have also another proposal for consideration how we could handle this.
> Mostly branstorming now: We could also drop the need for fastcommit code to
> acquire i_data_sem during commit. We could use just information in extent
> status tree to provide block mapping for the fastcommit code (that does not
> need i_data_sem). The only thing is we'd need to make sure modified extents
> from the status tree are not evicted from memory before the appropriate
> transaction commits so that they are available for appropriate fastcommit -
> for that we'd probably need to add TID of the transaction that last
> modified an extent and add check into the shrinker to avoid shrinking
> uncommitted extents. As a bonus, we could now add to fastcommit only
> extents with appropriate TID and thus save on extent logging for sparsely
> modified inode.
Yeah, I like this idea. I'll add that as a TODO in the code just so
that we don't lose it.

Thanks,
Harshad
>
>                                                                 Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR