Sorry for the late reply, I was on vacation and getting to this just now. On Mon, 21 Mar 2022 at 07:06, Jan Kara <jack@xxxxxxx> wrote: > > Sorry for delayed reply, I've got dragged into other things and somewhat > forgot about this... > > On Fri 11-03-22 00:25:48, harshad shirwadkar wrote: > > Hmm, after removing ext4_fc_track_inode() from ext4_journal_start(), I > > see one deadlock - there are some places in code where > > ext4_mark_inode_dirty gets called while holding i_data_sem. Commit > > path requires i_data_sem to commit inode data (via ext4_map_blocks). > > So, if an inode is being committed, ext4_mark_inode_dirty may start > > waiting for the inode to commit while holding i_data_sem and the > > commit path may wait on i_data_sem. > > Indeed, that is a problem. > > > The right way to fix this is to > > call ext4_fc_track_inode in such places before acquiring i_data_sem in > > write mode. But that would mean we sprinkle code with more > > ext4_fc_track_inode() calls which is something that I preferably > > wanted to avoid. > > So I agree calling ext4_fc_track_inode() from ext4_reserve_inode_write() > isn't going to fly as is. We need to find a better way of doing things. > > > This makes me wonder though, for fast commits, is it a terrible idea > > to extend the meaning of ext4_journal_start() from "start a new > > handle" to "start a new handle with an intention to modify the passed > > inode"? Most of the handles modify only one inode, and for other > > places where we do modify multiple inodes, ext4_reserve_inode_write() > > would ensure that those inodes are tracked as well. > > Well but to avoid deadlocks like you've described above you would have to > start tracking inode with explicit ext4_fc_track_inode() calls before > grabbing i_data_sem. ext4_reserve_inode_write() would be only useful for an > assertion check that indeed the inode is already tracked. > > > All of the > > existing places where inode gets modified after grabbing i_data_sem, > > i_data_sem is grabbed only after starting the handle. This would take > > care of the deadlock mentioned above and similar deadlocks. Another > > advantage with doing this is that developers wouldn't need to worry > > about adding explicit ext4_fc_track_inode() calls for future changes. > > Well, at least for the simple case where only one inode is modified. But I > agree that's a majority. > > > If we decide to do this, we would need to do a thorough code review to > > ensure that the above rule is followed everywhere. But note that > > ext4_fc_track_inode() is idempotent, so it doesn't matter if this > > function gets called multiple times in the same handle. So to avoid > > breaking fast commits, we can be super careful and in the first > > version, we can have ext4_fc_track_range() calls in > > ext4_reserve_inode_dirty(), ext4_journal_start(), inline.c and in > > handles where i_data_sem gets acquired in write mode. We can then > > carefully evaluate each code path and remove redundant > > ext4_fc_track_range() calls. > > As I wrote above, if we go this path, I'd be for ext4_fc_track_inode() > calls in ext4_journal_start() and then adding explicit calls to > ext4_fc_track_inode() where additionally needed and have only assertion > checks in ext4_reserve_inode_dirty() and other places which modify inode > metadata, to catch places which need explicit calls to > ext4_fc_track_inode(). That way we won't have any hidden deadlocks in the > code waiting to happen. Okay sounds good, so based on your response, I'll rework the patch series to make following changes: 1) Add ext4_fc_track_inode() calls in ext4_journal_start() 2) Add ext4_fc_track_inode calls in places where more than one inode are changed in a handle and / or i_data_sem is being acquired. 3) Add an assertion in ext4_reserve_inode_dirty() to check if the inode on which write is being reserved is already being tracked. Does that sound good? > > I have also another proposal for consideration how we could handle this. > Mostly branstorming now: We could also drop the need for fastcommit code to > acquire i_data_sem during commit. We could use just information in extent > status tree to provide block mapping for the fastcommit code (that does not > need i_data_sem). The only thing is we'd need to make sure modified extents > from the status tree are not evicted from memory before the appropriate > transaction commits so that they are available for appropriate fastcommit - > for that we'd probably need to add TID of the transaction that last > modified an extent and add check into the shrinker to avoid shrinking > uncommitted extents. As a bonus, we could now add to fastcommit only > extents with appropriate TID and thus save on extent logging for sparsely > modified inode. Yeah, I like this idea. I'll add that as a TODO in the code just so that we don't lose it. Thanks, Harshad > > Honza > -- > Jan Kara <jack@xxxxxxxx> > SUSE Labs, CR