On Mon 05-08-24 09:34:46, Zhihao Cheng wrote: > From: Zhihao Cheng <chengzhihao1@xxxxxxxxxx> > > The inode reclaiming process(See function prune_icache_sb) collects all > reclaimable inodes and mark them with I_FREEING flag at first, at that > time, other processes will be stuck if they try getting these inodes > (See function find_inode_fast), then the reclaiming process destroy the > inodes by function dispose_list(). Some filesystems(eg. ext4 with > ea_inode feature, ubifs with xattr) may do inode lookup in the inode > evicting callback function, if the inode lookup is operated under the > inode lru traversing context, deadlock problems may happen. > > Case 1: In function ext4_evict_inode(), the ea inode lookup could happen > if ea_inode feature is enabled, the lookup process will be stuck > under the evicting context like this: > > 1. File A has inode i_reg and an ea inode i_ea > 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea > 3. Then, following three processes running like this: > > PA PB > echo 2 > /proc/sys/vm/drop_caches > shrink_slab > prune_dcache_sb > // i_reg is added into lru, lru->i_ea->i_reg > prune_icache_sb > list_lru_walk_one > inode_lru_isolate > i_ea->i_state |= I_FREEING // set inode state > inode_lru_isolate > __iget(i_reg) > spin_unlock(&i_reg->i_lock) > spin_unlock(lru_lock) > rm file A > i_reg->nlink = 0 > iput(i_reg) // i_reg->nlink is 0, do evict > ext4_evict_inode > ext4_xattr_delete_inode > ext4_xattr_inode_dec_ref_all > ext4_xattr_inode_iget > ext4_iget(i_ea->i_ino) > iget_locked > find_inode_fast > __wait_on_freeing_inode(i_ea) ----→ AA deadlock > dispose_list // cannot be executed by prune_icache_sb > wake_up_bit(&i_ea->i_state) > > Case 2: In deleted inode writing function ubifs_jnl_write_inode(), file > deleting process holds BASEHD's wbuf->io_mutex while getting the > xattr inode, which could race with inode reclaiming process(The > reclaiming process could try locking BASEHD's wbuf->io_mutex in > inode evicting function), then an ABBA deadlock problem would > happen as following: > > 1. File A has inode ia and a xattr(with inode ixa), regular file B has > inode ib and a xattr. > 2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa > 3. Then, following three processes running like this: > > PA PB PC > echo 2 > /proc/sys/vm/drop_caches > shrink_slab > prune_dcache_sb > // ib and ia are added into lru, lru->ixa->ib->ia > prune_icache_sb > list_lru_walk_one > inode_lru_isolate > ixa->i_state |= I_FREEING // set inode state > inode_lru_isolate > __iget(ib) > spin_unlock(&ib->i_lock) > spin_unlock(lru_lock) > rm file B > ib->nlink = 0 > rm file A > iput(ia) > ubifs_evict_inode(ia) > ubifs_jnl_delete_inode(ia) > ubifs_jnl_write_inode(ia) > make_reservation(BASEHD) // Lock wbuf->io_mutex > ubifs_iget(ixa->i_ino) > iget_locked > find_inode_fast > __wait_on_freeing_inode(ixa) > | iput(ib) // ib->nlink is 0, do evict > | ubifs_evict_inode > | ubifs_jnl_delete_inode(ib) > ↓ ubifs_jnl_write_inode > ABBA deadlock ←-----make_reservation(BASEHD) > dispose_list // cannot be executed by prune_icache_sb > wake_up_bit(&ixa->i_state) > > Fix it by forbidding inode evicting under the inode lru traversing > context. In details, we import a new inode state flag 'I_LRU_ISOLATING' > to pin inode without holding i_count under the inode lru traversing > context, the inode evicting process will wait until this flag is > cleared from i_state. Thanks for the patch and sorry for not getting to this myself! Let me rephrase the above paragraph a bit for better readability: Fix the possible deadlock by using new inode state flag I_LRU_ISOLATING to pin the inode in memory while inode_lru_isolate() reclaims its pages instead of using ordinary inode reference. This way inode deletion cannot be triggered from inode_lru_isolate() thus avoiding the deadlock. evict() is made to wait for I_LRU_ISOLATING to be cleared before proceeding with inode cleanup. > @@ -488,6 +488,36 @@ static void inode_lru_list_del(struct inode *inode) > this_cpu_dec(nr_unused); > } > > +static void inode_lru_isolating(struct inode *inode) Perhaps call this inode_pin_lru_isolating() > +{ > + BUG_ON(inode->i_state & (I_LRU_ISOLATING | I_FREEING | I_WILL_FREE)); > + inode->i_state |= I_LRU_ISOLATING; > +} > + > +static void inode_lru_finish_isolating(struct inode *inode) And call this inode_unpin_lru_isolating()? Otherwise the patch looks good so feel free to add: Reviewed-by: Jan Kara <jack@xxxxxxx> Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR