Re: [PATCH] vfs: Don't evict inode under the inode lru traversing context

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



在 2024/8/5 23:30, Jan Kara 写道:
On Mon 05-08-24 09:34:46, Zhihao Cheng wrote:
From: Zhihao Cheng <chengzhihao1@xxxxxxxxxx>

The inode reclaiming process(See function prune_icache_sb) collects all
reclaimable inodes and mark them with I_FREEING flag at first, at that
time, other processes will be stuck if they try getting these inodes
(See function find_inode_fast), then the reclaiming process destroy the
inodes by function dispose_list(). Some filesystems(eg. ext4 with
ea_inode feature, ubifs with xattr) may do inode lookup in the inode
evicting callback function, if the inode lookup is operated under the
inode lru traversing context, deadlock problems may happen.

Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
         if ea_inode feature is enabled, the lookup process will be stuck
	under the evicting context like this:

  1. File A has inode i_reg and an ea inode i_ea
  2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
  3. Then, following three processes running like this:

     PA                              PB
  echo 2 > /proc/sys/vm/drop_caches
   shrink_slab
    prune_dcache_sb
    // i_reg is added into lru, lru->i_ea->i_reg
    prune_icache_sb
     list_lru_walk_one
      inode_lru_isolate
       i_ea->i_state |= I_FREEING // set inode state
      inode_lru_isolate
       __iget(i_reg)
       spin_unlock(&i_reg->i_lock)
       spin_unlock(lru_lock)
                                      rm file A
                                       i_reg->nlink = 0
       iput(i_reg) // i_reg->nlink is 0, do evict
        ext4_evict_inode
         ext4_xattr_delete_inode
          ext4_xattr_inode_dec_ref_all
           ext4_xattr_inode_iget
            ext4_iget(i_ea->i_ino)
             iget_locked
              find_inode_fast
               __wait_on_freeing_inode(i_ea) ----→ AA deadlock
     dispose_list // cannot be executed by prune_icache_sb
      wake_up_bit(&i_ea->i_state)

Case 2: In deleted inode writing function ubifs_jnl_write_inode(), file
         deleting process holds BASEHD's wbuf->io_mutex while getting the
	xattr inode, which could race with inode reclaiming process(The
         reclaiming process could try locking BASEHD's wbuf->io_mutex in
	inode evicting function), then an ABBA deadlock problem would
	happen as following:

  1. File A has inode ia and a xattr(with inode ixa), regular file B has
     inode ib and a xattr.
  2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa
  3. Then, following three processes running like this:

         PA                PB                        PC
                 echo 2 > /proc/sys/vm/drop_caches
                  shrink_slab
                   prune_dcache_sb
                   // ib and ia are added into lru, lru->ixa->ib->ia
                   prune_icache_sb
                    list_lru_walk_one
                     inode_lru_isolate
                      ixa->i_state |= I_FREEING // set inode state
                     inode_lru_isolate
                      __iget(ib)
                      spin_unlock(&ib->i_lock)
                      spin_unlock(lru_lock)
                                                    rm file B
                                                     ib->nlink = 0
  rm file A
   iput(ia)
    ubifs_evict_inode(ia)
     ubifs_jnl_delete_inode(ia)
      ubifs_jnl_write_inode(ia)
       make_reservation(BASEHD) // Lock wbuf->io_mutex
       ubifs_iget(ixa->i_ino)
        iget_locked
         find_inode_fast
          __wait_on_freeing_inode(ixa)
           |          iput(ib) // ib->nlink is 0, do evict
           |           ubifs_evict_inode
           |            ubifs_jnl_delete_inode(ib)
           ↓             ubifs_jnl_write_inode
      ABBA deadlock ←-----make_reservation(BASEHD)
                    dispose_list // cannot be executed by prune_icache_sb
                     wake_up_bit(&ixa->i_state)

Fix it by forbidding inode evicting under the inode lru traversing
context. In details, we import a new inode state flag 'I_LRU_ISOLATING'
to pin inode without holding i_count under the inode lru traversing
context, the inode evicting process will wait until this flag is
cleared from i_state.

Thanks for the patch and sorry for not getting to this myself!  Let me
rephrase the above paragraph a bit for better readability:

Fix the possible deadlock by using new inode state flag I_LRU_ISOLATING to
pin the inode in memory while inode_lru_isolate() reclaims its pages
instead of using ordinary inode reference. This way inode deletion cannot
be triggered from inode_lru_isolate() thus avoiding the deadlock. evict()
is made to wait for I_LRU_ISOLATING to be cleared before proceeding with
inode cleanup.


Looks clearer, thanks for the rephrasing, you're really nice.
@@ -488,6 +488,36 @@ static void inode_lru_list_del(struct inode *inode)
  		this_cpu_dec(nr_unused);
  }
+static void inode_lru_isolating(struct inode *inode)

Perhaps call this inode_pin_lru_isolating()

Adopt. Will change in v2.

+{
+	BUG_ON(inode->i_state & (I_LRU_ISOLATING | I_FREEING | I_WILL_FREE));
+	inode->i_state |= I_LRU_ISOLATING;
+}
+
+static void inode_lru_finish_isolating(struct inode *inode)

And call this inode_unpin_lru_isolating()?

Adopt. Will change in v2.

Otherwise the patch looks good so feel free to add:

Reviewed-by: Jan Kara <jack@xxxxxxx>

								Honza






[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux