Re: [BUG REPORT] potential deadlock in inode evicting under the inode lru traversing context on ext4 and ubifs

Zhihao Cheng <chengzhihao@xxxxxxxxxxxxxxx> · Fri, 19 Jul 2024 11:21:51 +0800

Hi, Jan

在 2024/7/18 21:40, Jan Kara 写道:
On Fri 12-07-24 10:37:08, Theodore Ts'o wrote:
On Fri, Jul 12, 2024 at 02:27:20PM +0800, Zhihao Cheng wrote:
Problem description
===================

The inode reclaiming process(See function prune_icache_sb) collects all
reclaimable inodes and mark them with I_FREEING flag at first, at that
time, other processes will be stuck if they try getting these inodes(See
function find_inode_fast), then the reclaiming process destroy the
inodes by function dispose_list().
Some filesystems(eg. ext4 with ea_inode feature, ubifs with xattr) may
do inode lookup in the inode evicting callback function, if the inode
lookup is operated under the inode lru traversing context, deadlock
problems may happen.

Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
if ea_inode feature is enabled, the lookup process will be stuck under
the evicting context like this:

  1. File A has inode i_reg and an ea inode i_ea
  2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
  3. Then, following three processes running like this:

     PA                              PB
  echo 2 > /proc/sys/vm/drop_caches
   shrink_slab
    prune_dcache_sb
    // i_reg is added into lru, lru->i_ea->i_reg
    prune_icache_sb
     list_lru_walk_one
      inode_lru_isolate
       i_ea->i_state |= I_FREEING // set inode state
       i_ea->i_state |= I_FREEING // set inode state

Um, I don't see how this can happen.  If the ea_inode is in use,
i_count will be greater than zero, and hence the inode will never be
go down the rest of the path in inode_lru_inode():

	if (atomic_read(&inode->i_count) ||
	    ...) {
		list_lru_isolate(lru, &inode->i_lru);
		spin_unlock(&inode->i_lock);
		this_cpu_dec(nr_unused);
		return LRU_REMOVED;
	}

Do you have an actual reproduer which triggers this?  Or would this
happen be any chance something that was dreamed up with DEPT?

No, it looks like a real problem and I agree with the analysis. We don't
hold ea_inode reference (i.e., ea_inode->i_count) from a normal inode. The
normal inode just owns that that special on-disk xattr reference. Standard
inode references are acquired and dropped as needed.

And this is exactly the problem: ext4_xattr_inode_dec_ref_all() called from
evict() needs to lookup the ea_inode and iget() it. So if we are processing
a list of inodes to dispose, all inodes have I_FREEING bit already set and
if ea_inode and its parent normal inode are both in the list, then the
evict()->ext4_xattr_inode_dec_ref_all()->iget() will deadlock.

Yes, absolutely right.

Normally we don't hit this path because LRU list walk is not handling
inodes with 0 link count. But a race with unlink can make that happen with
iput() from inode_lru_isolate().

Another reason is that mapping_empty(&inode->i_data) is consistent with 
mapping_shrinkable(&inode->i_data) in most cases(CONFIG_HIGHMEM is 
disabled in default on 64bit platforms, so mapping_shrinkable() hardly 
returns true if file inode's mapping has pagecahes), the problem path 
expects that mapping_shrinkable() returns true and mapping_empty() 
returns false.

Do we have any other methods to replace following if-branch without 
invoking __iget()?

        /* 

         * On highmem systems, mapping_shrinkable() permits dropping 

         * page cache in order to free up struct inodes: lowmem might 

         * be under pressure before the cache inside the highmem zone. 

         */ 

        if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) 
{
                __iget(inode);
                ...
                iput(inode); 

                spin_lock(lru_lock); 

                return LRU_RETRY; 

        }

I'm pondering about the best way to fix this. Maybe we could handle the
need for inode pinning in inode_lru_isolate() in a similar way as in
writeback code so that last iput() cannot happen from inode_lru_isolate().
In writeback we use I_SYNC flag to pin the inode and evict() waits for this
flag to clear. I'll probably sleep to it and if I won't find it too
disgusting to live tomorrow, I can code it.


I guess that you may modify like this:

diff --git a/fs/inode.c b/fs/inode.c
index f356fe2ec2b6..5b1a9b23f53f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -457,7 +457,7 @@ EXPORT_SYMBOL(ihold);

 static void __inode_add_lru(struct inode *inode, bool rotate)
 {
-       if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | 
I_WILL_FREE))
+       if (inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | 
I_WILL_FREE | I_PINING))
                return;
        if (atomic_read(&inode->i_count))
                return;
@@ -845,7 +845,7 @@ static enum lru_status inode_lru_isolate(struct 
list_head *item,
         * be under pressure before the cache inside the highmem zone.
         */
        if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
-               __iget(inode);
+               inode->i_state |= I_PINING;
                spin_unlock(&inode->i_lock);
                spin_unlock(lru_lock);
                if (remove_inode_buffers(inode)) {
@@ -857,7 +857,10 @@ static enum lru_status inode_lru_isolate(struct 
list_head *item,
                                __count_vm_events(PGINODESTEAL, reap);
                        mm_account_reclaimed_pages(reap);
                }
-               iput(inode);
+               spin_lock(&inode->i_lock);
+               inode->i_state &= ~I_PINING;
+               wake_up_bit(&inode->i_state, __I_PINING);
+               spin_unlock(&inode->i_lock);
                spin_lock(lru_lock);
                return LRU_RETRY;
        }
@@ -1772,6 +1775,7 @@ static void iput_final(struct inode *inode)
                return;
        }

+       inode_wait_for_pining(inode);
        state = inode->i_state;
        if (!drop) {
                WRITE_ONCE(inode->i_state, state | I_WILL_FREE);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd34b5755c0b..daf094fff5fe 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2415,6 +2415,8 @@ static inline void kiocb_clone(struct kiocb 
*kiocb, struct kiocb *kiocb_src,
 #define I_DONTCACHE            (1 << 16)
 #define I_SYNC_QUEUED          (1 << 17)
 #define I_PINNING_NETFS_WB     (1 << 18)
+#define __I_PINING             19
+#define I_PINING               (1 << __I_PINING)

 #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
 #define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)

, which means that we will import a new inode state to solve the problem.