[BUG REPORT] potential deadlock in inode evicting under the inode lru traversing context on ext4 and ubifs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi. Recently, we found a deadlock in inode recliaiming process caused by circular dependence between file inode and file's xattr inode.

Problem description
===================

The inode reclaiming process(See function prune_icache_sb) collects all reclaimable inodes and mark them with I_FREEING flag at first, at that time, other processes will be stuck if they try getting these inodes(See function find_inode_fast), then the reclaiming process destroy the inodes by function dispose_list(). Some filesystems(eg. ext4 with ea_inode feature, ubifs with xattr) may do inode lookup in the inode evicting callback function, if the inode lookup is operated under the inode lru traversing context, deadlock problems may happen.

Case 1: In function ext4_evict_inode(), the ea inode lookup could happen if ea_inode feature is enabled, the lookup process will be stuck under the evicting context like this:

 1. File A has inode i_reg and an ea inode i_ea
 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
 3. Then, following three processes running like this:

    PA                              PB
 echo 2 > /proc/sys/vm/drop_caches
  shrink_slab
   prune_dcache_sb
   // i_reg is added into lru, lru->i_ea->i_reg
   prune_icache_sb
    list_lru_walk_one
     inode_lru_isolate
      i_ea->i_state |= I_FREEING // set inode state
     inode_lru_isolate
      __iget(i_reg)
      spin_unlock(&i_reg->i_lock)
      spin_unlock(lru_lock)
                                     rm file A
                                      i_reg->nlink = 0
      iput(i_reg) // i_reg->nlink is 0, do evict
       ext4_evict_inode
        ext4_xattr_delete_inode
         ext4_xattr_inode_dec_ref_all
      ext4_xattr_inode_iget
           ext4_iget(i_ea->i_ino)
            iget_locked
             find_inode_fast
              __wait_on_freeing_inode(i_ea) ----→ AA deadlock
    dispose_list // cannot be executed by prune_icache_sb
     wake_up_bit(&i_ea->i_state)

Case 2: In deleted inode writing function ubifs_jnl_write_inode(), file deleting process holds BASEHD's wbuf->io_mutex while getting the xattr inode, which could race with inode reclaiming process(The reclaiming process could try locking BASEHD's wbuf->io_mutex in inode evicting function), then an ABBA deadlock problem would happen as following:

 1. File A has inode ia and a xattr(with inode ixa), regular file B has
    inode ib and a xattr.
 2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa
 3. Then, following three processes running like this:

        PA                PB                        PC
                echo 2 > /proc/sys/vm/drop_caches
                 shrink_slab
          prune_dcache_sb
          // ib and ia are added into lru, lru->ixa->ib->ia
                  prune_icache_sb
                   list_lru_walk_one
                    inode_lru_isolate
                     ixa->i_state |= I_FREEING // set inode state
                    inode_lru_isolate
                     __iget(ib)
                     spin_unlock(&ib->i_lock)
                     spin_unlock(lru_lock)
                                                   rm file B
                                                    ib->nlink = 0
 rm file A
  iput(ia)
   ubifs_evict_inode(ia)
    ubifs_jnl_delete_inode(ia)
     ubifs_jnl_write_inode(ia)
      make_reservation(BASEHD) // Lock wbuf->io_mutex
      ubifs_iget(ixa->i_ino)
       iget_locked
        find_inode_fast
         __wait_on_freeing_inode(ixa)
          |          iput(ib) // ib->nlink is 0, do evict
          |           ubifs_evict_inode
          |            ubifs_jnl_delete_inode(ib)
         ↓             ubifs_jnl_write_inode
     ABBA deadlock ←-----make_reservation(BASEHD)
                   dispose_list // cannot be executed by prune_icache_sb
                    wake_up_bit(&ixa->i_state)

Reproducer:
===========

https://bugzilla.kernel.org/show_bug.cgi?id=219022

About solutions
===============

We have thought some solutions, but all of them will import new influence.

Solution 1: Don't cache xattr inode, make drop_inode callback return true for xattr inode. It will make getting xattr slower.
Test code:
    gettimeofday(&s, NULL);
    for (i = 0; i < 10000; ++i)
        if (getxattr("/root/temp/file_a", "user.a", buf, 4098) < 0)
            perror("getxattr");
    gettimeofday(&e, NULL);
printf("cost %ld us\n", 1000000 * (e.tv_sec - s.tv_sec) + e.tv_usec - s.tv_usec);
Result:
ext4:
cost 151068 us // without fix
cost 321594 us // with fix
ubifs:
134125 us // without fix
371023 us // with fix

Solution 2: Don't put xattr inode into lru, which is implemented by holding xattr inode's refcnt until the file inode is evicted, besides that, make drop_inode callback return true for xattr inode. The solution pins xattr inode in memory until the file inode is evicted, file inode won't be evicted if it has pagecahes, specifically:
inode_lru_isolate
if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) { // file inode won't be evicted, so its' xattr inode won't be reclaimed too, which will increase the memory noise.
  __iget(inode);
  if (remove_inode_buffers(inode))
  ...
  iput(inode);
 }

Solution 3: Forbid inode evicting under the inode lru traversing context. Specifically, mark inode with 'I_FREEING' instead of getting its' refcnt to eliminate the inode getting chance in inode_lru_isolate(). The solution is wrong, because the pagecahes may still alive after invalidate_mapping_pages(), so we cannot destroy the file inode after clearing I_WILL_FREE.
diff --git a/fs/inode.c b/fs/inode.c
index 3a41f83a4ba5..c649be22f841 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -843,7 +844,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
      * be under pressure before the cache inside the highmem zone.
      */
     if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data)) {
-        __iget(inode);
+        inode->i_state |= I_WILL_FREE;
         spin_unlock(&inode->i_lock);
         spin_unlock(lru_lock);
         if (remove_inode_buffers(inode)) {
@@ -855,9 +856,9 @@ static enum lru_status inode_lru_isolate(struct list_head *item,
                 __count_vm_events(PGINODESTEAL, reap);
             mm_account_reclaimed_pages(reap);
         }
-        iput(inode);
+        spin_lock(&inode->i_lock);
+        inode->i_state &= ~I_WILL_FREE;
         spin_lock(lru_lock);
-        return LRU_RETRY;
     }

Solution 4: Break the circular dependence between file inode and file's xattr inode, for example, after comparing inode_lru_isolate(prune_icache_sb) with invalidate_inodes, we found that invalidate_mapping_pages is not invoked by invalidate_inodes, can we directly remove the branch 'if (inode_has_buffers(inode) || !mapping_empty(&inode->i_data))' from inode_lru_isolate?

Any better solutions?





[Index of Archives]     [Reiser Filesystem Development]     [Ceph FS]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite National Park]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Device Mapper]     [Linux Media]

  Powered by Linux