Re: [PATCH] btrfs: fix the race between umount and btrfs-cleaner

Julian Sun <sunjunchao2870@xxxxxxxxx> · Thu, 22 Aug 2024 19:41:31 +0800

Hi, Josef.
Thanks for your review and comments.
Yeah, you are correct, sorry for the mistake.

I'm thinking if is there any misunderstanding in the following scenario

umount thread:                     btrfs-cleaner thread:
                                   btrfs_run_delayed_iputs()
                                     ->run_delayed_iput_locked()
                                       ->iput(inode)
                                         ->spin_lock(inode)
 				         // inode->i_count dec to 0
                                         ->iput_final()
                                           ->__inode_add_lru()
                                            // mapping_shrinkable() returned false
                                            // so iput_final() returned with i_count=0
                                            // note here: the inode still
 btrfs_kill_super()                         // in the sb list and I_FREEING
   ->generic_shutdown_super()               // was not set
     ->evict_inodes()                       ->spin_unlock(inode)
      ->spin_lock(inode)
      // i_count is 0 and                // continue running...
      // I_FREEING was not set           ->__btrfs_run_defrag_inode()
      // so the inode was added to         ->btrfs_iget()
      // dispost list                        // note the inode is still
                                             // in the sb list  
                                             // and I_WILL_FREE|I_FREEING
                                             // was not set
  ->spin_unlock()                            ->find_inode()
                                              ->spin_lock(inode)
                                              ->__iget();
                                              ->spin_unlock();
  ->dispose_list()                           
    ->evict()                             ->iput()
     // note: i_lock was not                ->spin_lock(inode)
     // hold here.                          ->iput_final()
                                              ->spin_unlock()
                                              ->evict()

Now, we have two threads simultaneously evicting
the same inode, which led to a bug.

The key points here are:
1. btrfs-cleaner: iput_final() called __inode_add_lru() and then
return because some reason with i_count=0, and I_FREEING was not set, 
and most important, i_lock was unlocked,
which gives evict_inodes() a perfect chance to put the inode into dispose list.
2. umount thread: The inode (which i_count is 0 and I_FREEING was not set) 
was added to dispose list.
3. btrfs-cleaner: Before umount thread set I_FREEING flag, btrfs_iget()
finded the inode and inc i_count
4. umount thread called evict() and btrfs-cleaner called iput(), which leads 
to bug.

The above behavior can be confirmed by the debugging log 
based on commit d30d0e49da71:

diff --git a/fs/btrfs/defrag.c b/fs/btrfs/defrag.c
index 407ccec3e57e..cc50dd1e8beb 100644
--- a/fs/btrfs/defrag.c
+++ b/fs/btrfs/defrag.c
@@ -278,6 +278,8 @@ static int __btrfs_run_defrag_inode(struct
btrfs_fs_info *fs_info,
 	ret = btrfs_defrag_file(inode, NULL, &range, defrag->transid,
 				       BTRFS_DEFRAG_BATCH);
 	sb_end_write(fs_info->sb);
+	if (inode->i_state & (1<<19))
+		pr_info("bug! double evict! crash will happen\n");
 	iput(inode);
 
 	if (ret < 0)
diff --git a/fs/inode.c b/fs/inode.c
index 3a41f83a4ba5..641a828bc2b7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -461,8 +461,14 @@ static void __inode_add_lru(struct inode *inode,
bool rotate)
 		return;
 	if (!(inode->i_sb->s_flags & SB_ACTIVE))
 		return;
-	if (!mapping_shrinkable(&inode->i_data))
+	if (!mapping_shrinkable(&inode->i_data)) {
+		if (inode->i_ino == 261) {
+			pr_info("set inode 0x%px dbg_flag casue
mapping_shrinkable\n", inode);
+			dump_stack();
+			WRITE_ONCE(inode->i_state, inode-
>i_state|(1<<19));
+		}
 		return;
+	}
 
 	if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode-
>i_lru))
 		this_cpu_inc(nr_unused);
@@ -623,7 +629,10 @@ void clear_inode(struct inode *inode)
 	xa_unlock_irq(&inode->i_data.i_pages);
 	BUG_ON(!list_empty(&inode->i_data.i_private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
-	BUG_ON(inode->i_state & I_CLEAR);
+	if (inode->i_state & I_CLEAR) {
+		pr_info("inode 0x%px state 0x%lx triggered bug\n",
inode, inode->i_state);
+		BUG();
+	}
 	BUG_ON(!list_empty(&inode->i_wb_list));
 	/* don't need i_lock here, no concurrent mods to i_state */
 	inode->i_state = I_FREEING | I_CLEAR;
@@ -1754,7 +1763,10 @@ void iput(struct inode *inode)
 {
 	if (!inode)
 		return;
-	BUG_ON(inode->i_state & I_CLEAR);
+	if (inode->i_state & I_CLEAR) {
+		pr_info("inode 0x%px state 0x%lx triggered a bug\n",
inode, inode->i_state);
+		BUG();
+	}
 retry:
 	if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) {
 		if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME))
{
@@ -1764,6 +1776,8 @@ void iput(struct inode *inode)
 			mark_inode_dirty_sync(inode);
 			goto retry;
 		}
+		if (inode->i_ino == 261 && (inode->i_state &(1<<19)))
+			pr_info("state is 0x%lx, bug may be
triggered\n", inode->i_state);
 		iput_final(inode);
 	}
 }

The (1<<19) bit was used to debugging.

Here is the log when bug was triggered:

[   33.028024][ T2451] set inode 0xffff88801fb87738 dbg_flag casue
mapping_shrinkable
[   33.028478][ T2451] CPU: 0 PID: 2451 Comm: btrfs-cleaner Not tainted
6.10.0-rc2-00225-gf0deb01d52b3-dirty #132
[   33.029072][ T2451] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   33.029675][ T2451] Call Trace:
[   33.029873][ T2451]  <TASK>
[   33.030046][ T2451]  dump_stack_lvl+0x16c/0x1f0
[   33.030330][ T2451]  __inode_add_lru.part.0+0x1ea/0x280
[   33.030647][ T2451]  iput+0x7c7/0x920
[   33.030876][ T2451]  run_delayed_iput_locked+0x136/0x1e0
[   33.031199][ T2451]  btrfs_run_delayed_iputs+0x8e/0x120
[   33.031516][ T2451]  cleaner_kthread+0x2d3/0x480
[   33.031801][ T2451]  ? __pfx_cleaner_kthread+0x10/0x10
[   33.032113][ T2451]  kthread+0x2c1/0x3a0
[   33.032354][ T2451]  ? _raw_spin_unlock_irq+0x23/0x50
[   33.032659][ T2451]  ? __pfx_kthread+0x10/0x10
[   33.032929][ T2451]  ret_from_fork+0x45/0x80
[   33.033198][ T2451]  ? __pfx_kthread+0x10/0x10
[   33.033469][ T2451]  ret_from_fork_asm+0x1a/0x30
[   33.033756][ T2451]  </TASK>
[   33.033978][ T2451] bug! double evict! crash will happen
[   33.034168][ T2332] BTRFS info (device loop0): last unmount of
filesystem c9fe44da-de57-406a-8241-57ec7d4412cf
[   33.034295][ T2451] inode 0xffff88801fb87738 state 0x60 triggered a
bug
[   33.034332][ T2451] ------------[ cut here ]------------
[   33.035578][ T2451] kernel BUG at fs/inode.c:1768!
[   33.035873][ T2451] Oops: invalid opcode: 0000 [#1] PREEMPT SMP
KASAN NOPTI
[   33.036286][ T2451] CPU: 0 PID: 2451 Comm: btrfs-cleaner Not tainted
6.10.0-rc2-00225-gf0deb01d52b3-dirty #132
[   33.036870][ T2451] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   33.037464][ T2451] RIP: 0010:iput+0x896/0x920
// logs ...
[   33.042882][ T2451] PKRU: 55555554
[   33.043092][ T2451] Call Trace:
[   33.043288][ T2451]  <TASK>
[   33.044190][ T2451]  ? iput+0x896/0x920
[   33.047321][ T2451]  ? btrfs_run_defrag_inodes+0xb49/0xe60
[   33.047653][ T2451]  btrfs_run_defrag_inodes+0xa40/0xe60
[   33.047975][ T2451]  ? __pfx_btrfs_run_defrag_inodes+0x10/0x10
[   33.048327][ T2451]  ? __pfx___mutex_unlock_slowpath+0x10/0x10
[   33.048678][ T2451]  ? __pfx_do_raw_spin_lock+0x10/0x10
[   33.048996][ T2451]  ? _raw_spin_unlock+0x28/0x50
[   33.049285][ T2451]  ? btrfs_clean_one_deleted_snapshot+0x2b2/0x420
[   33.049660][ T2451]  cleaner_kthread+0x2ee/0x480
[   33.049941][ T2451]  ? __pfx_cleaner_kthread+0x10/0x10
[   33.050252][ T2451]  kthread+0x2c1/0x3a0
[   33.050492][ T2451]  ? _raw_spin_unlock_irq+0x23/0x50
[   33.050796][ T2451]  ? __pfx_kthread+0x10/0x10
[   33.051067][ T2451]  ret_from_fork+0x45/0x80
[   33.051333][ T2451]  ? __pfx_kthread+0x10/0x10
[   33.051603][ T2451]  ret_from_fork_asm+0x1a/0x30
[   33.051887][ T2451]  </TASK>
[   33.052066][ T2451] Modules linked in:
[   33.052316][ T2451] ---[ end trace 0000000000000000 ]---
[   33.052634][ T2451] RIP: 0010:iput+0x896/0x920

Please note inode 0xffff88801fb87738, which was set debug flag by
__inode_add_lru() in the run_delayed_iput_locked() code path. This 
indicates that the inode was not actually evicted. 
Then, please note the double evict log, which shows that this
inode which was set debug flag and was not be evicted is 
still on the sb list and was successfully acquired by btrfs_iget(). 
Next, take note of the unmount log, indicating that there was 
a concurrent unmount thread which has already called evict_inodes() at that time. 
Finally, observe where the BUG was triggered: the state of inode
0xffff88801fb87738 had been set to 0x60 by the unmount thread calling
inode_clear(), and at the same time, btrfs-cleaner called iput(),
ultimately leading to the BUG at the beginning check of iput(). 
Although the point of trigger differs from the one reported by 
syzbot, there is no doubt that they are the same bug.

You could apply the above diff and try to reproduce the bug using the 
reproducer[1].

If there is anything I'm missing or more info was needed, please let me know.

[1]: https://syzkaller.appspot.com/x/repro.c?x=14c57f16980000

Thanks.