Patch "btrfs: do not take the uuid_mutex in btrfs_rm_device" has been added to the 5.4-stable tree

Sasha Levin <sashal@xxxxxxxxxx> · Sun, 14 Nov 2021 22:48:21 -0500

This is a note to let you know that I've just added the patch titled

    btrfs: do not take the uuid_mutex in btrfs_rm_device

to the 5.4-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     btrfs-do-not-take-the-uuid_mutex-in-btrfs_rm_device.patch
and it can be found in the queue-5.4 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit cdeec5af7258fbbc4dc60eb86848be7bcb30c9ed
Author: Josef Bacik <josef@xxxxxxxxxxxxxx>
Date:   Tue Jul 27 17:01:14 2021 -0400

    btrfs: do not take the uuid_mutex in btrfs_rm_device
    
    [ Upstream commit 8ef9dc0f14ba6124c62547a4fdc59b163d8b864e ]
    
    We got the following lockdep splat while running fstests (specifically
    btrfs/003 and btrfs/020 in a row) with the new rc.  This was uncovered
    by 87579e9b7d8d ("loop: use worker per cgroup instead of kworker") which
    converted loop to using workqueues, which comes with lockdep
    annotations that don't exist with kworkers.  The lockdep splat is as
    follows:
    
      WARNING: possible circular locking dependency detected
      5.14.0-rc2-custom+ #34 Not tainted
      ------------------------------------------------------
      losetup/156417 is trying to acquire lock:
      ffff9c7645b02d38 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x84/0x600
    
      but task is already holding lock:
      ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]
    
      which lock already depends on the new lock.
    
      the existing dependency chain (in reverse order) is:
    
      -> #5 (&lo->lo_mutex){+.+.}-{3:3}:
             __mutex_lock+0xba/0x7c0
             lo_open+0x28/0x60 [loop]
             blkdev_get_whole+0x28/0xf0
             blkdev_get_by_dev.part.0+0x168/0x3c0
             blkdev_open+0xd2/0xe0
             do_dentry_open+0x163/0x3a0
             path_openat+0x74d/0xa40
             do_filp_open+0x9c/0x140
             do_sys_openat2+0xb1/0x170
             __x64_sys_openat+0x54/0x90
             do_syscall_64+0x3b/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
    
      -> #4 (&disk->open_mutex){+.+.}-{3:3}:
             __mutex_lock+0xba/0x7c0
             blkdev_get_by_dev.part.0+0xd1/0x3c0
             blkdev_get_by_path+0xc0/0xd0
             btrfs_scan_one_device+0x52/0x1f0 [btrfs]
             btrfs_control_ioctl+0xac/0x170 [btrfs]
             __x64_sys_ioctl+0x83/0xb0
             do_syscall_64+0x3b/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
    
      -> #3 (uuid_mutex){+.+.}-{3:3}:
             __mutex_lock+0xba/0x7c0
             btrfs_rm_device+0x48/0x6a0 [btrfs]
             btrfs_ioctl+0x2d1c/0x3110 [btrfs]
             __x64_sys_ioctl+0x83/0xb0
             do_syscall_64+0x3b/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
    
      -> #2 (sb_writers#11){.+.+}-{0:0}:
             lo_write_bvec+0x112/0x290 [loop]
             loop_process_work+0x25f/0xcb0 [loop]
             process_one_work+0x28f/0x5d0
             worker_thread+0x55/0x3c0
             kthread+0x140/0x170
             ret_from_fork+0x22/0x30
    
      -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
             process_one_work+0x266/0x5d0
             worker_thread+0x55/0x3c0
             kthread+0x140/0x170
             ret_from_fork+0x22/0x30
    
      -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
             __lock_acquire+0x1130/0x1dc0
             lock_acquire+0xf5/0x320
             flush_workqueue+0xae/0x600
             drain_workqueue+0xa0/0x110
             destroy_workqueue+0x36/0x250
             __loop_clr_fd+0x9a/0x650 [loop]
             lo_ioctl+0x29d/0x780 [loop]
             block_ioctl+0x3f/0x50
             __x64_sys_ioctl+0x83/0xb0
             do_syscall_64+0x3b/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
    
      other info that might help us debug this:
      Chain exists of:
        (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
       Possible unsafe locking scenario:
             CPU0                    CPU1
             ----                    ----
        lock(&lo->lo_mutex);
                                     lock(&disk->open_mutex);
                                     lock(&lo->lo_mutex);
        lock((wq_completion)loop0);
    
       *** DEADLOCK ***
      1 lock held by losetup/156417:
       #0: ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]
    
      stack backtrace:
      CPU: 8 PID: 156417 Comm: losetup Not tainted 5.14.0-rc2-custom+ #34
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      Call Trace:
       dump_stack_lvl+0x57/0x72
       check_noncircular+0x10a/0x120
       __lock_acquire+0x1130/0x1dc0
       lock_acquire+0xf5/0x320
       ? flush_workqueue+0x84/0x600
       flush_workqueue+0xae/0x600
       ? flush_workqueue+0x84/0x600
       drain_workqueue+0xa0/0x110
       destroy_workqueue+0x36/0x250
       __loop_clr_fd+0x9a/0x650 [loop]
       lo_ioctl+0x29d/0x780 [loop]
       ? __lock_acquire+0x3a0/0x1dc0
       ? update_dl_rq_load_avg+0x152/0x360
       ? lock_is_held_type+0xa5/0x120
       ? find_held_lock.constprop.0+0x2b/0x80
       block_ioctl+0x3f/0x50
       __x64_sys_ioctl+0x83/0xb0
       do_syscall_64+0x3b/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f645884de6b
    
    Usually the uuid_mutex exists to protect the fs_devices that map
    together all of the devices that match a specific uuid.  In rm_device
    we're messing with the uuid of a device, so it makes sense to protect
    that here.
    
    However in doing that it pulls in a whole host of lockdep dependencies,
    as we call mnt_may_write() on the sb before we grab the uuid_mutex, thus
    we end up with the dependency chain under the uuid_mutex being added
    under the normal sb write dependency chain, which causes problems with
    loop devices.
    
    We don't need the uuid mutex here however.  If we call
    btrfs_scan_one_device() before we scratch the super block we will find
    the fs_devices and not find the device itself and return EBUSY because
    the fs_devices is open.  If we call it after the scratch happens it will
    not appear to be a valid btrfs file system.
    
    We do not need to worry about other fs_devices modifying operations here
    because we're protected by the exclusive operations locking.
    
    So drop the uuid_mutex here in order to fix the lockdep splat.
    
    A more detailed explanation from the discussion:
    
    We are worried about rm and scan racing with each other, before this
    change we'll zero the device out under the UUID mutex so when scan does
    run it'll make sure that it can go through the whole device scan thing
    without rm messing with us.
    
    We aren't worried if the scratch happens first, because the result is we
    don't think this is a btrfs device and we bail out.
    
    The only case we are concerned with is we scratch _after_ scan is able
    to read the superblock and gets a seemingly valid super block, so lets
    consider this case.
    
    Scan will call device_list_add() with the device we're removing.  We'll
    call find_fsid_with_metadata_uuid() and get our fs_devices for this
    UUID.  At this point we lock the fs_devices->device_list_mutex.  This is
    what protects us in this case, but we have two cases here.
    
    1. We aren't to the device removal part of the RM.  We found our device,
       and device name matches our path, we go down and we set total_devices
       to our super number of devices, which doesn't affect anything because
       we haven't done the remove yet.
    
    2. We are past the device removal part, which is protected by the
       device_list_mutex.  Scan doesn't find the device, it goes down and
       does the
    
       if (fs_devices->opened)
               return -EBUSY;
    
       check and we bail out.
    
    Nothing about this situation is ideal, but the lockdep splat is real,
    and the fix is safe, tho admittedly a bit scary looking.
    
    Reviewed-by: Anand Jain <anand.jain@xxxxxxxxxx>
    Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
    Reviewed-by: David Sterba <dsterba@xxxxxxxx>
    [ copy more from the discussion ]
    Signed-off-by: David Sterba <dsterba@xxxxxxxx>
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8af92e44deae4..344d18de1f08c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2162,8 +2162,11 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
 	u64 num_devices;
 	int ret = 0;
 
-	mutex_lock(&uuid_mutex);
-
+	/*
+	 * The device list in fs_devices is accessed without locks (neither
+	 * uuid_mutex nor device_list_mutex) as it won't change on a mounted
+	 * filesystem and another device rm cannot run.
+	 */
 	num_devices = btrfs_num_devices(fs_info);
 
 	ret = btrfs_check_raid_min_devices(fs_info, num_devices - 1);
@@ -2207,11 +2210,9 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
 		mutex_unlock(&fs_info->chunk_mutex);
 	}
 
-	mutex_unlock(&uuid_mutex);
 	ret = btrfs_shrink_device(device, 0);
 	if (!ret)
 		btrfs_reada_remove_dev(device);
-	mutex_lock(&uuid_mutex);
 	if (ret)
 		goto error_undo;
 
@@ -2293,7 +2294,6 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
 	}
 
 out:
-	mutex_unlock(&uuid_mutex);
 	return ret;
 
 error_undo: