Re: [PATCH] fs: open the block device after allocation the super_block

Christian Brauner <brauner@xxxxxxxxxx> · Tue, 15 Aug 2023 16:43:07 +0200

>  		up_write(&s->s_umount);
> -		blkdev_put(bdev, fs_type);
> +		error = setup_bdev_super(s, flags, NULL);
>  		down_write(&s->s_umount);

So I've been looking through the branches to see what's ready for v6.6
and what needs some more time. While doing so I went over this again and
realized that we have an issue here.

While it looks like dropping s_umount here and calling
setup_bdev_super() is fine I think it isn't. Consider two processes
racing to create the same mount:

P1                                                                    P2
vfs_get_tree()                                                        vfs_get_tree()
-> get_tree() == get_tree_bdev()                                      -> get_tree() == get_tree_bdev()
   -> sget_fc()                                                          -> sget_fc()
        // allocate new sb; no matching sb found
      -> sb_p1 = alloc_super()
      -> hlist_add_head(&s->s_instances, &s->s_type->fs_supers)
      -> spin_unlock(&sb_lock)                                              
      // yield s_umount to avoid deadlocks
   -> up_write(&sb->s_umount)
                                                                            -> spin_lock(&sb_lock)
                                                                               // find sb_p1
                                                                               if (test(old, fc))
                                                                                       goto share_extant_sb;
      // Assume P1 sleeps on bdev_lock or open_mutex
      // in blkdev_get_by_dev().
   -> setup_bdev_super()
   -> down_write(&sb->s_umount)

Now P2 jumps to the share_extant_sb label and calls:

grab_super(sb_p1)
-> spin_unlock(&sb_lock)
-> down_write(&s->s_umount)

Since s_umount is unlocked P2 doesn't go to sleep and instead immediately
goes to retry by jumping to the "retry" label. If P1 is still sleeping
on a a bdev mutex the same thing happens again.

So if you have a range of processes P{1,n} that all try to mount the
same device you're hammering endlessly on sb_lock without ever going to
sleep like we used to. The same problem exists for all iterate_supers()
and user_get_dev() variants afaict. So that needs fixing unless I'm
wrong. grab_super() and other variants need to sleep until
setup_bdev_super() is done and not busy loop. Am I wrong here?