Re: [PATCH] vfs: Fix potential circular locking through setxattr() and removexattr()

Jan Kara <jack@xxxxxxx> · Wed, 24 Jul 2024 15:30:09 +0200

On Tue 23-07-24 14:57:46, David Howells wrote:
> Jan Kara <jack@xxxxxxx> wrote:
> 
> > Well, it seems like you are trying to get rid of the dependency
> > sb_writers->mmap_sem. But there are other places where this dependency is
> > created, in particular write(2) path is a place where it would be very
> > difficult to get rid of it (you take sb_writers, then do all the work
> > preparing the write and then you copy user data into page cache which
> > may require mmap_sem).
> >
> > ...
> > 
> > This is the problematic step - from quite deep in the locking chain holding
> > invalidate_lock and having PG_Writeback set you suddently jump to very outer
> > locking context grabbing sb_writers. Now AFAICT this is not a real deadlock
> > problem because the locks are actually on different filesystems, just
> > lockdep isn't able to see this. So I don't think you will get rid of these
> > lockdep splats unless you somehow manage to convey to lockdep that there's
> > the "upper" fs (AFS in this case) and the "lower" fs (the one behind
> > cachefiles) and their locks are different.
> 
> I'm not sure you're correct about that.  If you look at the lockdep splat:
> 
> >  -> #2 (sb_writers#14){.+.+}-{0:0}:
> 
> The sb_writers lock is "personalised" to the filesystem type (the "#14"
> annotation) which is set here:
> 
> 	for (i = 0; i < SB_FREEZE_LEVELS; i++) {
> 		if (__percpu_init_rwsem(&s->s_writers.rw_sem[i],
> 					sb_writers_name[i],
> 					&type->s_writers_key[i]))  <----
> 			goto fail;
> 	}
> 
> in fs/super.c.

Right, forgot about that. Thanks for correction! So after pondering about
this some more, this is actually a real problem and deadlockable. See
below.

> I think the problem is (1) that on one side, you've got, say, sys_setxattr()
> taking an sb_writers lock and then accessing a userspace buffer, which (a) may
> take mm->mmap_lock and vma->vm_lock and (b) may cause reading or writeback
> from the netfs-based filesystem via an mmapped xattr name buffer].

Yes, we agree on that. I was just pointing out that:

vfs_write()
  file_start_write() -> grabs sb_writers
  generic_file_write_iter()
    generic_perform_write()
      fault_in_iov_iter_readable()

is another path which enforces exactly the same lock ordering.

> Then (2) on the other side, you have a read or a write to the network
> filesystem through netfslib which may invoke the cache, which may require
> cachefiles to check the xattr on the cache file and maybe set/remove it -
> which requires the sb_writers lock on the cache filesystem.
> 
> So if ->read_folio(), ->readahead() or ->writepages() can ever be called with
> mm->mmap_lock or vma->vm_lock held, netfslib may call down to cachefiles and
> ultimately, it should[*] then take the sb_writers lock on the backing
> filesystem to perform xattr manipulation.

Well, ->read_folio() under mm->mmap_lock is a standard thing to happen in a
page fault. Now grabbing sb_writers (of any filesystem) in that path is
problematic and can deadlock though:

page fault
  grab mm->mmap_lock
  filemap_fault()
    if (unlikely(!folio_test_uptodate(folio))) {
      filemap_read_folio() on fs A
        now if you grab sb_writers on fs B:
			freeze_super() on fs B		write(2) on fs B
							  sb_start_write(fs B)
			  sb->s_writers.frozen = SB_FREEZE_WRITE;
			  sb_wait_write(sb, SB_FREEZE_WRITE);
			    - waits for write
	sb_start_write(fs B) - blocked behind freeze_super()
							  generic_perform_write()
							    fault_in_iov_iter_readable()
							      page fault
							        grab mm->mmap_lock
								  => deadlock

Now this is not the deadlock your lockdep trace is showing but it is a
similar one. Like:

filemap_invalidate() on fs A	freeze_super() on fs B	page fault on fs A	write(2) on fs B
  filemap_invalidate_lock()				  lock mm->mmap_lock	  sb_start_write(fs B)
  filemap_fdatawrite_wbc()				  filemap_fault()
    afs_writepages()					    filemap_invalidate_lock_shared()
      cachefiles_issue_write()				      => blocks behind filemap_invalidate()
				  sb->s_writers.frozen = SB_FREEZE_WRITE;
				  sb_wait_write(sb, SB_FREEZE_WRITE);
				    => blocks behind write(2)
        sb_start_write() on fs B
	  => blocks behind freeze_super()
							  			  generic_perform_write()
										    fault_in_iov_iter_readable()
										      page fault
										        grab mm->mmap_lock
											  => deadlock

So I still maintain that grabbing sb_start_write() from quite deep within
locking hierarchy (like from writepages when having pages locked, but even
holding invalidate_lock is enough for the problems) is problematic and
prone to deadlocks.

> [*] I say "should" because at the moment cachefiles calls vfs_set/removexattr
>     functions which *don't* take this lock (which is a bug).  Is this an error
>     on the part of vfs_set/removexattr()?  Should they take this lock
>     analogously with vfs_truncate() and vfs_iocb_iter_write()?
> 
> However, as it doesn't it manages to construct a locking chain via the
> mapping.invalidate_lock, the afs vnode->validate_lock and something in execve
> that I don't exactly follow.
> 
> 
> I wonder if this is might be deadlockable by a multithreaded process (ie. so
> they share the mm locks) where one thread is writing to a cached file whilst
> another thread is trying to set/remove the xattr on that file.

Yep, see above. Now the hard question is how to fix this because what you
are doing seems to be inherent in how cachefiles fs is designed to work.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR