Re: [PATCH 0/5] SMB cached directory fixes around reconnection/unmounting

Paul Aurich <paul@xxxxxxxxxxxxxx> · Fri, 8 Nov 2024 14:44:43 -0800

No, this series doesn't try to address that. I was just focused on the 
issues around the lifecycle of the cfids and the dentry:s.

~Paul

On 2024-11-08 16:39:03 -0600, Steve French wrote:
The perf improvement allowing caching of directories (helping "ls"
then "ls" again for same dir) not just the perf improvement with "ls
"then "stat" could be very important.

Did this series also address Ralph's point about missing requesting
"H" (handle caching) flag when requesting directory leases from the
server?

On Fri, Nov 8, 2024 at 4:35 PM Paul Aurich <paul@xxxxxxxxxxxxxx> wrote:

The SMB client cached directory functionality has a few problems around
flaky/lost server connections, which manifest as a pair of BUGs when
eventually unmounting the server connection:

    [18645.013550] BUG: Dentry ffff888140590ba0{i=1000000000080,n=/}  still in use (2) [unmount of cifs cifs]
    [18645.789274] VFS: Busy inodes after unmount of cifs (cifs)

Based on bisection, these issues started with the lease directory cache
handling introduced in commit ebe98f1447bb ("cifs: enable caching of
directories for which a lease is held"), and go away if I mount with
'nohandlecache'.  I started seeing these on Debian Bookworm stable kernel
(v6.1.x), but the issues persist even in current git versions.  I think the
situation was improved (occurrence frequency went down) with
commit 5c86919455c1 ("smb: client: fix use-after-free in
smb2_query_info_compound()").

I'm able to reproduce the "Dentry still in use" errors by connecting to an
actively-used SMB share (the server organically generates lease breaks) and
leaving these running for 'a while':

- while true; do cd ~; sleep 1; for i in {1..3}; do cd /mnt/test/subdir; echo $PWD; sleep 1; cd ..; echo $PWD; sleep 1; done; echo ...; done
- while true; do iptables -F OUTPUT; mount -t cifs -a; for _ in {0..2}; do ls /mnt/test/subdir/ | wc -l; done; iptables -I OUTPUT -p tcp --dport 445 -j DROP; sleep 10; echo "unmounting"; umount -l -t cifs -a; echo "done unmounting"; sleep 20; echo "recovering"; iptables -F OUTPUT; sleep 10; done

('a while' is anywhere from 10 minutes to overnight. Also, it's not the
cleanest reproducer, but I stopped iterating once I had something that was
even remotely reliable for me...)

This series attempts to fix these, as well as a use-after-free that could
occur because open_cached_dir() explicitly frees the cached_fid, rather than
relying on reference counting.

The last patch in this series (smb: During umount, flush any pending lease
breaks for cached dirs) is a work-in-progress, and should not be taken as-is.
The issue there:

Unmounting an SMB share (cifs_kill_sb) can race with a queued lease break from
the server for a cached directory.  When this happens, the cfid is removed
from the linked list, so close_all_cached_dirs() cannot drop the dentry.  If
cifs_kill_sb continues on before the queued work puts the dentry, we trigger
the "Dentry still in use" BUG splat.  Flushing the cifsiod_wq seems to help
this, but some thoughts:

1. cifsiod_wq is a global workqueue, rather than per-mount.  Flushing the
   entire workqueue is potentially doing more work that necessary.  Should
   there be a workqueue that's more appropriately scoped?
2. With an unresponsive server, this causes umount (even umount -l) to hang
   (waiting for SMB2_close calls), and when I test with backports on a 6.1
   kernel, appears to cause a deadlock between kill_sb and some cifs
   reconnection code calling iterate_supers_type.  (Pretty sure the deadlock
   was addressed by changes to fs/super.c, so not really an SMB problem, but
   just an indication that flush_waitqueue isn't the right solution).
3. Should cached_dir_lease_break() drop the dentry before queueing work
   (and if so, is it OK to do this under the spinlock, or should the spinlock
   be dropped first)?
4. Related to #3 -- shouldn't close_all_cached_dirs() be holding the spinlock
   while looping?

Lastly, patches 2, 3, and 5 (in its final form) are beneficial going back to
v6.1 for stable, but it's not a clean backport because some other important
fixes (commit 5c86919455c1 ("smb: client: fix use-after-free in
smb2_query_info_compound()") weren't picked up.

Paul Aurich (5):
  smb: cached directories can be more than root file handle
  smb: Don't leak cfid when reconnect races with open_cached_dir
  smb: prevent use-after-free due to open_cached_dir error paths
  smb: No need to wait for work when cleaning up cached directories
  smb: During umount, flush any pending lease breaks for cached dirs

 fs/smb/client/cached_dir.c | 106 ++++++++++++++++---------------------
 1 file changed, 47 insertions(+), 59 deletions(-)

--
2.45.2

--
Thanks,

Steve