RE: Possible FS race condition between iterate_dir and d_alloc_parallel

Jun Li <jun.li@xxxxxxx> · Fri, 6 Sep 2019 00:55:22 +0000

Hi,
> -----Original Message-----
> From: Al Viro <viro@xxxxxxxxxxxxxxxx> On Behalf Of Al Viro
> Sent: 2019年9月6日 1:48
> To: zhengbin (A) <zhengbin13@xxxxxxxxxx>
> Cc: jack@xxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; linux-fsdevel@xxxxxxxxxxxxxxx; zhangyi
> (F) <yi.zhang@xxxxxxxxxx>; renxudong1@xxxxxxxxxx; Jun Li <jun.li@xxxxxxx>
> Subject: Re: Possible FS race condition between iterate_dir and d_alloc_parallel
> 
> On Wed, Sep 04, 2019 at 02:15:58PM +0800, zhengbin (A) wrote:
> > >>
> > >> Confused...  OTOH, I might be misreading that table of yours - it's
> > >> about 30% wider than the widest xterm I can get while still being
> > >> able to read the font...
> >
> > The table is my guess. This oops happens sometimes
> >
> > (We have one vmcore, others just have log, and the backtrace is same with vmcore, so
> the reason should be same).
> >
> > Unfortunately, we do not know how to reproduce it. The vmcore has such a law:
> >
> > 1、dirA has 177 files, and it is OK
> >
> > 2、dirB has 25 files, and it is OK
> >
> > 3、When we ls dirA, it begins with ".", "..", dirB's first file, second
> > file... last file,  last file->next = &(dirB->d_subdirs)
> 
> Hmm...  Now, that is interesting.  I'm not sure it has anything to do with that bug, but
> lockless loops over d_subdirs can run into trouble.
> 
> Look: dentry_unlist() leaves the ->d_child.next pointing to the next non-cursor list element
> (or parent's ->d_subdir, if there's nothing else left).  It works in pair with d_walk(): there we
> have
>                 struct dentry *child = this_parent;
>                 this_parent = child->d_parent;
> 
>                 spin_unlock(&child->d_lock);
>                 spin_lock(&this_parent->d_lock);
> 
>                 /* might go back up the wrong parent if we have had a rename. */
>                 if (need_seqretry(&rename_lock, seq))
>                         goto rename_retry;
>                 /* go into the first sibling still alive */
>                 do {
>                         next = child->d_child.next;
>                         if (next == &this_parent->d_subdirs)
>                                 goto ascend;
>                         child = list_entry(next, struct dentry, d_child);
>                 } while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED));
>                 rcu_read_unlock();
> 
> Note the recheck of rename_lock there - it does guarantee that even if child has been
> killed off between unlocking it and locking this_parent, whatever it has ended up with in its
> ->d_child->next has *not* been moved elsewhere.  It might, in turn, have been killed off.
> In that case its ->d_child.next points to the next surviving non-cursor, also guaranteed to
> remain in the same directory, etc.
> 
> However, lose that rename_lock recheck and we'd get screwed, unless there's some
> other d_move() prevention in effect.
> 
> Note that all libfs.c users (next_positive(), move_cursor(), dcache_dir_lseek(),
> dcache_readdir(), simple_empty()) should be safe - dcache_readdir() is called with
> directory locked at least shared, uses in dcache_dir_lseek() are surrounded by the same,
> move_cursor() and simple_empty() hold ->d_lock on parent,
> next_positive() is called only under the lock on directory's inode (at least shared).  Any of
> those should prevent any kind of cross-directory moves - both into and out of.
> 
> <greps for d_subdirs/d_child users>
> 
> Huh?
> In drivers/usb/typec/tcpm/tcpm.c:
> static void tcpm_debugfs_exit(struct tcpm_port *port) {
>         int i;
> 
>         mutex_lock(&port->logbuffer_lock);
>         for (i = 0; i < LOG_BUFFER_ENTRIES; i++) {
>                 kfree(port->logbuffer[i]);
>                 port->logbuffer[i] = NULL;
>         }
>         mutex_unlock(&port->logbuffer_lock);
> 
>         debugfs_remove(port->dentry);
>         if (list_empty(&rootdir->d_subdirs)) {
>                 debugfs_remove(rootdir);
>                 rootdir = NULL;
>         }
> }
> 
> Unrelated, but obviously broken.  Not only the locking is deeply suspect, but it's trivially
> confused by open() on the damn directory.  It will definitely have ->d_subdirs non-empty.
> 
> Came in "usb: typec: tcpm: remove tcpm dir if no children", author Cc'd...  Why not
> remove the directory on rmmod?

That's because tcpm is a utility driver and there may be multiple instances
created under the directory, each instance/user removal will call to tcpm_debugfs_exit()
but only the last one should remove the directory.

Below patch changed this by using dedicated dir for each instance:

https://www.spinics.net/lists/linux-usb/msg183965.html

Li Jun
> And create on insmod, initially empty...
> 
> fs/nfsd/nfsctl.c:
> static void nfsdfs_remove_files(struct dentry *root) {
>         struct dentry *dentry, *tmp;
> 
>         list_for_each_entry_safe(dentry, tmp, &root->d_subdirs, d_child) {
>                 if (!simple_positive(dentry)) {
>                         WARN_ON_ONCE(1); /* I think this can't happen? */
>                         continue;
> It can happen - again, just have it opened and it bloody well will.
> Locking is OK, though - parent's inode is locked, so we are safe from d_move() playing
> silly buggers there.
> 
> fs/autofs/root.c:
> static void autofs_clear_leaf_automount_flags(struct dentry *dentry) { ...
>         /* Set parent managed if it's becoming empty */
>         if (d_child->next == &parent->d_subdirs &&
>             d_child->prev == &parent->d_subdirs)
>                 managed_dentry_set_managed(parent);
> 
> Same bogosity regarding the check for emptiness (that one might've been my fault).
> Locking is safe...  Not sure if all places in autofs/expire.c are careful enough...
> 
> So it doesn't look like this theory holds.  Which filesystem had that been on and what
> about ->d_parent of dentries in dirA and dirB
> ->d_subdirs?