On Fri, Nov 15, 2019 at 02:17:54PM +0000, Al Viro wrote: > On Fri, Nov 15, 2019 at 08:58:05AM -0500, Steven Rostedt wrote: > > On Fri, 15 Nov 2019 13:48:23 +0000 > > Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote: > > > > > > BTW, what do you mean by "can debugfs_remove_recursive() rely upon the > > > > lack of attempts to create new entries inside the subtree it's trying > > > > to kill?" > > > > > > Is it possible for something to call e.g. debugfs_create_dir() (or any > > > similar primitive) with parent inside the subtree that has been > > > passed to debugfs_remove_recursive() call that is still in progress? > > > > > > If debugfs needs to cope with that, debugfs_remove_recursive() needs > > > considerably heavier locking, to start with. > > > > I don't know about debugfs, but at least tracefs (which cut and pasted > > from debugfs) does not allow that. At least in theory it doesn't allow > > that (and if it does, it's a bug in the locking at the higher levels). > > > > And perhaps debugfs shouldn't allow that either. As it is only suppose > > to be a light weight way to interact with the kernel, hence the name > > "debugfs". > > > > Yu, do you have a test case for the "infinite loop" case? > > Infinite loop, AFAICS, is reasonably easy to trigger - just open > a non-empty subdirectory and lseek to e.g. next-to-last element > in it. Again, list_empty() use in there is quite wrong - it can > give false negatives just on the cursors. No arguments about > that part... FWIW, given that debugfs_remove_recursive() has no way to report an error *and* that we have a single chokepoint for all entry creations (start_creating()) we could make sure nothing gets added pretty easily - just mark the victim dentry as "don't allow any creations here" when we first reach it and have start_creating check that, using e.g. inode_lock() for serialization. Marking could be done e.g. by setting ->d_fsdata to (void *)DEBUGFS_FSDATA_IS_REAL_FOPS_BIT, so that ->d_release() doesn't need any changes. BTW, this if (!ret) d_delete(dentry); if (d_is_reg(dentry)) __debugfs_file_removed(dentry); dput(dentry); in __debugfs_remove() is both subtle and bogus. If we get here with refcount > 1, d_delete() is just a d_drop() - dentry can't be made negative, so it gets unhashed instead. If we *do* get here with refcount 1, it will be made negative, all right. Only to be killed off by immediately following dput(), since debugfs doesn't retain unpinned dentries. Why immediate? Because d_is_reg() is obviously false on negative dentries, so __debugfs_file_removed() is not called. That's where the subtle part begins: there should've been nobody for __debugfs_file_removed() to wait for, since they would've had to hold additional references to dentry and refcount wouldn't have been 1. So that's actually not a bug. However, it's too subtle to introduce without having bothered to even comment the damn thing. As for the "bogus" part - that d_delete() is bollocks. First of all, it is fully equivalent to d_drop(). Always had been. What's more, that sucker should've been d_invalidate() instead. Anyway, AFAICS removal could be done this way: // parent is held exclusive, after is NULL or a child of parent find_next_child(parent, prev) child = NULL node = prev ? &prev->d_child : &parent->d_subdirs; grab parent->d_lock for each entry in the list starting at node->next d = container_of(entry, struct dentry, d_child) grab d->d_lock if simple_positive(d) bump d->d_count child = d drop d->d_lock if child break drop parent->d_lock dput(prev); return child kill_it(victim, parent) if simple_positive(victim) d_invalidate(victim); // needed to avoid lost mounts if victim is a directory fsnotify_rmdir else fsnotify_unlink if victim is regular __debugfs_file_removed(victim) dput(victim) // unpin it recursive_removal(dentry) this = dentry; while (true) { victim = NULL; inode = this->d_inode; inode_lock(inode); if (d_is_dir(this)) mark this doomed while ((child = find_next_child(this, victim)) == NULL) { // no children (left); kill and ascend // update metadata while it's still locked inode->i_ctime = current_time(inode); clear_nlink(inode); inode_unlock(inode); victim = this; this = this->d_parent; inode = this->d_inode; inode_lock(inode); kill_it(victim, this); if (victim == dentry) { inode->i_ctime = inode->i_mtime = current_time(inode); if (d_is_dir(dentry)) drop_nlink(inode); inode_unlock(inode); dput(dentry); return; } } inode_unlock(inode); this = child; }