Re: [PATCH 1/3] dcache: add a new enum type for 'dentry_d_lock_class'

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Fri, 15 Nov 2019 17:54:23 +0000

On Fri, Nov 15, 2019 at 02:17:54PM +0000, Al Viro wrote:
> On Fri, Nov 15, 2019 at 08:58:05AM -0500, Steven Rostedt wrote:
> > On Fri, 15 Nov 2019 13:48:23 +0000
> > Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> > 
> > > > BTW, what do you mean by "can debugfs_remove_recursive() rely upon the
> > > > lack of attempts to create new entries inside the subtree it's trying
> > > > to kill?"  
> > > 
> > > Is it possible for something to call e.g. debugfs_create_dir() (or any
> > > similar primitive) with parent inside the subtree that has been
> > > passed to debugfs_remove_recursive() call that is still in progress?
> > > 
> > > If debugfs needs to cope with that, debugfs_remove_recursive() needs
> > > considerably heavier locking, to start with.
> > 
> > I don't know about debugfs, but at least tracefs (which cut and pasted
> > from debugfs) does not allow that. At least in theory it doesn't allow
> > that (and if it does, it's a bug in the locking at the higher levels).
> > 
> > And perhaps debugfs shouldn't allow that either. As it is only suppose
> > to be a light weight way to interact with the kernel, hence the name
> > "debugfs".
> > 
> > Yu, do you have a test case for the "infinite loop" case?
> 
> Infinite loop, AFAICS, is reasonably easy to trigger - just open
> a non-empty subdirectory and lseek to e.g. next-to-last element
> in it.  Again, list_empty() use in there is quite wrong - it can
> give false negatives just on the cursors.  No arguments about
> that part...

FWIW, given that debugfs_remove_recursive() has no way to report an error
*and* that we have a single chokepoint for all entry creations (start_creating())
we could make sure nothing gets added pretty easily - just mark the victim
dentry as "don't allow any creations here" when we first reach it and
have start_creating check that, using e.g. inode_lock() for serialization.
Marking could be done e.g. by setting ->d_fsdata to
(void *)DEBUGFS_FSDATA_IS_REAL_FOPS_BIT, so that ->d_release() doesn't
need any changes.

BTW, this
                if (!ret)
                        d_delete(dentry);
                if (d_is_reg(dentry))
                        __debugfs_file_removed(dentry);
                dput(dentry);
in __debugfs_remove() is both subtle and bogus.  If we get here
with refcount > 1, d_delete() is just a d_drop() - dentry can't
be made negative, so it gets unhashed instead.  If we *do* get
here with refcount 1, it will be made negative, all right.
Only to be killed off by immediately following dput(), since
debugfs doesn't retain unpinned dentries.

Why immediate?  Because d_is_reg() is obviously false on
negative dentries, so __debugfs_file_removed() is not called.
That's where the subtle part begins: there should've been
nobody for __debugfs_file_removed() to wait for, since they
would've had to hold additional references to dentry and
refcount wouldn't have been 1.  So that's actually not
a bug.  However, it's too subtle to introduce without
having bothered to even comment the damn thing.

As for the "bogus" part - that d_delete() is bollocks.
First of all, it is fully equivalent to d_drop().  Always
had been.  What's more, that sucker should've been
d_invalidate() instead.

Anyway, AFAICS removal could be done this way:

// parent is held exclusive, after is NULL or a child of parent
find_next_child(parent, prev)
	child = NULL
	node = prev ? &prev->d_child : &parent->d_subdirs;
	grab parent->d_lock
	for each entry in the list starting at node->next
		d = container_of(entry, struct dentry, d_child)
		grab d->d_lock
		if simple_positive(d)
			bump d->d_count
			child = d
		drop d->d_lock
		if child
			break
	drop parent->d_lock
	dput(prev);
	return child

kill_it(victim, parent)
	if simple_positive(victim)
		d_invalidate(victim);	// needed to avoid lost mounts
		if victim is a directory
			fsnotify_rmdir
		else
			fsnotify_unlink
		if victim is regular
			__debugfs_file_removed(victim)
		dput(victim)		// unpin it

recursive_removal(dentry)
	this = dentry;
	while (true) {
		victim = NULL;
		inode = this->d_inode;
		inode_lock(inode);
		if (d_is_dir(this))
			mark this doomed
		while ((child = find_next_child(this, victim)) == NULL) {
			// no children (left); kill and ascend
			// update metadata while it's still locked
			inode->i_ctime = current_time(inode);
			clear_nlink(inode);
			inode_unlock(inode);
			victim = this;
			this = this->d_parent;
			inode = this->d_inode;
			inode_lock(inode);
			kill_it(victim, this);
			if (victim == dentry) {
				inode->i_ctime = inode->i_mtime = current_time(inode);
				if (d_is_dir(dentry))
					drop_nlink(inode);
				inode_unlock(inode);
				dput(dentry);
				return;
			}
		}
		inode_unlock(inode);
		this = child;
	}