Re: How to abuse RCU to scan the children of a mount?

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Wed, 4 Mar 2020 11:28:16 -0800

On Wed, Mar 04, 2020 at 05:45:16PM +0000, David Howells wrote:
> Hi Paul,
> 
> As part of trying to implement a filesystem information querying system call,
> I need to be able to return a list of child mounts of a mount.  Children,
> however, may be moved with "mount --move", which means that the list can't be
> accessed with normal RCU practices.
> 
> For reference, I'm dealing with struct mount and mnt_mounts and mnt_child and
> struct mount is released via call_rcu().
> 
> What does rcu_dereference() guarantee, exactly?  It's just that the next hop
> is set up when you follow the pointer, right?
> 
> Can I do something like the attached?  The mount 'm' is pinned, but I need to
> trawl m->mnt_mounts.  mount_lock is a seqlock that ticks when mount topology
> is rearranged.  I *think* it covers ->mnt_mounts.

I took a quick but non-exhaustive look, and didn't find any exceptions.
However, a stress test with lockdep enabled and with the addition of
appropriate lockdep assertions might be helpful here.  I trust lockdep
quite a bit more than I trust my quick looks.  ;-)

>                                                    Whilst trawling in
> non-locked mode, I keep an eye on the seq counter and if it changes, the list
> may have been altered and I need to get the real lock and restart.
> 
> The objects shouldn't disappear or be destroyed, so I think I'm safe.

Famous last words!  ;-)

Huh.  The mount structure isn't suffering from a shortage of list_head
structures, is it?

So the following can happen, then?

o	The __attach_mnt() function adds a struct mount to its parent
	list, but in a non-RCU manner.	Unless there is some other
	safeguard, the list_add_tail() in this function needs to be
	list_add_tail_rcu().

o	I am assuming that the various non-RCU traversals that I see,
	for example, next_mnt(), are protected by lock_mount_hash().
	Especially skip_mnt_tree(), which uses mnt_mounts.prev.  (I didn't
	find any exceptions, but I don't claim an exhaustive search.)

o	The umount_tree() function's use of list_del_init() looks like
	it could trap an RCU reader in the newly singular list formed
	by the removal.  It appears that there are other functions that
	use list_del_init() on this list, though I cannot claim any sort
	of familiarity with this code.

	So, do you need to add a check for child->mnt_child being in this
	self-referential state within fsinfo_generic_mount_children()?

	Plus list_del_init() doesn't mark its stores, though
	some would argue that unmarked stores are OK in this situation.

o	There might be other operations in need of RCU-ification.

	Maybe the list_add_tail() in umount_tree(), but it is not
	immediately clear that this is adding a new element instead of
	re-inserting an element already exposed to readers.

> Thanks,
> David
> ---
> 
> int fsinfo_generic_mount_children(struct path *path, struct fsinfo_context *ctx)
> {
> 	struct fsinfo_mount_child record = {};
> 	struct mount *m, *child;
> 	int seq = 0;
> 
> 	m = real_mount(path->mnt);

What keeps the thing referenced by "m" from going away?  Presumably the
caller has nailed it down somehow, but I have to ask...

> 	rcu_read_lock();
> 	do {
> 		ctx->usage = 0;
> 		read_seqbegin_or_lock(&mount_lock, &seq);

Aside: If there was an updater holding the lock along with a flood of
readers, everyone would hereafter acquire the lock.  Or am I missing
a trick here?  (I would have expected it to try a few times, and only
then acquire the lock.  Yeah, I know, more state would be required.)

> 		list_for_each_entry_rcu(child, &m->mnt_mounts, mnt_child) {
> 			if (need_seqretry(&mount_lock, seq))
> 				break;
> 			if (child->mnt_parent != m)
> 				continue;
> 			record.mnt_unique_id = child->mnt_unique_id;
> 			record.mnt_id = child->mnt_id;
> 			record.notify_sum = 0;
> #ifdef CONFIG_SB_NOTIFICATIONS
> 			record.notify_sum +=
> 				atomic_read(&child->mnt.mnt_sb->s_change_counter) +
> 				atomic_read(&child->mnt.mnt_sb->s_notify_counter);
> #endif
> #ifdef CONFIG_MOUNT_NOTIFICATIONS
> 			record.notify_sum +=
> 				atomic_read(&child->mnt_attr_changes) +
> 				atomic_read(&child->mnt_topology_changes) +
> 				atomic_read(&child->mnt_subtree_notifications);
> #endif
> 			store_mount_fsinfo(ctx, &record);
> 		}
> 	} while (need_seqretry(&mount_lock, seq));
> 	done_seqretry(&mount_lock, seq);
> 
> 	rcu_read_unlock();
> 
> 	/* End the list with a copy of the parameter mount's details so that
> 	 * userspace can quickly check for changes.
> 	 */
> 	record.mnt_unique_id = m->mnt_unique_id;
> 	record.mnt_id = m->mnt_id;
> 	record.notify_sum = 0;
> #ifdef CONFIG_SB_NOTIFICATIONS
> 	record.notify_sum +=
> 		atomic_read(&m->mnt.mnt_sb->s_change_counter) +
> 		atomic_read(&m->mnt.mnt_sb->s_notify_counter);
> #endif
> #ifdef CONFIG_MOUNT_NOTIFICATIONS
> 	record.notify_sum +=
> 		atomic_read(&m->mnt_attr_changes) +
> 		atomic_read(&m->mnt_topology_changes) +
> 		atomic_read(&m->mnt_subtree_notifications);
> #endif
> 	store_mount_fsinfo(ctx, &record);
> 	return ctx->usage;
> }

Other than that, looks legit!  ;-)

							Thanx, Paul