Re: [PATCH v2] xattr: use rbtree for simple_xattrs

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Fri, 11 Nov 2022 10:45:09 -0800

On Fri, Nov 11, 2022 at 11:51:23AM +0100, Christian Brauner wrote:
> On Fri, Nov 11, 2022 at 12:13:35AM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 09, 2022 at 10:51:52AM +0100, Christian Brauner wrote:
> > > On Tue, Nov 08, 2022 at 10:59:04AM -0800, Paul E. McKenney wrote:
> > > > On Tue, Nov 08, 2022 at 12:41:12PM +0100, Christian Brauner wrote:
> > > > > A while ago Vasily reported that it is possible to set a large number of
> > > > > xattrs on inodes of filesystems that make use of the simple xattr
> > > > > infrastructure. This includes all kernfs-based filesystems that support
> > > > > xattrs (e.g., cgroupfs) and tmpfs. Both cgroupfs and tmpfs can be
> > > > > mounted by unprivileged users in unprivileged containers and root in an
> > > > > unprivileged container can set an unrestricted number of security.*
> > > > > xattrs and privileged users can also set unlimited trusted.* xattrs. As
> > > > > there are apparently users that have a fairly large number of xattrs we
> > > > > should scale a bit better. Other xattrs such as user.* are restricted
> > > > > for kernfs-based instances to a fairly limited number.
> > > > > 
> > > > > Using a simple linked list protected by a spinlock used for set, get,
> > > > > and list operations doesn't scale well if users use a lot of xattrs even
> > > > > if it's not a crazy number. And There's no need to bring in the big guns
> > > > > like rhashtables or rw semaphors for this. An rbtree with a seqlock and
> > > > > limited rcu semantics is enough.
> > > > > 
> > > > > It scales within the constraints we are working in. By far the most
> > > > > common operations is getting an xattr. The get operation is optimized to
> > > > > be lock free as long as there are no writers. The list operation takes
> > > > > the read lock and protects against concurrent writers while allowing
> > > > > lockless get operations. Locking out other listxattr callers isn't a
> > > > > huge deal since listing xattrs is mostly relevant when copying a file or
> > > > > copying all xattrs between files.
> > > > > 
> > > > > Additionally, listxattr() doesn't list the values of xattrs it can only
> > > > > be used to list the names of all xattrs set on a file. And the number of
> > > > > xattr names that can be listed with listxattr() is limited to
> > > > > XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is passed then
> > > > > vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr names are
> > > > > found it will return -EFBIG. In short, the maximum amount of memory that
> > > > > can be retrieved via listxattr() is limited.
> > > > > 
> > > > > Of course, the API is broken as documented on xattr(7) already. In the
> > > > > future we might want to address this but for now this is the world we
> > > > > live in and have lived for a long time. But it does indeed mean that
> > > > > once an application goes over XATTR_LIST_MAX limit of xattrs set on an
> > > > > inode it isn't possible to copy the file and include its xattrs in the
> > > > > copy unless the caller knows all xattrs or limits the copy of the xattrs
> > > > > to important ones it knows by name (At least for tmpfs, and kernfs-based
> > > > > filesystems. Other filesystems might provide ways of achieving this.).
> > > > > 
> > > > > Also add proper kernel documentation to all the functions.
> > > > > A big thanks to Paul for his comments.
> > > > > 
> > > > > Cc: Vasily Averin <vvs@xxxxxxxxxx>
> > > > > Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxx>
> > > > > Signed-off-by: Christian Brauner (Microsoft) <brauner@xxxxxxxxxx>
> > > > 
> > > > Looks mostly plausible from an RCU viewpoint, but there are a few
> > > > questions/comments inline below.
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > > > > ---
> > > > > 
> > > > > Notes:
> > > > >     In addition to this patch I would like to propose that we restrict the number
> > > > >     of xattrs for the simple xattr infrastructure via XATTR_MAX_LIST bytes. In
> > > > >     other words, we restrict the number of xattrs for simple xattr filesystems to
> > > > >     the number of xattrs names that can be retrieved via listxattr(). That should
> > > > >     be about 2000 to 3000 xattrs per inode which is more than enough. We should
> > > > >     risk this and see if we get any regression reports from userswith this
> > > > >     approach.
> > > > >     
> > > > >     This should be as simple as adding a max_list member to struct simple_xattrs
> > > > >     and initialize it with XATTR_MAX_LIST. Set operations would then check against
> > > > >     this field whether the new xattr they are trying to set will fit and return
> > > > >     -EFBIG otherwise. I think that might be a good approach to get rid of the in
> > > > >     principle unbounded number of xattrs that can be set via the simple xattr
> > > > >     infrastructure. I think this is a regression risk worth taking.
> > > > >     
> > > > >     /* v2 */
> > > > >     Christian Brauner <brauner@xxxxxxxxxx>:
> > > > >     - Fix kernel doc.
> > > > >     - Remove accidental leftover union from previous iteration.
> > > > > 
> > > > >  fs/xattr.c            | 330 +++++++++++++++++++++++++++++++++---------
> > > > >  include/linux/xattr.h |  40 ++---
> > > > >  mm/shmem.c            |   2 +-
> > > > >  3 files changed, 270 insertions(+), 102 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xattr.c b/fs/xattr.c
> > > > > index 61107b6bbed2..f18454161d54 100644
> > > > > --- a/fs/xattr.c
> > > > > +++ b/fs/xattr.c
> > > > > @@ -992,8 +992,63 @@ const char *xattr_full_name(const struct xattr_handler *handler,
> > > > >  }
> > > > >  EXPORT_SYMBOL(xattr_full_name);
> > > > >  
> > > > > -/*
> > > > > - * Allocate new xattr and copy in the value; but leave the name to callers.
> > > > > +/**
> > > > > + * free_simple_xattr - free an xattr object
> > > > > + * @xattr: the xattr object
> > > > > + *
> > > > > + * Free the xattr object. Can handle @xattr being NULL.
> > > > > + */
> > > > > +static inline void free_simple_xattr(struct simple_xattr *xattr)
> > > > > +{
> > > > > +	if (xattr)
> > > > > +		kfree(xattr->name);
> > > > > +	kvfree(xattr);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * free_simple_xattr_rcu_cb - callback for freeing xattr object through rcu
> > > > > + * @cb: the rcu callback head
> > > > > + */
> > > > > +static void free_simple_xattr_rcu_cb(struct callback_head *cb)
> > > > > +{
> > > > > +	free_simple_xattr(container_of(cb, struct simple_xattr, rcu));
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * free_simple_xattr_rcu - free an xattr object with rcu semantics
> > > > > + * @xattr: the xattr object
> > > > > + */
> > > > > +static void free_simple_xattr_rcu(struct simple_xattr *xattr)
> > > > > +{
> > > > > +	call_rcu(&xattr->rcu, free_simple_xattr_rcu_cb);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * put_simple_xattr_rcu - decrement refcount for xattr object
> > > > > + * @xattr: the xattr object
> > > > > + *
> > > > > + * Decrement the reference count of an xattr object and free it using rcu
> > > > > + * semantics if we're the holder of the last reference. Can handle @xattr being
> > > > > + * NULL.
> > > > > + */
> > > > > +static inline void put_simple_xattr_rcu(struct simple_xattr *xattr)
> > > > > +{
> > > > > +	if (xattr && refcount_dec_and_test(&xattr->ref))
> > > > > +		free_simple_xattr_rcu(xattr);
> > > > > +}
> > > > 
> > > > Looks like the standard combined reference counter and RCU combination,
> > > > goog!
> > > > 
> > > > > +
> > > > > +/**
> > > > > + * simple_xattr_alloc - allocate new xattr object
> > > > > + * @value: value of the xattr object
> > > > > + * @size: size of @value
> > > > > + *
> > > > > + * Allocate a new xattr object and initialize respective members. The caller is
> > > > > + * responsible for handling the name of the xattr.
> > > > > + *
> > > > > + * The initial reference count belongs to the rbtree.
> > > > > + *
> > > > > + * Return: On success a new xattr object is returned. On failure NULL is
> > > > > + * returned.
> > > > >   */
> > > > >  struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
> > > > >  {
> > > > > @@ -1011,57 +1066,99 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
> > > > >  
> > > > >  	new_xattr->size = size;
> > > > >  	memcpy(new_xattr->value, value, size);
> > > > > +	refcount_set(&new_xattr->ref, 1);
> > > > 
> > > > Yes, one is usually needed for the link in the tree.
> > > > 
> > > > >  	return new_xattr;
> > > > >  }
> > > > >  
> > > > > -/*
> > > > > - * xattr GET operation for in-memory/pseudo filesystems
> > > > > +/**
> > > > > + * simple_xattr_get - get an xattr object
> > > > > + * @xattrs: the header of the xattr object
> > > > > + * @name: the name of the xattr to retrieve
> > > > > + * @buffer: the buffer to store the value into
> > > > > + * @size: the size of @buffer
> > > > > + *
> > > > > + * Try to find and retrieve the xattr object associated with @name. If the
> > > > > + * object is found and still in the rbtree bump the reference count.
> > > > > + *
> > > > > + * If @buffer is provided store the value of @xattr in @buffer.
> > > > > + *
> > > > > + * Return: On success zero and on error a negative error code is returned.
> > > > >   */
> > > > >  int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
> > > > >  		     void *buffer, size_t size)
> > > > >  {
> > > > > -	struct simple_xattr *xattr;
> > > > > -	int ret = -ENODATA;
> > > > > -
> > > > > -	spin_lock(&xattrs->lock);
> > > > > -	list_for_each_entry(xattr, &xattrs->head, list) {
> > > > > -		if (strcmp(name, xattr->name))
> > > > > -			continue;
> > > > > -
> > > > > -		ret = xattr->size;
> > > > > -		if (buffer) {
> > > > > -			if (size < xattr->size)
> > > > > -				ret = -ERANGE;
> > > > > -			else
> > > > > -				memcpy(buffer, xattr->value, xattr->size);
> > > > > +	struct simple_xattr *xattr = NULL;
> > > > > +	struct rb_node *rbp;
> > > > > +	int ret, seq = 0;
> > > > > +
> > > > > +	rcu_read_lock();
> > > > > +	do {
> > > > > +		read_seqbegin_or_lock(&xattrs->lock, &seq);
> > > > 
> > > > It might be necessary to try a few times before grabbing the lock, but
> > > > perhaps we should actually hit the problem before increasing complexity.
> > > > 
> > > > > +		rbp = rcu_dereference(xattrs->rb_root.rb_node);
> > > > > +		while (rbp) {
> > > > > +			xattr = rb_entry(rbp, struct simple_xattr, rb_node);
> > > > > +			if (strcmp(xattr->name, name) < 0) {
> > > > > +				rbp = rcu_dereference(rbp->rb_left);
> > > > > +			} else if (strcmp(xattr->name, name) > 0) {
> > > > > +				rbp = rcu_dereference(rbp->rb_right);
> > > > > +			} else {
> > > > > +				if (!likely(refcount_inc_not_zero(&xattr->ref)))
> > > > > +					xattr = NULL;
> > > > > +				break;
> > > > > +			}
> > > > > +			xattr = NULL;
> > > > >  		}
> > > > 
> > > > Maybe this is too specialized, but should this be in the rbtree code,
> > > > perhaps rb_find_first_rcu(), but with an appropriate cmp() function?
> > > > The refcount_inc_not_zero() clearly needs to be in the caller.
> > > > 
> > > > If this is the only instance of this sort of code, it is likely not
> > > > really worthwhile.  But if we have several of these open coded, it would
> > > > be good to consolidate that code.
> > > 
> > > There's more than one instance of this pattern for sure.
> > 
> > OK, might be time to consolidate, then.  ;-)
> > 
> > > > > -		break;
> > > > > +	} while (need_seqretry(&xattrs->lock, seq));
> > > > > +	done_seqretry(&xattrs->lock, seq);
> > > > > +	rcu_read_unlock();
> > > > > +
> > > > > +	if (!xattr)
> > > > > +		return -ENODATA;
> > > > > +
> > > > > +	ret = xattr->size;
> > > > > +	if (buffer) {
> > > > > +		if (size < xattr->size)
> > > > > +			ret = -ERANGE;
> > > > > +		else
> > > > > +			memcpy(buffer, xattr->value, xattr->size);
> > > > 
> > > > If all we are doing is copying to an in-kernel buffer, why not dispense
> > > > with xattr->ref and do the memcpy() under rcu_read_lock() protection?
> > > > This would avoid some overhead from reference-count cache misses.
> > > > 
> > > > Of course, if that memcpy() can page fault or some such, then what
> > > > you have is necessary.  But this is all in-kernel, right?  And if not,
> > > > shouldn't the pointers be decorated with __user or some such?
> > > 
> > > This is just a regular in-kernel memcpy(). Good idea dispensing with the
> > > refcount completely.
> > 
> > Sounds good!
> > 
> > > > >  	}
> > > > > -	spin_unlock(&xattrs->lock);
> > > > > +
> > > > > +	put_simple_xattr_rcu(xattr);
> > > > >  	return ret;
> > > > >  }
> > > > >  
> > > > >  /**
> > > > > - * simple_xattr_set - xattr SET operation for in-memory/pseudo filesystems
> > > > > - * @xattrs: target simple_xattr list
> > > > > - * @name: name of the extended attribute
> > > > > - * @value: value of the xattr. If %NULL, will remove the attribute.
> > > > > - * @size: size of the new xattr
> > > > > - * @flags: %XATTR_{CREATE|REPLACE}
> > > > > - * @removed_size: returns size of the removed xattr, -1 if none removed
> > > > > + * simple_xattr_set - set an xattr object
> > > > > + * @xattrs: the header of the xattr object
> > > > > + * @name: the name of the xattr to retrieve
> > > > > + * @value: the value to store along the xattr
> > > > > + * @size: the size of @value
> > > > > + * @flags: the flags determining how to set the xattr
> > > > > + * @removed_size: the size of the removed xattr
> > > > > + *
> > > > > + * Set a new xattr object.
> > > > > + * If @value is passed a new xattr object will be allocated. If XATTR_REPLACE
> > > > > + * is specified in @flags a matching xattr object for @name must already exist.
> > > > > + * If it does it will be replace with the new xattr object. If it doesn't we
> > > > > + * fail. If XATTR_CREATE is specified and a matching xattr does already exist
> > > > > + * we fail. If it doesn't we create a new xattr. If @flags is zero we simply
> > > > > + * insert the new xattr replacing any existing one.
> > > > >   *
> > > > > - * %XATTR_CREATE is set, the xattr shouldn't exist already; otherwise fails
> > > > > - * with -EEXIST.  If %XATTR_REPLACE is set, the xattr should exist;
> > > > > - * otherwise, fails with -ENODATA.
> > > > > + * If @value is empty and a matching xattr object is found we delete it if
> > > > > + * XATTR_REPLACE is specified in @flags or @flags is zero.
> > > > >   *
> > > > > - * Returns 0 on success, -errno on failure.
> > > > > + * If @value is empty and no matching xattr object for @name is found we do
> > > > > + * nothing if XATTR_CREATE is specified in @flags or @flags is zero. For
> > > > > + * XATTR_REPLACE we fail as mentioned above.
> > > > > + *
> > > > > + * Return: On success zero and on error a negative error code is returned.
> > > > >   */
> > > > >  int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> > > > >  		     const void *value, size_t size, int flags,
> > > > >  		     ssize_t *removed_size)
> > > > >  {
> > > > > -	struct simple_xattr *xattr;
> > > > > -	struct simple_xattr *new_xattr = NULL;
> > > > > +	struct simple_xattr *xattr = NULL, *new_xattr = NULL;
> > > > > +	struct rb_node *parent = NULL, **rbp;
> > > > >  	int err = 0;
> > > > >  
> > > > >  	if (removed_size)
> > > > > @@ -1080,37 +1177,64 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
> > > > >  		}
> > > > >  	}
> > > > >  
> > > > > -	spin_lock(&xattrs->lock);
> > > > > -	list_for_each_entry(xattr, &xattrs->head, list) {
> > > > > -		if (!strcmp(name, xattr->name)) {
> > > > > -			if (flags & XATTR_CREATE) {
> > > > > -				xattr = new_xattr;
> > > > > -				err = -EEXIST;
> > > > > -			} else if (new_xattr) {
> > > > > -				list_replace(&xattr->list, &new_xattr->list);
> > > > > -				if (removed_size)
> > > > > -					*removed_size = xattr->size;
> > > > > -			} else {
> > > > > -				list_del(&xattr->list);
> > > > > -				if (removed_size)
> > > > > -					*removed_size = xattr->size;
> > > > > -			}
> > > > > -			goto out;
> > > > > -		}
> > > > > -	}
> > > > > -	if (flags & XATTR_REPLACE) {
> > > > > -		xattr = new_xattr;
> > > > > -		err = -ENODATA;
> > > > > -	} else {
> > > > > -		list_add(&new_xattr->list, &xattrs->head);
> > > > > +	write_seqlock(&xattrs->lock);
> > > > > +	rbp = &xattrs->rb_root.rb_node;
> > > > > +	while (*rbp) {
> > > > > +		parent = *rbp;
> > > > > +		xattr = rb_entry(*rbp, struct simple_xattr, rb_node);
> > > > > +		if (strcmp(xattr->name, name) < 0)
> > > > > +			rbp = &(*rbp)->rb_left;
> > > > > +		else if (strcmp(xattr->name, name) > 0)
> > > > > +			rbp = &(*rbp)->rb_right;
> > > > > +		else
> > > > > +			break;
> > > > >  		xattr = NULL;
> > > > >  	}
> > > > > -out:
> > > > > -	spin_unlock(&xattrs->lock);
> > > > > +
> > > > >  	if (xattr) {
> > > > > -		kfree(xattr->name);
> > > > > -		kvfree(xattr);
> > > > > +		/* Fail if XATTR_CREATE is requested and the xattr exists. */
> > > > > +		if (flags & XATTR_CREATE) {
> > > > > +			err = -EEXIST;
> > > > > +			goto out_unlock;
> > > > > +		}
> > > > > +
> > > > > +		if (new_xattr)
> > > > > +			rb_replace_node_rcu(&xattr->rb_node,
> > > > > +					    &new_xattr->rb_node,
> > > > > +					    &xattrs->rb_root);
> > > > > +		else
> > > > > +			rb_erase(&xattr->rb_node, &xattrs->rb_root);
> > > > 
> > > > Is rb_erase() RCU-reader-safe?  It is not immediately obvious to me that it
> > > > is.  I would expect an rcu_assign_pointer() or three in there somewhere.
> > > 
> > > This question send me down an interesting rabbit hole which is
> > > (unironically) excellent. Afaiu, yes rb_erase() isn't rcu safe. And our
> > > rbtree implementation in general isn't rcu safe in the face of
> > > rebalances (I think you linked to a paper that illustrates how one would
> > > need to go about doing this though.). So all users deal with that using
> > > the seq+rcu-rbtree pattern. I trust you'll correct me if I'm wrong: The
> > > assumption of current users seems to be that searching the rbtree under
> > > the rcu lock - i.e., without having taken the read seqlock - is safe
> > > against oops or iow against corrupt pointers if a writer changes the
> > > rbtree during the walk. The seqlock is supposed to prevent garbage/stale
> > > results. At least all users of this pattern seem to rely on the fact
> > > that a rcu only walk cannot oops. Is that assumption safe?
> > 
> > Kind of, maybe?
> > 
> > To be fully safe, you need to check the seqlock during the actual tree
> > traversal.  Otherwise, an unfortunate series of rebalances or rotations
> > or whatever could potentially run a given reader forever in circles.
> > Though the odds of this actually continuing for any length of time are
> > of course extremely low.
> > 
> > And unmarked accesses to shared variables having conflicting concurrent
> > accesses is playing with fire, though one hopes that stores of pointers
> > would end up wsing single machine instructions.  In theory, there are
> > a number of ways that compiler optimizations could cause trouble.
> 
> Since this is code reachable from unprivileged userspace I'd rather not
> play with fire here at all. Even if all of this might turn out to be
> just a theoretical possibility.
> 
> I think rcu+seq+rbtree would've made for an interesting, sufficiently
> scalable solution with still limited complexity. It's not really too
> hard to understand and isn't bloated. But given that we have some valid
> concerns we should de-fancy-fy and use an rwlock_t. Roman alrady
> mentioned this and I had this is an implementation before this one
> (together with rhashtable as I've mentioned). I'll send out the patches
> shortly. I'm just finishing tests.

Fair enough!

> > > Another point that popped up discussing this was that there's an
> > > unlikely attack scenario that David (Howells) made me aware of. While
> > > walking the rbtree under rcu only without the read seqlock it is
> > > theoretically possible for the rbtree to keep being expanded and cause
> > > the rcu walk to "never finish". Though I'm unsure whether that's really
> > > a practically possible attack.
> > 
> > Perhaps David's scenario is similar to my eternal rebalance?
> 
> Yeah, it certainly was.
> 
> > 
> > > Very interested to hear your thoughts here!
> > 
> > I would of course feel better if there were fewer vulnerabilities.
> 
> I agree so we should change the implementation as mentioned above.
> 
> > 
> > > > > +		if (!err && removed_size)
> > > > > +			*removed_size = xattr->size;
> > > > > +	} else {
> > > > > +		/* Fail if XATTR_REPLACE is requested but no xattr is found. */
> > > > > +		if (flags & XATTR_REPLACE) {
> > > > > +			err = -ENODATA;
> > > > > +			goto out_unlock;
> > > > > +		}
> > > > > +
> > > > > +		/*
> > > > > +		 * If XATTR_CREATE or no flags are specified together with a
> > > > > +		 * new value simply insert it.
> > > > > +		 */
> > > > > +		if (new_xattr) {
> > > > > +			rb_link_node_rcu(&new_xattr->rb_node, parent, rbp);
> > > > > +			rb_insert_color(&new_xattr->rb_node, &xattrs->rb_root);
> > > > > +		}
> > > > > +
> > > > > +		/*
> > > > > +		 * If XATTR_CREATE or no flags are specified and neither an old
> > > > > +		 * or new xattr were found/exist then we don't need to do
> > > > > +		 * anything.
> > > > > +		 */
> > > > 
> > > > As before, some of this looks like it should be in the rbtree implementation.
> > > > On the other hand, and add-or-replace function like this might be rare.  And
> > > > there are only two other occurrences, both of which look quite specialize.
> > > 
> > > If you think the current rcu search pattern is sufficiently safe even
> > > against rb_erase() then we might want to give the two callers/users at
> > > least rb_find_first_rcu() directly in rbtree.h which you suggested
> > > further above.
> > 
> > If we are going to have RCU-protected rbtrees, it might be good for the
> > rbtree code to know that it is supposed to handle RCU traversals.  ;-)
> 
> I honestly think this would make for a pretty nifty bachelor thesis
> project. :)

Makes sense to me!  You have a student in mind?

							Thanx, Paul