Re: [PATCH v2] xattr: use rbtree for simple_xattrs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Nov 08, 2022 at 12:41:12PM +0100, Christian Brauner wrote:
> A while ago Vasily reported that it is possible to set a large number of
> xattrs on inodes of filesystems that make use of the simple xattr
> infrastructure. This includes all kernfs-based filesystems that support
> xattrs (e.g., cgroupfs) and tmpfs. Both cgroupfs and tmpfs can be
> mounted by unprivileged users in unprivileged containers and root in an
> unprivileged container can set an unrestricted number of security.*
> xattrs and privileged users can also set unlimited trusted.* xattrs. As
> there are apparently users that have a fairly large number of xattrs we
> should scale a bit better. Other xattrs such as user.* are restricted
> for kernfs-based instances to a fairly limited number.
> 
> Using a simple linked list protected by a spinlock used for set, get,
> and list operations doesn't scale well if users use a lot of xattrs even
> if it's not a crazy number. And There's no need to bring in the big guns
> like rhashtables or rw semaphors for this. An rbtree with a seqlock and
> limited rcu semantics is enough.
> 
> It scales within the constraints we are working in. By far the most
> common operations is getting an xattr. The get operation is optimized to
> be lock free as long as there are no writers. The list operation takes
> the read lock and protects against concurrent writers while allowing
> lockless get operations. Locking out other listxattr callers isn't a
> huge deal since listing xattrs is mostly relevant when copying a file or
> copying all xattrs between files.
> 
> Additionally, listxattr() doesn't list the values of xattrs it can only
> be used to list the names of all xattrs set on a file. And the number of
> xattr names that can be listed with listxattr() is limited to
> XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is passed then
> vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr names are
> found it will return -EFBIG. In short, the maximum amount of memory that
> can be retrieved via listxattr() is limited.
> 
> Of course, the API is broken as documented on xattr(7) already. In the
> future we might want to address this but for now this is the world we
> live in and have lived for a long time. But it does indeed mean that
> once an application goes over XATTR_LIST_MAX limit of xattrs set on an
> inode it isn't possible to copy the file and include its xattrs in the
> copy unless the caller knows all xattrs or limits the copy of the xattrs
> to important ones it knows by name (At least for tmpfs, and kernfs-based
> filesystems. Other filesystems might provide ways of achieving this.).
> 
> Also add proper kernel documentation to all the functions.
> A big thanks to Paul for his comments.
> 
> Cc: Vasily Averin <vvs@xxxxxxxxxx>
> Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxx>
> Signed-off-by: Christian Brauner (Microsoft) <brauner@xxxxxxxxxx>

Hi Christian!

This looks fancy, maybe even too fancy :)
I learned from reading your code.

Is there any specific usecase when we care that much about xattr
performance?

I've nothing against the lockless approach, but an rb-tree protected
by a rwlock/semaphore would be probably much simpler and still a great
improvement over the status quo.

Some small nits below.

> ---
> 
> Notes:
>     In addition to this patch I would like to propose that we restrict the number
>     of xattrs for the simple xattr infrastructure via XATTR_MAX_LIST bytes. In
>     other words, we restrict the number of xattrs for simple xattr filesystems to
>     the number of xattrs names that can be retrieved via listxattr(). That should
>     be about 2000 to 3000 xattrs per inode which is more than enough. We should
>     risk this and see if we get any regression reports from userswith this
>     approach.
>     
>     This should be as simple as adding a max_list member to struct simple_xattrs
>     and initialize it with XATTR_MAX_LIST. Set operations would then check against
>     this field whether the new xattr they are trying to set will fit and return
>     -EFBIG otherwise. I think that might be a good approach to get rid of the in
>     principle unbounded number of xattrs that can be set via the simple xattr
>     infrastructure. I think this is a regression risk worth taking.
>     
>     /* v2 */
>     Christian Brauner <brauner@xxxxxxxxxx>:
>     - Fix kernel doc.
>     - Remove accidental leftover union from previous iteration.
> 
>  fs/xattr.c            | 330 +++++++++++++++++++++++++++++++++---------
>  include/linux/xattr.h |  40 ++---
>  mm/shmem.c            |   2 +-
>  3 files changed, 270 insertions(+), 102 deletions(-)
> 
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 61107b6bbed2..f18454161d54 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -992,8 +992,63 @@ const char *xattr_full_name(const struct xattr_handler *handler,
>  }
>  EXPORT_SYMBOL(xattr_full_name);
>  
> -/*
> - * Allocate new xattr and copy in the value; but leave the name to callers.
> +/**
> + * free_simple_xattr - free an xattr object
> + * @xattr: the xattr object
> + *
> + * Free the xattr object. Can handle @xattr being NULL.
> + */
> +static inline void free_simple_xattr(struct simple_xattr *xattr)
> +{
> +	if (xattr)
> +		kfree(xattr->name);
> +	kvfree(xattr);
> +}
> +
> +/**
> + * free_simple_xattr_rcu_cb - callback for freeing xattr object through rcu
> + * @cb: the rcu callback head
> + */
> +static void free_simple_xattr_rcu_cb(struct callback_head *cb)
> +{
> +	free_simple_xattr(container_of(cb, struct simple_xattr, rcu));
> +}
> +
> +/**
> + * free_simple_xattr_rcu - free an xattr object with rcu semantics
> + * @xattr: the xattr object
> + */
> +static void free_simple_xattr_rcu(struct simple_xattr *xattr)
> +{
> +	call_rcu(&xattr->rcu, free_simple_xattr_rcu_cb);
> +}
> +
> +/**
> + * put_simple_xattr_rcu - decrement refcount for xattr object
> + * @xattr: the xattr object
> + *
> + * Decrement the reference count of an xattr object and free it using rcu
> + * semantics if we're the holder of the last reference. Can handle @xattr being
> + * NULL.
> + */
> +static inline void put_simple_xattr_rcu(struct simple_xattr *xattr)
> +{
> +	if (xattr && refcount_dec_and_test(&xattr->ref))
> +		free_simple_xattr_rcu(xattr);
> +}
> +
> +/**
> + * simple_xattr_alloc - allocate new xattr object
> + * @value: value of the xattr object
> + * @size: size of @value
> + *
> + * Allocate a new xattr object and initialize respective members. The caller is
> + * responsible for handling the name of the xattr.
> + *
> + * The initial reference count belongs to the rbtree.
> + *
> + * Return: On success a new xattr object is returned. On failure NULL is
> + * returned.
>   */
>  struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
>  {
> @@ -1011,57 +1066,99 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
>  
>  	new_xattr->size = size;
>  	memcpy(new_xattr->value, value, size);
> +	refcount_set(&new_xattr->ref, 1);

Better to use REFCOUNT_INIT() here to save an atomic operation.

>  	return new_xattr;
>  }
>  
> -/*
> - * xattr GET operation for in-memory/pseudo filesystems
> +/**
> + * simple_xattr_get - get an xattr object
> + * @xattrs: the header of the xattr object
> + * @name: the name of the xattr to retrieve
> + * @buffer: the buffer to store the value into
> + * @size: the size of @buffer
> + *
> + * Try to find and retrieve the xattr object associated with @name. If the
> + * object is found and still in the rbtree bump the reference count.
> + *
> + * If @buffer is provided store the value of @xattr in @buffer.
> + *
> + * Return: On success zero and on error a negative error code is returned.
>   */
>  int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
>  		     void *buffer, size_t size)
>  {
> -	struct simple_xattr *xattr;
> -	int ret = -ENODATA;
> -
> -	spin_lock(&xattrs->lock);
> -	list_for_each_entry(xattr, &xattrs->head, list) {
> -		if (strcmp(name, xattr->name))
> -			continue;
> -
> -		ret = xattr->size;
> -		if (buffer) {
> -			if (size < xattr->size)
> -				ret = -ERANGE;
> -			else
> -				memcpy(buffer, xattr->value, xattr->size);
> +	struct simple_xattr *xattr = NULL;
> +	struct rb_node *rbp;
> +	int ret, seq = 0;
> +
> +	rcu_read_lock();
> +	do {
> +		read_seqbegin_or_lock(&xattrs->lock, &seq);
> +		rbp = rcu_dereference(xattrs->rb_root.rb_node);
> +		while (rbp) {
> +			xattr = rb_entry(rbp, struct simple_xattr, rb_node);
> +			if (strcmp(xattr->name, name) < 0) {
> +				rbp = rcu_dereference(rbp->rb_left);
> +			} else if (strcmp(xattr->name, name) > 0) {
> +				rbp = rcu_dereference(rbp->rb_right);

strcmp() is potentially called twice with same arguments.

> +			} else {
> +				if (!likely(refcount_inc_not_zero(&xattr->ref)))
> +					xattr = NULL;
> +				break;
> +			}
> +			xattr = NULL;
>  		}
> -		break;

If there are many xattrs attached and there is a writer who constantly
changes something, won't readers loop here for a potentially very long time?

> +	} while (need_seqretry(&xattrs->lock, seq));
> +	done_seqretry(&xattrs->lock, seq);
> +	rcu_read_unlock();
> +
> +	if (!xattr)
> +		return -ENODATA;
> +
> +	ret = xattr->size;
> +	if (buffer) {
> +		if (size < xattr->size)
> +			ret = -ERANGE;
> +		else
> +			memcpy(buffer, xattr->value, xattr->size);
>  	}
> -	spin_unlock(&xattrs->lock);
> +
> +	put_simple_xattr_rcu(xattr);
>  	return ret;
>  }
>  
>  /**
> - * simple_xattr_set - xattr SET operation for in-memory/pseudo filesystems
> - * @xattrs: target simple_xattr list
> - * @name: name of the extended attribute
> - * @value: value of the xattr. If %NULL, will remove the attribute.
> - * @size: size of the new xattr
> - * @flags: %XATTR_{CREATE|REPLACE}
> - * @removed_size: returns size of the removed xattr, -1 if none removed
> + * simple_xattr_set - set an xattr object
> + * @xattrs: the header of the xattr object
> + * @name: the name of the xattr to retrieve
> + * @value: the value to store along the xattr
> + * @size: the size of @value
> + * @flags: the flags determining how to set the xattr
> + * @removed_size: the size of the removed xattr
> + *
> + * Set a new xattr object.
> + * If @value is passed a new xattr object will be allocated. If XATTR_REPLACE
> + * is specified in @flags a matching xattr object for @name must already exist.
> + * If it does it will be replace with the new xattr object. If it doesn't we
> + * fail. If XATTR_CREATE is specified and a matching xattr does already exist
> + * we fail. If it doesn't we create a new xattr. If @flags is zero we simply
> + * insert the new xattr replacing any existing one.
>   *
> - * %XATTR_CREATE is set, the xattr shouldn't exist already; otherwise fails
> - * with -EEXIST.  If %XATTR_REPLACE is set, the xattr should exist;
> - * otherwise, fails with -ENODATA.
> + * If @value is empty and a matching xattr object is found we delete it if
> + * XATTR_REPLACE is specified in @flags or @flags is zero.
>   *
> - * Returns 0 on success, -errno on failure.
> + * If @value is empty and no matching xattr object for @name is found we do
> + * nothing if XATTR_CREATE is specified in @flags or @flags is zero. For
> + * XATTR_REPLACE we fail as mentioned above.
> + *
> + * Return: On success zero and on error a negative error code is returned.
>   */
>  int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
>  		     const void *value, size_t size, int flags,
>  		     ssize_t *removed_size)
>  {
> -	struct simple_xattr *xattr;
> -	struct simple_xattr *new_xattr = NULL;
> +	struct simple_xattr *xattr = NULL, *new_xattr = NULL;
> +	struct rb_node *parent = NULL, **rbp;
>  	int err = 0;
>  
>  	if (removed_size)
> @@ -1080,37 +1177,64 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
>  		}
>  	}
>  
> -	spin_lock(&xattrs->lock);
> -	list_for_each_entry(xattr, &xattrs->head, list) {
> -		if (!strcmp(name, xattr->name)) {
> -			if (flags & XATTR_CREATE) {
> -				xattr = new_xattr;
> -				err = -EEXIST;
> -			} else if (new_xattr) {
> -				list_replace(&xattr->list, &new_xattr->list);
> -				if (removed_size)
> -					*removed_size = xattr->size;
> -			} else {
> -				list_del(&xattr->list);
> -				if (removed_size)
> -					*removed_size = xattr->size;
> -			}
> -			goto out;
> -		}
> -	}
> -	if (flags & XATTR_REPLACE) {
> -		xattr = new_xattr;
> -		err = -ENODATA;
> -	} else {
> -		list_add(&new_xattr->list, &xattrs->head);
> +	write_seqlock(&xattrs->lock);
> +	rbp = &xattrs->rb_root.rb_node;
> +	while (*rbp) {
> +		parent = *rbp;
> +		xattr = rb_entry(*rbp, struct simple_xattr, rb_node);
> +		if (strcmp(xattr->name, name) < 0)
> +			rbp = &(*rbp)->rb_left;
> +		else if (strcmp(xattr->name, name) > 0)

No reason to call strcmp() twice.

Thanks!



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux