Re: [PATCH v3] xattr: use rbtree for simple_xattrs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Nov 11, 2022 at 12:53:36PM +0100, Christian Brauner wrote:
> A while ago Vasily reported that it is possible to set a large number of
> xattrs on inodes of filesystems that make use of the simple xattr
> infrastructure. This includes all kernfs-based filesystems that support
> xattrs (e.g., cgroupfs) and tmpfs. Both cgroupfs and tmpfs can be
> mounted by unprivileged users in unprivileged containers and root in an
> unprivileged container can set an unrestricted number of security.*
> xattrs and privileged users can also set unlimited trusted.* xattrs. As
> there are apparently users that have a fairly large number of xattrs we
> should scale a bit better. Other xattrs such as user.* are restricted
> for kernfs-based instances to a fairly limited number.
> 
> Using a simple linked list protected by a spinlock used for set, get,
> and list operations doesn't scale well if users use a lot of xattrs even
> if it's not a crazy number. There's no need to bring in the big guns
> like rhashtables or rw semaphores for this. An rbtree with a rwlock, or
> limited rcu semanics and seqlock is enough.
> 
> It scales within the constraints we are working in. By far the most
> common operation is getting an xattr. Setting xattrs should be a
> moderately rare operation. And listxattr() often only happens when
> copying xattrs between files or with the filey. Holding a lock across
> listxattr() is unproblematic because it doesn't list the values of
> xattrs. It can only be used to list the names of all xattrs set on a
> file. And the number of xattr names that can be listed with listxattr()
> is limited to XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is
> passed then vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr
> names are found it will return -EFBIG. In short, the maximum amount of
> memory that can be retrieved via listxattr() is limited.
> 
> Of course, the API is broken as documented on xattr(7) already. In the
> future we might want to address this but for now this is the world we
> live in and have lived for a long time. But it does indeed mean that
> once an application goes over XATTR_LIST_MAX limit of xattrs set on an
> inode it isn't possible to copy the file and include its xattrs in the
> copy unless the caller knows all xattrs or limits the copy of the xattrs
> to important ones it knows by name (At least for tmpfs, and kernfs-based
> filesystems. Other filesystems might provide ways of achieving this.).
> 
> Bonus of this port to rbtree+rwlock is that we shrink the memory
> consumption for users of the simple xattr infrastructure.
> 
> Also add proper kernel documentation to all the functions.
> A big thanks to Paul for his comments.
> 
> Cc: Vasily Averin <vvs@xxxxxxxxxx>
> Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxx>
> Signed-off-by: Christian Brauner (Microsoft) <brauner@xxxxxxxxxx>

Acked-by: Paul E. McKenney <paulmck@xxxxxxxxxx>

> ---
> 
> Notes:
>     In the v1 and v2 of this patch we used an rbtree which was protected by an
>     rcu+seqlock combination. During the discussion it became clear that there's
>     some valid concern how safe it is to combine rcu with rbtrees. While most of
>     the issues are highly unlikely to matter in practice the code here can be
>     reached by unprivileged users rather directly so we should not be adventurous.
>     Instead of the rcu+seqlock combination simply use an rwlock. This will scale
>     sufficiently as well and had been one of the implementations I considered and
>     wrote a little while ago. Thanks to Paul for some deeper insights into issues
>     associated with rcu and rbtrees!
>     
>     In addition to this patch I would like to propose that we restrict the number
>     of xattrs for the simple xattr infrastructure via XATTR_MAX_LIST bytes. In
>     other words, we restrict the number of xattrs for simple xattr filesystems to
>     the number of xattrs names that can be retrieved via listxattr(). That should
>     be about 2000 to 3000 xattrs per inode which is more than enough. We should
>     risk this and see if we get any regression reports from userswith this
>     approach.
>     
>     This should be as simple as adding a max_list member to struct simple_xattrs
>     and initialize it with XATTR_MAX_LIST. Set operations would then check against
>     this field whether the new xattr they are trying to set will fit and return
>     -EFBIG otherwise. I think that might be a good approach to get rid of the in
>     principle unbounded number of xattrs that can be set via the simple xattr
>     infrastructure. I think this is a regression risk worth taking.
>     
>     /* v2 */
>     Christian Brauner <brauner@xxxxxxxxxx>:
>     - Fix kernel doc.
>     - Remove accidental leftover union from previous iteration.
>     
>     /* v3 */
>     Port the whole thing to use a simple rwlock instead of rcu+seqlock.
>     
>     "Paul E. McKenney" <paulmck@xxxxxxxxxx>:
>     - Fix simple_xattr_add() by searching the correct slot in the rbtree first.
>     
>     Roman Gushchin <roman.gushchin@xxxxxxxxx>:
>     - Avoid calling strcmp() multiple times.
> 
>  fs/xattr.c            | 317 +++++++++++++++++++++++++++++++++---------
>  include/linux/xattr.h |  38 ++---
>  mm/shmem.c            |   2 +-
>  3 files changed, 260 insertions(+), 97 deletions(-)
> 
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 61107b6bbed2..402d9d43fde0 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -992,8 +992,29 @@ const char *xattr_full_name(const struct xattr_handler *handler,
>  }
>  EXPORT_SYMBOL(xattr_full_name);
>  
> -/*
> - * Allocate new xattr and copy in the value; but leave the name to callers.
> +/**
> + * free_simple_xattr - free an xattr object
> + * @xattr: the xattr object
> + *
> + * Free the xattr object. Can handle @xattr being NULL.
> + */
> +static inline void free_simple_xattr(struct simple_xattr *xattr)
> +{
> +	if (xattr)
> +		kfree(xattr->name);
> +	kvfree(xattr);
> +}
> +
> +/**
> + * simple_xattr_alloc - allocate new xattr object
> + * @value: value of the xattr object
> + * @size: size of @value
> + *
> + * Allocate a new xattr object and initialize respective members. The caller is
> + * responsible for handling the name of the xattr.
> + *
> + * Return: On success a new xattr object is returned. On failure NULL is
> + * returned.
>   */
>  struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
>  {
> @@ -1014,20 +1035,69 @@ struct simple_xattr *simple_xattr_alloc(const void *value, size_t size)
>  	return new_xattr;
>  }
>  
> -/*
> - * xattr GET operation for in-memory/pseudo filesystems
> +/**
> + * rbtree_simple_xattr_cmp - compare xattr name with current rbtree xattr entry
> + * @key: xattr name
> + * @node: current node
> + *
> + * Compare the xattr name with the xattr name attached to @node in the rbtree.
> + *
> + * Return: Negative value if continuing left, positive if continuing right, 0
> + * if the xattr attached to @node matches @key.
> + */
> +static int rbtree_simple_xattr_cmp(const void *key, const struct rb_node *node)
> +{
> +	const char *xattr_name = key;
> +	const struct simple_xattr *xattr;
> +
> +	xattr = rb_entry(node, struct simple_xattr, rb_node);
> +	return strcmp(xattr->name, xattr_name);
> +}
> +
> +/**
> + * rbtree_simple_xattr_node_cmp - compare two xattr rbtree nodes
> + * @new_node: new node
> + * @node: current node
> + *
> + * Compare the xattr attached to @new_node with the xattr attached to @node.
> + *
> + * Return: Negative value if continuing left, positive if continuing right, 0
> + * if the xattr attached to @new_node matches the xattr attached to @node.
> + */
> +static int rbtree_simple_xattr_node_cmp(struct rb_node *new_node,
> +					const struct rb_node *node)
> +{
> +	struct simple_xattr *xattr;
> +	xattr = rb_entry(new_node, struct simple_xattr, rb_node);
> +	return rbtree_simple_xattr_cmp(xattr->name, node);
> +}
> +
> +/**
> + * simple_xattr_get - get an xattr object
> + * @xattrs: the header of the xattr object
> + * @name: the name of the xattr to retrieve
> + * @buffer: the buffer to store the value into
> + * @size: the size of @buffer
> + *
> + * Try to find and retrieve the xattr object associated with @name.
> + * If @buffer is provided store the value of @xattr in @buffer
> + * otherwise just return the length. The size of @buffer is limited
> + * to XATTR_SIZE_MAX which currently is 65536.
> + *
> + * Return: On success the length of the xattr value is returned. On error a
> + * negative error code is returned.
>   */
>  int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
>  		     void *buffer, size_t size)
>  {
> -	struct simple_xattr *xattr;
> +	struct simple_xattr *xattr = NULL;
> +	struct rb_node *rbp;
>  	int ret = -ENODATA;
>  
> -	spin_lock(&xattrs->lock);
> -	list_for_each_entry(xattr, &xattrs->head, list) {
> -		if (strcmp(name, xattr->name))
> -			continue;
> -
> +	read_lock(&xattrs->lock);
> +	rbp = rb_find(name, &xattrs->rb_root, rbtree_simple_xattr_cmp);
> +	if (rbp) {
> +		xattr = rb_entry(rbp, struct simple_xattr, rb_node);
>  		ret = xattr->size;
>  		if (buffer) {
>  			if (size < xattr->size)
> @@ -1035,34 +1105,44 @@ int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
>  			else
>  				memcpy(buffer, xattr->value, xattr->size);
>  		}
> -		break;
>  	}
> -	spin_unlock(&xattrs->lock);
> +	read_unlock(&xattrs->lock);
>  	return ret;
>  }
>  
>  /**
> - * simple_xattr_set - xattr SET operation for in-memory/pseudo filesystems
> - * @xattrs: target simple_xattr list
> - * @name: name of the extended attribute
> - * @value: value of the xattr. If %NULL, will remove the attribute.
> - * @size: size of the new xattr
> - * @flags: %XATTR_{CREATE|REPLACE}
> - * @removed_size: returns size of the removed xattr, -1 if none removed
> + * simple_xattr_set - set an xattr object
> + * @xattrs: the header of the xattr object
> + * @name: the name of the xattr to retrieve
> + * @value: the value to store along the xattr
> + * @size: the size of @value
> + * @flags: the flags determining how to set the xattr
> + * @removed_size: the size of the removed xattr
> + *
> + * Set a new xattr object.
> + * If @value is passed a new xattr object will be allocated. If XATTR_REPLACE
> + * is specified in @flags a matching xattr object for @name must already exist.
> + * If it does it will be replaced with the new xattr object. If it doesn't we
> + * fail. If XATTR_CREATE is specified and a matching xattr does already exist
> + * we fail. If it doesn't we create a new xattr. If @flags is zero we simply
> + * insert the new xattr replacing any existing one.
> + *
> + * If @value is empty and a matching xattr object is found we delete it if
> + * XATTR_REPLACE is specified in @flags or @flags is zero.
>   *
> - * %XATTR_CREATE is set, the xattr shouldn't exist already; otherwise fails
> - * with -EEXIST.  If %XATTR_REPLACE is set, the xattr should exist;
> - * otherwise, fails with -ENODATA.
> + * If @value is empty and no matching xattr object for @name is found we do
> + * nothing if XATTR_CREATE is specified in @flags or @flags is zero. For
> + * XATTR_REPLACE we fail as mentioned above.
>   *
> - * Returns 0 on success, -errno on failure.
> + * Return: On success zero and on error a negative error code is returned.
>   */
>  int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
>  		     const void *value, size_t size, int flags,
>  		     ssize_t *removed_size)
>  {
> -	struct simple_xattr *xattr;
> -	struct simple_xattr *new_xattr = NULL;
> -	int err = 0;
> +	struct simple_xattr *xattr = NULL, *new_xattr = NULL;
> +	struct rb_node *parent = NULL, **rbp;
> +	int err = 0, ret;
>  
>  	if (removed_size)
>  		*removed_size = -1;
> @@ -1075,42 +1155,68 @@ int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
>  
>  		new_xattr->name = kstrdup(name, GFP_KERNEL);
>  		if (!new_xattr->name) {
> -			kvfree(new_xattr);
> +			free_simple_xattr(new_xattr);
>  			return -ENOMEM;
>  		}
>  	}
>  
> -	spin_lock(&xattrs->lock);
> -	list_for_each_entry(xattr, &xattrs->head, list) {
> -		if (!strcmp(name, xattr->name)) {
> -			if (flags & XATTR_CREATE) {
> -				xattr = new_xattr;
> -				err = -EEXIST;
> -			} else if (new_xattr) {
> -				list_replace(&xattr->list, &new_xattr->list);
> -				if (removed_size)
> -					*removed_size = xattr->size;
> -			} else {
> -				list_del(&xattr->list);
> -				if (removed_size)
> -					*removed_size = xattr->size;
> -			}
> -			goto out;
> -		}
> -	}
> -	if (flags & XATTR_REPLACE) {
> -		xattr = new_xattr;
> -		err = -ENODATA;
> -	} else {
> -		list_add(&new_xattr->list, &xattrs->head);
> -		xattr = NULL;
> +	write_lock(&xattrs->lock);
> +	rbp = &xattrs->rb_root.rb_node;
> +	while (*rbp) {
> +		parent = *rbp;
> +		ret = rbtree_simple_xattr_cmp(name, *rbp);
> +		if (ret < 0)
> +			rbp = &(*rbp)->rb_left;
> +		else if (ret > 0)
> +			rbp = &(*rbp)->rb_right;
> +		else
> +			xattr = rb_entry(*rbp, struct simple_xattr, rb_node);
> +		if (xattr)
> +			break;
>  	}
> -out:
> -	spin_unlock(&xattrs->lock);
> +
>  	if (xattr) {
> -		kfree(xattr->name);
> -		kvfree(xattr);
> +		/* Fail if XATTR_CREATE is requested and the xattr exists. */
> +		if (flags & XATTR_CREATE) {
> +			err = -EEXIST;
> +			goto out_unlock;
> +		}
> +
> +		if (new_xattr)
> +			rb_replace_node(&xattr->rb_node, &new_xattr->rb_node,
> +					&xattrs->rb_root);
> +		else
> +			rb_erase(&xattr->rb_node, &xattrs->rb_root);
> +		if (!err && removed_size)
> +			*removed_size = xattr->size;
> +	} else {
> +		/* Fail if XATTR_REPLACE is requested but no xattr is found. */
> +		if (flags & XATTR_REPLACE) {
> +			err = -ENODATA;
> +			goto out_unlock;
> +		}
> +
> +		/*
> +		 * If XATTR_CREATE or no flags are specified together with a
> +		 * new value simply insert it.
> +		 */
> +		if (new_xattr) {
> +			rb_link_node(&new_xattr->rb_node, parent, rbp);
> +			rb_insert_color(&new_xattr->rb_node, &xattrs->rb_root);
> +		}
> +
> +		/*
> +		 * If XATTR_CREATE or no flags are specified and neither an
> +		 * old or new xattr exist then we don't need to do anything.
> +		 */
>  	}
> +
> +out_unlock:
> +	write_unlock(&xattrs->lock);
> +	if (err)
> +		free_simple_xattr(new_xattr);
> +	else
> +		free_simple_xattr(xattr);
>  	return err;
>  
>  }
> @@ -1134,14 +1240,31 @@ static int xattr_list_one(char **buffer, ssize_t *remaining_size,
>  	return 0;
>  }
>  
> -/*
> - * xattr LIST operation for in-memory/pseudo filesystems
> +/**
> + * simple_xattr_list - list all xattr objects
> + * @inode: inode from which to get the xattrs
> + * @xattrs: the header of the xattr object
> + * @buffer: the buffer to store all xattrs into
> + * @size: the size of @buffer
> + *
> + * List all xattrs associated with @inode. If @buffer is NULL we returned
> + * the required size of the buffer. If @buffer is provided we store the
> + * xattrs value into it provided it is big enough.
> + *
> + * Note, the number of xattr names that can be listed with listxattr(2) is
> + * limited to XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is passed
> + * then vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr names
> + * are found it will return -E2BIG.
> + *
> + * Return: On success the required size or the size of the copied xattrs is
> + * returned. On error a negative error code is returned.
>   */
>  ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
>  			  char *buffer, size_t size)
>  {
>  	bool trusted = capable(CAP_SYS_ADMIN);
>  	struct simple_xattr *xattr;
> +	struct rb_node *rbp;
>  	ssize_t remaining_size = size;
>  	int err = 0;
>  
> @@ -1162,8 +1285,10 @@ ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
>  	}
>  #endif
>  
> -	spin_lock(&xattrs->lock);
> -	list_for_each_entry(xattr, &xattrs->head, list) {
> +	read_lock(&xattrs->lock);
> +	for (rbp = rb_first(&xattrs->rb_root); rbp; rbp = rb_next(rbp)) {
> +		xattr = rb_entry(rbp, struct simple_xattr, rb_node);
> +
>  		/* skip "trusted." attributes for unprivileged callers */
>  		if (!trusted && xattr_is_trusted(xattr->name))
>  			continue;
> @@ -1172,18 +1297,76 @@ ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
>  		if (err)
>  			break;
>  	}
> -	spin_unlock(&xattrs->lock);
> +	read_unlock(&xattrs->lock);
>  
>  	return err ? err : size - remaining_size;
>  }
>  
> -/*
> - * Adds an extended attribute to the list
> +/**
> + * rbtree_simple_xattr_less - compare two xattr rbtree nodes
> + * @new_node: new node
> + * @node: current node
> + *
> + * Compare the xattr attached to @new_node with the xattr attached to @node.
> + * Note that this function technically tolerates duplicate entries.
> + *
> + * Return: True if insertion point in the rbtree is found.
>   */
> -void simple_xattr_list_add(struct simple_xattrs *xattrs,
> -			   struct simple_xattr *new_xattr)
> +static bool rbtree_simple_xattr_less(struct rb_node *new_node,
> +				     const struct rb_node *node)
>  {
> -	spin_lock(&xattrs->lock);
> -	list_add(&new_xattr->list, &xattrs->head);
> -	spin_unlock(&xattrs->lock);
> +	return rbtree_simple_xattr_node_cmp(new_node, node) < 0;
> +}
> +
> +/**
> + * simple_xattr_add - add xattr objects
> + * @xattrs: the header of the xattr object
> + * @new_xattr: the xattr object to add
> + *
> + * Add an xattr object to @xattrs. This assumes no replacement or removal
> + * of matching xattrs is wanted. Should only be called during inode
> + * initialization when a few distinct initial xattrs are supposed to be set.
> + */
> +void simple_xattr_add(struct simple_xattrs *xattrs,
> +		      struct simple_xattr *new_xattr)
> +{
> +	write_lock(&xattrs->lock);
> +	rb_add(&new_xattr->rb_node, &xattrs->rb_root, rbtree_simple_xattr_less);
> +	write_unlock(&xattrs->lock);
> +}
> +
> +/**
> + * simple_xattrs_init - initialize new xattr header
> + * @xattrs: header to initialize
> + *
> + * Initialize relevant fields of a an xattr header.
> + */
> +void simple_xattrs_init(struct simple_xattrs *xattrs)
> +{
> +	xattrs->rb_root = RB_ROOT;
> +	rwlock_init(&xattrs->lock);
> +}
> +
> +/**
> + * simple_xattrs_free - free xattrs
> + * @xattrs: xattr header whose xattrs to destroy
> + *
> + * Destroy all xattrs in @xattr. When this is called no one can hold a
> + * reference to any of the xattrs anymore.
> + */
> +void simple_xattrs_free(struct simple_xattrs *xattrs)
> +{
> +	struct rb_node *rbp;
> +
> +	rbp = rb_first(&xattrs->rb_root);
> +	while (rbp) {
> +		struct simple_xattr *xattr;
> +		struct rb_node *rbp_next;
> +
> +		rbp_next = rb_next(rbp);
> +		xattr = rb_entry(rbp, struct simple_xattr, rb_node);
> +		rb_erase(&xattr->rb_node, &xattrs->rb_root);
> +		free_simple_xattr(xattr);
> +		rbp = rbp_next;
> +	}
>  }
> diff --git a/include/linux/xattr.h b/include/linux/xattr.h
> index 4c379d23ec6e..b559c6bbcad0 100644
> --- a/include/linux/xattr.h
> +++ b/include/linux/xattr.h
> @@ -80,48 +80,28 @@ static inline const char *xattr_prefix(const struct xattr_handler *handler)
>  }
>  
>  struct simple_xattrs {
> -	struct list_head head;
> -	spinlock_t lock;
> +	struct rb_root rb_root;
> +	rwlock_t lock;
>  };
>  
>  struct simple_xattr {
> -	struct list_head list;
> +	struct rb_node rb_node;
>  	char *name;
>  	size_t size;
>  	char value[];
>  };
>  
> -/*
> - * initialize the simple_xattrs structure
> - */
> -static inline void simple_xattrs_init(struct simple_xattrs *xattrs)
> -{
> -	INIT_LIST_HEAD(&xattrs->head);
> -	spin_lock_init(&xattrs->lock);
> -}
> -
> -/*
> - * free all the xattrs
> - */
> -static inline void simple_xattrs_free(struct simple_xattrs *xattrs)
> -{
> -	struct simple_xattr *xattr, *node;
> -
> -	list_for_each_entry_safe(xattr, node, &xattrs->head, list) {
> -		kfree(xattr->name);
> -		kvfree(xattr);
> -	}
> -}
> -
> +void simple_xattrs_init(struct simple_xattrs *xattrs);
> +void simple_xattrs_free(struct simple_xattrs *xattrs);
>  struct simple_xattr *simple_xattr_alloc(const void *value, size_t size);
>  int simple_xattr_get(struct simple_xattrs *xattrs, const char *name,
>  		     void *buffer, size_t size);
>  int simple_xattr_set(struct simple_xattrs *xattrs, const char *name,
>  		     const void *value, size_t size, int flags,
>  		     ssize_t *removed_size);
> -ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs, char *buffer,
> -			  size_t size);
> -void simple_xattr_list_add(struct simple_xattrs *xattrs,
> -			   struct simple_xattr *new_xattr);
> +ssize_t simple_xattr_list(struct inode *inode, struct simple_xattrs *xattrs,
> +			  char *buffer, size_t size);
> +void simple_xattr_add(struct simple_xattrs *xattrs,
> +		      struct simple_xattr *new_xattr);
>  
>  #endif	/* _LINUX_XATTR_H */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 8280a5cb48df..2872e6607b2c 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3255,7 +3255,7 @@ static int shmem_initxattrs(struct inode *inode,
>  		memcpy(new_xattr->name + XATTR_SECURITY_PREFIX_LEN,
>  		       xattr->name, len);
>  
> -		simple_xattr_list_add(&info->xattrs, new_xattr);
> +		simple_xattr_add(&info->xattrs, new_xattr);
>  	}
>  
>  	return 0;
> 
> base-commit: 9abf2313adc1ca1b6180c508c25f22f9395cc780
> -- 
> 2.34.1
> 



[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux