Re: [RFC PATCH] getvalues(2) prototype

Christian Brauner <brauner@xxxxxxxxxx> · Wed, 23 Mar 2022 12:42:15 +0100

On Tue, Mar 22, 2022 at 08:27:12PM +0100, Miklos Szeredi wrote:
> Add a new userspace API that allows getting multiple short values in a
> single syscall.
> 
> This would be useful for the following reasons:
> 
> - Calling open/read/close for many small files is inefficient.  E.g. on my
>   desktop invoking lsof(1) results in ~60k open + read + close calls under
>   /proc and 90% of those are 128 bytes or less.
> 
> - Interfaces for getting various attributes and statistics are fragmented.
>   For files we have basic stat, statx, extended attributes, file attributes
>   (for which there are two overlapping ioctl interfaces).  For mounts and
>   superblocks we have stat*fs as well as /proc/$PID/{mountinfo,mountstats}.
>   The latter also has the problem on not allowing queries on a specific
>   mount.
> 
> - Some attributes are cheap to generate, some are expensive.  Allowing
>   userspace to select which ones it needs should allow optimizing queries.
> 
> - Adding an ascii namespace should allow easy extension and self
>   description.
> 
> - The values can be text or binary, whichever is fits best.
> 
> The interface definition is:
> 
> struct name_val {
> 	const char *name;	/* in */
> 	struct iovec value_in;	/* in */
> 	struct iovec value_out;	/* out */
> 	uint32_t error;		/* out */
> 	uint32_t reserved;
> };

Yes, we really need a way to query for various fs information. I'm a bit
torn about the details of this interface though. I would really like if
we had interfaces that are really easy to use from userspace comparable
to statx for example. I know having this generic as possible was the
goal but I'm just a bit uneasy with such interfaces. They become
cumbersome to use in userspace. I'm not sure if the data: part for
example should be in this at all. That seems a bit out of place to me.

Would it be really that bad if we added multiple syscalls for different
types of info? For example, querying mount information could reasonably
be a more focussed separate system call allowing to retrieve detailed
mount propagation info, flags, idmappings and so on. Prior approaches to
solve this in a completely generic way have gotten us not very far too
so I'm a bit worried about this aspect too.

For performance concerns, once those multiple system calls are in place
they will naturally be made available in io_uring and so if people do
really have performance issues they can rely on io_uring (to e.g.,
offload system call cost).

> 
> int getvalues(int dfd, const char *path, struct name_val *vec, size_t num,
> 	      unsigned int flags);
> 
> @dfd and @path are used to lookup object $ORIGIN.  @vec contains @num
> name/value descriptors.  @flags contains lookup flags for @path.
> 
> The syscall returns the number of values filled or an error.
> 
> A single name/value descriptor has the following fields:
> 
> @name describes the object whose value is to be returned.  E.g.
> 
> mnt                    - list of mount parameters
> mnt:mountpoint         - the mountpoint of the mount of $ORIGIN
> mntns                  - list of mount ID's reachable from the current root
> mntns:21:parentid      - parent ID of the mount with ID of 21
> xattr:security.selinux - the security.selinux extended attribute
> data:foo/bar           - the data contained in file $ORIGIN/foo/bar
> 
> If the name starts with the separator, then it is taken to have the same
> prefix as the previous name/value descriptor.  E.g. in the following
> sequence of names the second one is equivalent to mnt:parentid:
> 
> mnt:mountpoint
> :parentid
> 
> @value_in supplies the buffer to store value(s) in.  If a subsequent
> name/value descriptor has NULL value of value_in.iov_base, then the buffer
> from the previous name/value descriptor will be used.  This way it's
> possible to use a shared buffer for multiple values.
> 
> The starting address and length of the actual value will be stored in
> @value_out, unless an error has occurred in which case @error will be set to
> the positive errno value.
> 
> Multiple names starting with the same prefix (including the shorthand form)
> may also be batched together under the same lock, so the order of the names
> can determine atomicity.
> 
> Signed-off-by: Miklos Szeredi <mszeredi@xxxxxxxxxx>
> ---
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  fs/Makefile                            |   2 +-
>  fs/mount.h                             |   8 +
>  fs/namespace.c                         |  42 ++
>  fs/proc_namespace.c                    |   2 +-
>  fs/values.c                            | 524 +++++++++++++++++++++++++
>  6 files changed, 577 insertions(+), 2 deletions(-)
>  create mode 100644 fs/values.c
> 
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index c84d12608cd2..c72668001b39 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -372,6 +372,7 @@
>  448	common	process_mrelease	sys_process_mrelease
>  449	common	futex_waitv		sys_futex_waitv
>  450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
> +451	common	getvalues		sys_getvalues
>  
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/fs/Makefile b/fs/Makefile
> index 208a74e0b00e..f00d6bcd1178 100644
> --- a/fs/Makefile
> +++ b/fs/Makefile
> @@ -16,7 +16,7 @@ obj-y :=	open.o read_write.o file_table.o super.o \
>  		pnode.o splice.o sync.o utimes.o d_path.o \
>  		stack.o fs_struct.o statfs.o fs_pin.o nsfs.o \
>  		fs_types.o fs_context.o fs_parser.o fsopen.o init.o \
> -		kernel_read_file.o remap_range.o
> +		kernel_read_file.o remap_range.o values.o
>  
>  ifeq ($(CONFIG_BLOCK),y)
>  obj-y +=	buffer.o direct-io.o mpage.o
> diff --git a/fs/mount.h b/fs/mount.h
> index 0b6e08cf8afb..a3ca5233e481 100644
> --- a/fs/mount.h
> +++ b/fs/mount.h
> @@ -148,3 +148,11 @@ static inline bool is_anon_ns(struct mnt_namespace *ns)
>  }
>  
>  extern void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor);
> +
> +extern void namespace_lock_read(void);
> +extern void namespace_unlock_read(void);
> +extern void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt);
> +extern void seq_mnt_list(struct seq_file *seq, struct mnt_namespace *ns,
> +			 struct path *root);
> +extern struct vfsmount *mnt_lookup_by_id(struct mnt_namespace *ns,
> +					 struct path *root, int id);
> diff --git a/fs/namespace.c b/fs/namespace.c
> index de6fae84f1a1..52b15c17251f 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1405,6 +1405,38 @@ void mnt_cursor_del(struct mnt_namespace *ns, struct mount *cursor)
>  }
>  #endif  /* CONFIG_PROC_FS */
>  
> +void seq_mnt_list(struct seq_file *seq, struct mnt_namespace *ns,
> +		  struct path *root)
> +{
> +	struct mount *m;
> +
> +	down_read(&namespace_sem);
> +	for (m = mnt_list_next(ns, &ns->list); m; m = mnt_list_next(ns, &m->mnt_list)) {
> +		if (is_path_reachable(m, m->mnt.mnt_root, root)) {
> +			seq_printf(seq, "%i", m->mnt_id);
> +			seq_putc(seq, '\0');
> +		}
> +	}
> +	up_read(&namespace_sem);
> +}
> +
> +/* called with namespace_sem held for read */
> +struct vfsmount *mnt_lookup_by_id(struct mnt_namespace *ns, struct path *root,
> +				  int id)
> +{
> +	struct mount *m;
> +
> +	for (m = mnt_list_next(ns, &ns->list); m; m = mnt_list_next(ns, &m->mnt_list)) {
> +		if (m->mnt_id == id) {
> +			if (is_path_reachable(m, m->mnt.mnt_root, root))
> +				return mntget(&m->mnt);
> +			else
> +				return NULL;
> +		}
> +	}
> +	return NULL;
> +}
> +
>  /**
>   * may_umount_tree - check if a mount tree is busy
>   * @m: root of mount tree
> @@ -1494,6 +1526,16 @@ static inline void namespace_lock(void)
>  	down_write(&namespace_sem);
>  }
>  
> +void namespace_lock_read(void)
> +{
> +	down_read(&namespace_sem);
> +}
> +
> +void namespace_unlock_read(void)
> +{
> +	up_read(&namespace_sem);
> +}
> +
>  enum umount_tree_flags {
>  	UMOUNT_SYNC = 1,
>  	UMOUNT_PROPAGATE = 2,
> diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
> index 49650e54d2f8..fa6dc2c20578 100644
> --- a/fs/proc_namespace.c
> +++ b/fs/proc_namespace.c
> @@ -61,7 +61,7 @@ static int show_sb_opts(struct seq_file *m, struct super_block *sb)
>  	return security_sb_show_options(m, sb);
>  }
>  
> -static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
> +void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
>  {
>  	static const struct proc_fs_opts mnt_opts[] = {
>  		{ MNT_NOSUID, ",nosuid" },
> diff --git a/fs/values.c b/fs/values.c
> new file mode 100644
> index 000000000000..618fa9bf48a1
> --- /dev/null
> +++ b/fs/values.c
> @@ -0,0 +1,524 @@
> +#include <linux/syscalls.h>
> +#include <linux/printk.h>
> +#include <linux/namei.h>
> +#include <linux/fs_struct.h>
> +#include <linux/posix_acl_xattr.h>
> +#include <linux/xattr.h>
> +#include "pnode.h"
> +#include "internal.h"
> +
> +#define VAL_GRSEP ':'
> +
> +struct name_val {
> +	const char __user *name;	/* in */
> +	struct iovec value_in;		/* in */
> +	struct iovec value_out;		/* out */
> +	__u32 error;			/* out */
> +	__u32 reserved;
> +};
> +
> +struct val_iter {
> +	struct name_val __user *curr;
> +	size_t num;
> +	struct iovec vec;
> +	char name[256];
> +	size_t bufsize;
> +	struct seq_file seq;
> +	const char *prefix;
> +	const char *sub;
> +};
> +
> +struct val_desc {
> +	const char *name;
> +	union {
> +		int idx;
> +		int (*get)(struct val_iter *vi, const struct path *path);
> +	};
> +};
> +
> +static int val_get(struct val_iter *vi)
> +{
> +	struct name_val nameval;
> +	long err;
> +
> +	if (copy_from_user(&nameval, vi->curr, sizeof(nameval)))
> +		return -EFAULT;
> +
> +	err = strncpy_from_user(vi->name, nameval.name, sizeof(vi->name));
> +	if (err < 0)
> +		return err;
> +	if (err == sizeof(vi->name))
> +		return -ERANGE;
> +
> +	if (nameval.value_in.iov_base)
> +		vi->vec = nameval.value_in;
> +
> +	vi->seq.size = min(vi->vec.iov_len, vi->bufsize);
> +	vi->seq.count = 0;
> +
> +	return 0;
> +}
> +
> +static int val_next(struct val_iter *vi)
> +{
> +	vi->curr++;
> +	vi->num--;
> +
> +	return vi->num ? val_get(vi) : 0;
> +}
> +
> +static int val_end(struct val_iter *vi, size_t count)
> +{
> +	struct iovec iov = {
> +		.iov_base = vi->vec.iov_base,
> +		.iov_len = count,
> +	};
> +
> +	if (copy_to_user(&vi->curr->value_out, &iov, sizeof(iov)))
> +		return -EFAULT;
> +
> +	vi->vec.iov_base += count;
> +	vi->vec.iov_len -= count;
> +
> +	return val_next(vi);
> +}
> +
> +static int val_err(struct val_iter *vi, int err)
> +{
> +	if (put_user(-err, &vi->curr->error))
> +		return -EFAULT;
> +
> +	return val_next(vi);
> +}
> +
> +static int val_end_seq(struct val_iter *vi, int err)
> +{
> +	size_t count = vi->seq.count;
> +
> +	if (err)
> +		return val_err(vi, err);
> +
> +	if (count == vi->seq.size)
> +		return -EOVERFLOW;
> +
> +	if (copy_to_user(vi->vec.iov_base, vi->seq.buf, count))
> +		return -EFAULT;
> +
> +	return val_end(vi, count);
> +}
> +
> +static struct val_desc *val_lookup(struct val_iter *vi, struct val_desc *vd)
> +{
> +	const char *name = vi->name;
> +	const char *prefix = vi->prefix;
> +	size_t prefixlen = strlen(prefix);
> +
> +	if (prefixlen) {
> +		/*
> +		 * Name beggining with a group separator is a shorthand for
> +		 * previously prefix.
> +		 */
> +		if (name[0] == VAL_GRSEP) {
> +			name++;
> +		} else  {
> +			if (strncmp(name, prefix, prefixlen) != 0)
> +				return NULL;
> +			name += prefixlen;
> +		}
> +	}
> +
> +	vi->sub = NULL;
> +	for (; vd->name; vd++) {
> +		if (strcmp(name, vd->name) == 0)
> +			break;
> +		else {
> +			size_t grlen = strlen(vd->name);
> +
> +			if (strncmp(vd->name, name, grlen) == 0 &&
> +			    name[grlen] == VAL_GRSEP) {
> +				vi->sub = name + grlen + 1;
> +				break;
> +			}
> +		}
> +	}
> +	return vd;
> +}
> +
> +static int val_get_group(struct val_iter *vi, struct val_desc *vd)
> +{
> +	for (; vd->name; vd++)
> +		seq_write(&vi->seq, vd->name, strlen(vd->name) + 1);
> +
> +	return val_end_seq(vi, 0);
> +}
> +
> +static bool val_push_prefix(struct val_iter *vi, const char **oldprefix)
> +{
> +	char *newprefix;
> +
> +	newprefix = kmemdup_nul(vi->name, vi->sub - vi->name, GFP_KERNEL);
> +	if (newprefix) {
> +		*oldprefix = vi->prefix;
> +		vi->prefix = newprefix;
> +	}
> +
> +	return newprefix;
> +}
> +
> +static void val_pop_prefix(struct val_iter *vi, const char *oldprefix)
> +{
> +	kfree(vi->prefix);
> +	vi->prefix = oldprefix;
> +}
> +
> +enum {
> +	VAL_MNT_ID,
> +	VAL_MNT_PARENTID,
> +	VAL_MNT_ROOT,
> +	VAL_MNT_MOUNTPOINT,
> +	VAL_MNT_OPTIONS,
> +	VAL_MNT_SHARED,
> +	VAL_MNT_MASTER,
> +	VAL_MNT_PROPAGATE_FROM,
> +	VAL_MNT_UNBINDABLE,
> +	VAL_MNT_NOTFOUND,
> +};
> +
> +static struct val_desc val_mnt_group[] = {
> +	{ .name = "id",			.idx = VAL_MNT_ID		},
> +	{ .name = "parentid",		.idx = VAL_MNT_PARENTID,	},
> +	{ .name = "root",		.idx = VAL_MNT_ROOT,		},
> +	{ .name = "mountpoint",		.idx = VAL_MNT_MOUNTPOINT,	},
> +	{ .name = "options",		.idx = VAL_MNT_OPTIONS,		},
> +	{ .name = "shared",		.idx = VAL_MNT_SHARED,		},
> +	{ .name = "master",		.idx = VAL_MNT_MASTER,		},
> +	{ .name = "propagate_from",	.idx = VAL_MNT_PROPAGATE_FROM,	},
> +	{ .name = "unbindable",		.idx = VAL_MNT_UNBINDABLE,	},
> +	{ .name = NULL,			.idx = VAL_MNT_NOTFOUND		},
> +};
> +
> +static int seq_mnt_root(struct seq_file *seq, struct vfsmount *mnt)
> +{
> +	struct super_block *sb = mnt->mnt_sb;
> +	int err = 0;
> +
> +	if (sb->s_op->show_path) {
> +		err = sb->s_op->show_path(seq, mnt->mnt_root);
> +		if (!err) {
> +			seq_putc(seq, '\0');
> +			if (seq->count < seq->size)
> +				seq->count = string_unescape(seq->buf, seq->buf, seq->size, UNESCAPE_OCTAL);
> +		}
> +	} else {
> +		seq_dentry(seq, mnt->mnt_root, "");
> +	}
> +
> +	return err;
> +}
> +
> +static int val_mnt_show(struct val_iter *vi, struct vfsmount *mnt)
> +{
> +	struct mount *m = real_mount(mnt);
> +	struct path root, mnt_path;
> +	struct val_desc *vd;
> +	const char *oldprefix;
> +	int err = 0;
> +
> +	if (!val_push_prefix(vi, &oldprefix))
> +		return -ENOMEM;
> +
> +	while (!err && vi->num) {
> +		vd = val_lookup(vi, val_mnt_group);
> +		if (!vd)
> +			break;
> +
> +		switch(vd->idx) {
> +		case VAL_MNT_ID:
> +			seq_printf(&vi->seq, "%i", m->mnt_id);
> +			break;
> +		case VAL_MNT_PARENTID:
> +			seq_printf(&vi->seq, "%i", m->mnt_parent->mnt_id);
> +			break;
> +		case VAL_MNT_ROOT:
> +			seq_mnt_root(&vi->seq, mnt);
> +			break;
> +		case VAL_MNT_MOUNTPOINT:
> +			get_fs_root(current->fs, &root);
> +			mnt_path.dentry = mnt->mnt_root;
> +			mnt_path.mnt = mnt;
> +			err = seq_path_root(&vi->seq, &mnt_path, &root, "");
> +			path_put(&root);
> +			break;
> +		case VAL_MNT_OPTIONS:
> +			seq_puts(&vi->seq, mnt->mnt_flags & MNT_READONLY ? "ro" : "rw");
> +			show_mnt_opts(&vi->seq, mnt);
> +			break;
> +		case VAL_MNT_SHARED:
> +			if (IS_MNT_SHARED(m))
> +				seq_printf(&vi->seq, "%i,", m->mnt_group_id);
> +			break;
> +		case VAL_MNT_MASTER:
> +			if (IS_MNT_SLAVE(m))
> +				seq_printf(&vi->seq, "%i,",
> +					   m->mnt_master->mnt_group_id);
> +			break;
> +		case VAL_MNT_PROPAGATE_FROM:
> +			if (IS_MNT_SLAVE(m)) {
> +				int dom;
> +
> +				get_fs_root(current->fs, &root);
> +				dom = get_dominating_id(m, &root);
> +				path_put(&root);
> +				if (dom)
> +					seq_printf(&vi->seq, "%i,", dom);
> +			}
> +			break;
> +		case VAL_MNT_UNBINDABLE:
> +			if (IS_MNT_UNBINDABLE(m))
> +				seq_puts(&vi->seq, "yes");
> +			break;
> +		default:
> +			err = -ENOENT;
> +			break;
> +		}
> +		err = val_end_seq(vi, err);
> +	}
> +	val_pop_prefix(vi, oldprefix);
> +
> +	return err;
> +}
> +
> +static int val_mnt_get(struct val_iter *vi, const struct path *path)
> +{
> +	int err;
> +
> +	if (!vi->sub)
> +		return val_get_group(vi, val_mnt_group);
> +
> +	namespace_lock_read();
> +	err = val_mnt_show(vi, path->mnt);
> +	namespace_unlock_read();
> +
> +	return err;
> +}
> +
> +static int val_mntns_get(struct val_iter *vi, const struct path *path)
> +{
> +	struct mnt_namespace *mnt_ns = current->nsproxy->mnt_ns;
> +	struct vfsmount *mnt;
> +	struct path root;
> +	char *end;
> +	int mnt_id;
> +	int err;
> +
> +	if (!vi->sub) {
> +		get_fs_root(current->fs, &root);
> +		seq_mnt_list(&vi->seq, mnt_ns, &root);
> +		path_put(&root);
> +		return val_end_seq(vi, 0);
> +	}
> +
> +	end = strchr(vi->sub, VAL_GRSEP);
> +	if (end)
> +		*end = '\0';
> +	err = kstrtoint(vi->sub, 10, &mnt_id);
> +	if (err)
> +		return val_err(vi, err);
> +	vi->sub = NULL;
> +	if (end) {
> +		*end = VAL_GRSEP;
> +		vi->sub = end + 1;
> +	}
> +
> +	namespace_lock_read();
> +	get_fs_root(current->fs, &root);
> +	mnt = mnt_lookup_by_id(mnt_ns, &root, mnt_id);
> +	path_put(&root);
> +	if (!mnt) {
> +		namespace_unlock_read();
> +		return val_err(vi, -ENOENT);
> +	}
> +	if (vi->sub)
> +		err = val_mnt_show(vi, mnt);
> +	else
> +		err = val_get_group(vi, val_mnt_group);
> +
> +	namespace_unlock_read();
> +	mntput(mnt);
> +
> +	return err;
> +}
> +
> +static ssize_t val_do_read(struct val_iter *vi, struct path *path)
> +{
> +	ssize_t ret;
> +	struct file *file;
> +	struct open_flags op = {
> +		.open_flag = O_RDONLY,
> +		.acc_mode = MAY_READ,
> +		.intent = LOOKUP_OPEN,
> +	};
> +
> +	file = do_file_open_root(path, "", &op);
> +	if (IS_ERR(file))
> +		return PTR_ERR(file);
> +
> +	ret = vfs_read(file, vi->vec.iov_base, vi->vec.iov_len, NULL);
> +	fput(file);
> +
> +	return ret;
> +}
> +
> +static ssize_t val_do_readlink(struct val_iter *vi, struct path *path)
> +{
> +	int ret;
> +
> +	ret = security_inode_readlink(path->dentry);
> +	if (ret)
> +		return ret;
> +
> +	return vfs_readlink(path->dentry, vi->vec.iov_base, vi->vec.iov_len);
> +}
> +
> +static inline bool dot_or_dotdot(const char *s)
> +{
> +	return s[0] == '.' &&
> +		(s[1] == '/' || s[1] == '\0' ||
> +		 (s[1] == '.' && (s[2] == '/' || s[2] == '\0')));
> +}
> +
> +/*
> + * - empty path is okay
> + * - must not begin or end with slash or have a double slash anywhere
> + * - must not have . or .. components
> + */
> +static bool val_verify_path(const char *subpath)
> +{
> +	const char *s = subpath;
> +
> +	if (s[0] == '\0')
> +		return true;
> +
> +	if (s[0] == '/' || s[strlen(s) - 1] == '/' || strstr(s, "//"))
> +		return false;
> +
> +	for (s--; s; s = strstr(s + 3, "/."))
> +		if (dot_or_dotdot(s + 1))
> +			return false;
> +
> +	return true;
> +}
> +
> +static int val_data_get(struct val_iter *vi, const struct path *path)
> +{
> +	struct path this;
> +	ssize_t ret;
> +
> +	if (!vi->sub)
> +		return val_err(vi, -ENOENT);
> +
> +	if (!val_verify_path(vi->sub))
> +		return val_err(vi, -EINVAL);
> +
> +	ret = vfs_path_lookup(path->dentry, path->mnt, vi->sub,
> +			      LOOKUP_NO_XDEV | LOOKUP_BENEATH |
> +			      LOOKUP_IN_ROOT, &this);
> +	if (ret)
> +		return val_err(vi, ret);
> +
> +	ret = -ENODATA;
> +	if (d_is_reg(this.dentry) || d_is_symlink(this.dentry)) {
> +		if (d_is_reg(this.dentry))
> +			ret = val_do_read(vi, &this);
> +		else
> +			ret = val_do_readlink(vi, &this);
> +	}
> +	path_put(&this);
> +	if (ret == -EFAULT)
> +		return ret;
> +	if (ret < 0)
> +		return val_err(vi, ret);
> +	if (ret == vi->vec.iov_len)
> +		return -EOVERFLOW;
> +
> +	return val_end(vi, ret);
> +}
> +
> +static int val_xattr_get(struct val_iter *vi, const struct path *path)
> +{
> +	ssize_t ret;
> +	struct user_namespace *mnt_userns = mnt_user_ns(path->mnt);
> +	void *value = vi->seq.buf + vi->seq.count;
> +	size_t size = min_t(size_t, vi->seq.size - vi->seq.count,
> +			    XATTR_SIZE_MAX);
> +
> +	if (!vi->sub)
> +		return val_err(vi, -ENOENT);
> +
> +	ret = vfs_getxattr(mnt_userns, path->dentry, vi->sub, value, size);
> +	if (ret < 0)
> +		return val_err(vi, ret);
> +
> +	if ((strcmp(vi->sub, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
> +	    (strcmp(vi->sub, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
> +		posix_acl_fix_xattr_to_user(mnt_userns, value, ret);
> +
> +	vi->seq.count += ret;
> +
> +	return val_end_seq(vi, 0);
> +}
> +
> +
> +static struct val_desc val_toplevel_group[] = {
> +	{ .name = "mnt",	.get = val_mnt_get,	},
> +	{ .name = "mntns",	.get = val_mntns_get,	},
> +	{ .name = "xattr",	.get = val_xattr_get,	},
> +	{ .name = "data",	.get = val_data_get,	},
> +	{ .name = NULL },
> +};
> +
> +SYSCALL_DEFINE5(getvalues,
> +		int, dfd,
> +		const char __user *, pathname,
> +		struct name_val __user *, vec,
> +		size_t, num,
> +		unsigned int, flags)
> +{
> +	char vals[1024];
> +	struct val_iter vi = {
> +		.curr = vec,
> +		.num = num,
> +		.seq.buf = vals,
> +		.bufsize = sizeof(vals),
> +		.prefix = "",
> +	};
> +	struct val_desc *vd;
> +	struct path path = {};
> +	ssize_t err;
> +
> +	err = user_path_at(dfd, pathname, 0, &path);
> +	if (err)
> +		return err;
> +
> +	err = val_get(&vi);
> +	if (err)
> +		goto out;
> +
> +	if (!strlen(vi.name)) {
> +		err = val_get_group(&vi, val_toplevel_group);
> +		goto out;
> +	}
> +	while (!err && vi.num) {
> +		vd = val_lookup(&vi, val_toplevel_group);
> +		if (!vd->name)
> +			err = val_err(&vi, -ENOENT);
> +		else
> +			err = vd->get(&vi, &path);
> +	}
> +out:
> +	if (err == -EOVERFLOW)
> +		err = 0;
> +
> +	path_put(&path);
> +	return err < 0 ? err : num - vi.num;
> +}
> -- 
> 2.35.1
>