Re: [PATCH v18 2/5] random: add vgetrandom_alloc() syscall

On 2024-06-20, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
> The vDSO getrandom() works over an opaque per-thread state of an
> unexported size, which must be marked VM_WIPEONFORK, VM_DONTDUMP,
> VM_NORESERVE, and VM_DROPPABLE for proper operation. Over time, the
> nuances of these allocations may change or grow or even differ based on
> architectural features.
> 
> The syscall has the signature:
> 
>   void *vgetrandom_alloc(unsigned int *num, unsigned int *size_per_each,
>                          unsigned long addr, unsigned int flags);
> 
> This takes a hinted number of opaque states in `num`, and returns a
> pointer to an array of opaque states, the number actually allocated back
> in `num`, and the size in bytes of each one in `size_per_each`, enabling
> a libc to slice up the returned array into a state per each thread,
> while ensuring that no single state straddles a page boundary. (The
> `flags` and `addr` arguments, as well as the `*size_per_each` input
> value, are reserved for the future and are forced to be zero for
> now.)

Given how many flags are going to be reserved at the outset, what about
using an extensible struct (copy_struct_from_user) instead? If you're
absolutely sure you'll never need more arguments that's fine, but it
seems entirely possible to me that you might need an extra argument in a
few years.

Since you need to write to *num in the current syscall, I suspect the
following would be nicer as well.

   struct vgetrandom_args {
           u64 num;
   };

   void *vgetrandom_alloc(struct vgetrandom_args *arg, size_t size);

If you'd prefer to have flags from the outset (even though an extensible
struct would let you add them later without issues), then

   struct vgetrandom_args {
           u64 flags;
           u64 num;
   };

would also work.

Then again, I guess since libc is planned to be the primary user,
creating a new syscall in a decade if necessary is probably not that big
of an issue.
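
To make the extensible-struct idea concrete, here is a minimal userspace
sketch of the contract that copy_struct_from_user() enforces: a shorter
(older) struct is zero-extended, and a longer (newer) struct is accepted only
if the bytes the kernel does not know about are all zero. The helper name
copy_struct_model() is hypothetical, the struct layout is the second one
proposed above, and the real kernel helper additionally handles user-memory
faults (-EFAULT), which this model leaves out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical argument struct, following the second layout proposed above. */
struct vgetrandom_args {
	uint64_t flags;
	uint64_t num;
};

/*
 * Userspace model of the copy_struct_from_user() contract (an illustration,
 * not the kernel implementation):
 *  - usize < ksize: older caller; copy what it passed and zero the rest.
 *  - usize > ksize: newer caller; accept only if the extra bytes are zero.
 * Returns 0 on success, or -E2BIG if unknown trailing bytes are non-zero.
 */
static int copy_struct_model(void *dst, size_t ksize,
			     const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *tail = (const unsigned char *)src + size;

	/* usize - size is zero unless the caller's struct is the larger one. */
	for (size_t i = 0; i < usize - size; ++i)
		if (tail[i])
			return -E2BIG;

	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}
```

With this contract, a caller built against a one-field struct keeps working
after the kernel grows a second field, and a newer caller on an older kernel
only fails if it actually sets a field that kernel does not understand.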

> Libc is expected to allocate a chunk of these on first use, and then
> dole them out to threads as they're created, allocating more when
> needed. The returned address of the first state may be passed to
> munmap(2) with a length of `DIV_ROUND_UP(num, PAGE_SIZE / size_per_each)
> * PAGE_SIZE`, in order to deallocate the memory.
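
As a quick check of the munmap(2) length formula quoted above, the
calculation can be sketched as a small helper. The function name
vgetrandom_munmap_len() is hypothetical, and the sketch assumes
size_per_each is non-zero and no larger than the page size (otherwise the
per-page count would be zero).

```c
#include <stddef.h>

/* Round-up integer division, mirroring the kernel's DIV_ROUND_UP(). */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Length to pass to munmap(2) for an allocation of `num` states.
 * States never straddle a page, so each page holds only
 * page_size / size_per_each of them (rounded down) and the
 * remainder of each page is unused padding.
 */
static size_t vgetrandom_munmap_len(size_t num, size_t size_per_each,
				    size_t page_size)
{
	size_t per_page = page_size / size_per_each;

	return DIV_ROUND_UP(num, per_page) * page_size;
}
```

For example, with an assumed 144-byte state and 4096-byte pages, 28 states
fit in one page, so 28 states need 4096 bytes and 29 states need 8192.
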
> 
> We very intentionally do *not* leave state allocation for vDSO
> getrandom() up to userspace itself, but rather provide this new syscall
> for such allocations. vDSO getrandom() must not store its state in just
> any old memory address, but rather just ones that the kernel specially
> allocates for it, leaving the particularities of those allocations up to
> the kernel.
> 
> The allocation of states is intended to be integrated into libc's thread
> management. As an illustrative example, the following code might be used
> to do the same outside of libc. Though, vgetrandom_alloc() is not
> expected to be exposed outside of libc, and the pthread usage here is
> expected to be elided into libc internals. This allocation scheme is
> very naive and does not shrink; other implementations may choose to be
> more complex.
> 
>   static void *vgetrandom_alloc(unsigned int *num, unsigned int *size_per_each)
>   {
>     *size_per_each = 0; /* Must be zero on input. */
>     /* num and size_per_each are already pointers; pass them through as-is. */
>     return (void *)syscall(__NR_vgetrandom_alloc, num, size_per_each,
>                            0 /* reserved @addr */, 0 /* reserved @flags */);
>   }
> 
>   static struct {
>     pthread_mutex_t lock;
>     void **states;
>     size_t len, cap, size_per_each;
>   } grnd_allocator = {
>     .lock = PTHREAD_MUTEX_INITIALIZER
>   };
> 
>   static void *vgetrandom_get_state(void)
>   {
>     void *state = NULL;
> 
>     pthread_mutex_lock(&grnd_allocator.lock);
>     if (!grnd_allocator.len) {
>       size_t new_cap;
>       size_t page_size = getpagesize();
>       unsigned int num = sysconf(_SC_NPROCESSORS_ONLN); /* Could be arbitrary, just a hint. */
>       unsigned int size_per_each;
>       void *new_block = vgetrandom_alloc(&num, &size_per_each);
>       void *new_states;
> 
>       if (new_block == MAP_FAILED)
>         goto out;
>       if (grnd_allocator.size_per_each && grnd_allocator.size_per_each != size_per_each)
>         goto unmap;
>       grnd_allocator.size_per_each = size_per_each;
>       new_cap = grnd_allocator.cap + num;
>       new_states = reallocarray(grnd_allocator.states, new_cap, sizeof(*grnd_allocator.states));
>       if (!new_states)
>         goto unmap;
>       grnd_allocator.cap = new_cap;
>       grnd_allocator.states = new_states;
> 
>       for (size_t i = 0; i < num; ++i) {
>         /* Skip to the next page if this state would straddle a page boundary. */
>         if (((uintptr_t)new_block & (page_size - 1)) + size_per_each > page_size)
>           new_block = (void *)(((uintptr_t)new_block + page_size - 1) & ~(page_size - 1));
>         grnd_allocator.states[i] = new_block;
>         new_block += size_per_each;
>       }
>       grnd_allocator.len = num;
>       goto success;
> 
>     unmap:
>       munmap(new_block, DIV_ROUND_UP(num, page_size / size_per_each) * page_size);
>       goto out;
>     }
>   success:
>     state = grnd_allocator.states[--grnd_allocator.len];
> 
>   out:
>     pthread_mutex_unlock(&grnd_allocator.lock);
>     return state;
>   }
> 
>   static void vgetrandom_put_state(void *state)
>   {
>     if (!state)
>       return;
>     pthread_mutex_lock(&grnd_allocator.lock);
>     grnd_allocator.states[grnd_allocator.len++] = state;
>     pthread_mutex_unlock(&grnd_allocator.lock);
>   }
> 
> Signed-off-by: Jason A. Donenfeld <Jason@xxxxxxxxx>
> ---
>  MAINTAINERS              |   1 +
>  drivers/char/random.c    | 135 ++++++++++++++++++++++++++++++++++++++-
>  include/linux/syscalls.h |   3 +
>  include/vdso/getrandom.h |  16 +++++
>  kernel/sys_ni.c          |   3 +
>  lib/vdso/Kconfig         |   6 ++
>  6 files changed, 163 insertions(+), 1 deletion(-)
>  create mode 100644 include/vdso/getrandom.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 8aa17e515ef3..8480c4c39915 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -18747,6 +18747,7 @@ T:	git https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git
>  F:	Documentation/devicetree/bindings/rng/microsoft,vmgenid.yaml
>  F:	drivers/char/random.c
>  F:	drivers/virt/vmgenid.c
> +F:	include/vdso/getrandom.h
>  
>  RAPIDIO SUBSYSTEM
>  M:	Matt Porter <mporter@xxxxxxxxxxxxxxxxxxx>
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index 2597cb43f438..ccb35f390c85 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -1,6 +1,6 @@
>  // SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
>  /*
> - * Copyright (C) 2017-2022 Jason A. Donenfeld <Jason@xxxxxxxxx>. All Rights Reserved.
> + * Copyright (C) 2017-2024 Jason A. Donenfeld <Jason@xxxxxxxxx>. All Rights Reserved.
>   * Copyright Matt Mackall <mpm@xxxxxxxxxxx>, 2003, 2004, 2005
>   * Copyright Theodore Ts'o, 1994, 1995, 1996, 1997, 1998, 1999. All rights reserved.
>   *
> @@ -8,6 +8,7 @@
>   * into roughly six sections, each with a section header:
>   *
>   *   - Initialization and readiness waiting.
> + *   - vDSO support helpers.
>   *   - Fast key erasure RNG, the "crng".
>   *   - Entropy accumulation and extraction routines.
>   *   - Entropy collection routines.
> @@ -39,6 +40,7 @@
>  #include <linux/blkdev.h>
>  #include <linux/interrupt.h>
>  #include <linux/mm.h>
> +#include <linux/mman.h>
>  #include <linux/nodemask.h>
>  #include <linux/spinlock.h>
>  #include <linux/kthread.h>
> @@ -56,6 +58,9 @@
>  #include <linux/sched/isolation.h>
>  #include <crypto/chacha.h>
>  #include <crypto/blake2s.h>
> +#ifdef CONFIG_VDSO_GETRANDOM
> +#include <vdso/getrandom.h>
> +#endif
>  #include <asm/archrandom.h>
>  #include <asm/processor.h>
>  #include <asm/irq.h>
> @@ -169,6 +174,134 @@ int __cold execute_with_initialized_rng(struct notifier_block *nb)
>  				__func__, (void *)_RET_IP_, crng_init)
>  
>  
> +
> +/********************************************************************
> + *
> + * vDSO support helpers.
> + *
> + * The actual vDSO function is defined over in lib/vdso/getrandom.c,
> + * but this section contains the kernel-mode helpers to support that.
> + *
> + ********************************************************************/
> +
> +#ifdef CONFIG_VDSO_GETRANDOM
> +/**
> + * sys_vgetrandom_alloc - Allocate opaque states for use with vDSO getrandom().
> + *
> + * @num:	   On input, a pointer to a suggested hint of how many states to
> + * 		   allocate, and on return the number of states actually allocated.
> + *
> + * @size_per_each: On input, must be zero. On return, the size of each state allocated,
> + * 		   so that the caller can split up the returned allocation into
> + * 		   individual states.
> + *
> + * @addr:	   Reserved, must be zero.
> + *
> + * @flags:	   Reserved, must be zero.
> + *
> + * The getrandom() vDSO function in userspace requires an opaque state, which
> + * this function allocates by mapping a certain number of special pages into
> + * the calling process. It takes a hint as to the number of opaque states
> + * desired, and provides the caller with the number of opaque states actually
> + * allocated, the size of each one in bytes, and the address of the first
> + * state, which may be split up into @num states of @size_per_each bytes each,
> + * by adding @size_per_each to the returned first state @num times, while
> + * ensuring that no single state straddles a page boundary.
> + *
> + * Returns the address of the first state in the allocation on success, or a
> + * negative error value on failure.
> + *
> + * The returned address of the first state may be passed to munmap(2) with a
> + * length of `DIV_ROUND_UP(num, PAGE_SIZE / size_per_each) * PAGE_SIZE`, in
> + * order to deallocate the memory, after which it is invalid to pass it to vDSO
> + * getrandom().
> + *
> + * States allocated by this function must not be dereferenced, written, read,
> + * or otherwise manipulated. The *only* supported operations are:
> + *   - Splitting up the states in intervals of @size_per_each, no more than
> + *     @num times from the first state, while ensuring that no single state
> + *     straddles a page boundary.
> + *   - Passing a state to the getrandom() vDSO function's @opaque_state
> + *     parameter, but not passing the same state at the same time to two such
> + *     calls.
> + *   - Passing the first state and the total length to munmap(2), as described
> + *     above.
> + * All other uses are undefined behavior, which is subject to change or removal.
> + */
> +SYSCALL_DEFINE4(vgetrandom_alloc, unsigned int __user *, num,
> +		unsigned int __user *, size_per_each, unsigned long, addr,
> +		unsigned int, flags)
> +{
> +	size_t state_size, alloc_size, num_states;
> +	unsigned long pages_addr, populate;
> +	unsigned int num_hint;
> +	vm_flags_t vm_flags;
> +	int ret;
> +
> +	/*
> +	 * @flags and @addr are currently unused, so in order to reserve them
> +	 * for the future, force them to be set to zero by current callers.
> +	 */
> +	if (flags || addr)
> +		return -EINVAL;
> +
> +	/*
> +	 * Also enforce that *size_per_each is zero on input, in case this becomes
> +	 * useful later on.
> +	 */
> +	if (get_user(num_hint, size_per_each))
> +		return -EFAULT;
> +	if (num_hint)
> +		return -EINVAL;
> +
> +	if (get_user(num_hint, num))
> +		return -EFAULT;
> +
> +	state_size = sizeof(struct vgetrandom_state);
> +	num_states = clamp_t(size_t, num_hint, 1, (SIZE_MAX & PAGE_MASK) / state_size);
> +	alloc_size = PAGE_ALIGN(num_states * state_size);
> +	/*
> +	 * States cannot straddle page boundaries, so calculate the number of
> +	 * states that can fit inside of a page without being split, and then
> +	 * multiply that out by the number of pages allocated.
> +	 */
> +	num_states = (PAGE_SIZE / state_size) * (alloc_size / PAGE_SIZE);
> +
> +	vm_flags =
> +		/*
> +		 * Don't allow state to be written to swap, to preserve forward secrecy.
> +		 * But also don't mlock it or pre-reserve it, and allow it to
> +		 * be discarded under memory pressure. If no memory is available, returns
> +		 * zeros rather than segfaulting.
> +		 */
> +		VM_DROPPABLE | VM_NORESERVE |
> +
> +		/* Don't allow the state to survive forks, to prevent random number re-use. */
> +		VM_WIPEONFORK |
> +
> +		/* Don't write random state into coredumps. */
> +		VM_DONTDUMP;
> +
> +	if (mmap_write_lock_killable(current->mm))
> +		return -EINTR;
> +	pages_addr = do_mmap(NULL, 0, alloc_size, PROT_READ | PROT_WRITE,
> +			     MAP_PRIVATE | MAP_ANONYMOUS, vm_flags, 0, &populate, NULL);
> +	mmap_write_unlock(current->mm);
> +	if (IS_ERR_VALUE(pages_addr))
> +		return pages_addr;
> +
> +	ret = -EFAULT;
> +	if (put_user(num_states, num) || put_user(state_size, size_per_each))
> +		goto err_unmap;
> +
> +	return pages_addr;
> +
> +err_unmap:
> +	vm_munmap(pages_addr, alloc_size);
> +	return ret;
> +}
> +#endif
> +
>  /*********************************************************************
>   *
>   * Fast key erasure RNG, the "crng".
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 9104952d323d..56368ea4f510 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -906,6 +906,9 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
>  			    void __user *uargs);
>  asmlinkage long sys_getrandom(char __user *buf, size_t count,
>  			      unsigned int flags);
> +asmlinkage long sys_vgetrandom_alloc(unsigned int __user *num,
> +				     unsigned int __user *size_per_each,
> +				     unsigned long addr, unsigned int flags);
>  asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
>  asmlinkage long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size);
>  asmlinkage long sys_execveat(int dfd, const char __user *filename,
> diff --git a/include/vdso/getrandom.h b/include/vdso/getrandom.h
> new file mode 100644
> index 000000000000..69037519d20b
> --- /dev/null
> +++ b/include/vdso/getrandom.h
> @@ -0,0 +1,16 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2022-2024 Jason A. Donenfeld <Jason@xxxxxxxxx>. All Rights Reserved.
> + */
> +
> +#ifndef _VDSO_GETRANDOM_H
> +#define _VDSO_GETRANDOM_H
> +
> +/**
> + * struct vgetrandom_state - State used by vDSO getrandom() and allocated by vgetrandom_alloc().
> + *
> + * Currently empty, as the vDSO getrandom() function has not yet been implemented.
> + */
> +struct vgetrandom_state { int placeholder; };
> +
> +#endif /* _VDSO_GETRANDOM_H */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index d7eee421d4bc..6b17fadb0f59 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -272,6 +272,9 @@ COND_SYSCALL(pkey_free);
>  /* memfd_secret */
>  COND_SYSCALL(memfd_secret);
>  
> +/* random */
> +COND_SYSCALL(vgetrandom_alloc);
> +
>  /*
>   * Architecture specific weak syscall entries.
>   */
> diff --git a/lib/vdso/Kconfig b/lib/vdso/Kconfig
> index c46c2300517c..99661b731834 100644
> --- a/lib/vdso/Kconfig
> +++ b/lib/vdso/Kconfig
> @@ -38,3 +38,9 @@ config GENERIC_VDSO_OVERFLOW_PROTECT
>  	  in the hotpath.
>  
>  endif
> +
> +config VDSO_GETRANDOM
> +	bool
> +	select NEED_VM_DROPPABLE
> +	help
> +	  Selected by architectures that support vDSO getrandom().
> -- 
> 2.45.2
> 
> 

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>