On 2024-06-20, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
> The vDSO getrandom() works over an opaque per-thread state of an
> unexported size, which must be marked VM_WIPEONFORK, VM_DONTDUMP,
> VM_NORESERVE, and VM_DROPPABLE for proper operation. Over time, the
> nuances of these allocations may change or grow or even differ based on
> architectural features.
>
> The syscall has the signature:
>
>   void *vgetrandom_alloc(unsigned int *num, unsigned int *size_per_each,
>                          unsigned long addr, unsigned int flags);
>
> This takes a hinted number of opaque states in `num`, and returns a
> pointer to an array of opaque states, the number actually allocated back
> in `num`, and the size in bytes of each one in `size_per_each`, enabling
> a libc to slice up the returned array into a state per each thread,
> while ensuring that no single state straddles a page boundary. (The
> `flags` and `addr` arguments, as well as the `*size_per_each` input
> value, are reserved for the future and are forced to be zero for
> now.)

Given how many flags are going to be reserved at the outset, what about
using an extensible struct (copy_struct_from_user) instead? If you're
absolutely sure you'll never need more arguments that's fine, but it
seems entirely possible to me that you might need an extra argument in a
few years.

Since you need to write to *num in the current syscall, I suspect the
following would be nicer as well.

  struct vgetrandom_args {
      u64 num;
  };

  void *vgetrandom_alloc(struct vgetrandom_args *arg, size_t size);

If you'd prefer to have flags from the outset (even though you could
extend them later without issues), then

  struct vgetrandom_args {
      u64 flags;
      u64 num;
  };

would also work.

Then again, I guess since libc is planned to be the primary user,
creating a new syscall in a decade if necessary is probably not that big
of an issue.

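To make the extensible-struct idea concrete, here is a rough userspace
model of the size rules that copy_struct_from_user() applies. It is only a
sketch: copy_struct_sketch() is a hypothetical name, the struct layout is
assumed from the second suggestion above, and uint64_t stands in for u64.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical argument struct, following the second layout suggested above. */
struct vgetrandom_args {
	uint64_t flags;
	uint64_t num;
};

/*
 * Userspace model of the copy_struct_from_user() size rules:
 *  - usize < ksize: older caller; copy what it passed, zero-fill the rest.
 *  - usize > ksize: newer caller; accept only if every extra byte is zero.
 * Returns 0 on success, or -E2BIG if unknown trailing fields are non-zero.
 */
static int copy_struct_sketch(void *dst, size_t ksize,
			      const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *extra = (const unsigned char *)src + size;

	assert(dst && src);

	/* Bytes beyond what this "kernel" understands must all be zero. */
	for (size_t i = 0; i < usize - size; ++i) {
		if (extra[i])
			return -E2BIG;
	}
	memset(dst, 0, ksize);
	memcpy(dst, src, size);
	return 0;
}
```

With this shape, an old binary passing a shorter struct keeps working, and
a new binary on an old kernel only fails if it actually sets a field that
the kernel does not know about.
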
> Libc is expected to allocate a chunk of these on first use, and then
> dole them out to threads as they're created, allocating more when
> needed. The returned address of the first state may be passed to
> munmap(2) with a length of `DIV_ROUND_UP(num, PAGE_SIZE / size_per_each)
> * PAGE_SIZE`, in order to deallocate the memory.
>
> We very intentionally do *not* leave state allocation for vDSO
> getrandom() up to userspace itself, but rather provide this new syscall
> for such allocations. vDSO getrandom() must not store its state in just
> any old memory address, but rather just ones that the kernel specially
> allocates for it, leaving the particularities of those allocations up to
> the kernel.
>
> The allocation of states is intended to be integrated into libc's thread
> management. As an illustrative example, the following code might be used
> to do the same outside of libc. Though, vgetrandom_alloc() is not
> expected to be exposed outside of libc, and the pthread usage here is
> expected to be elided into libc internals. This allocation scheme is
> very naive and does not shrink; other implementations may choose to be
> more complex.
>
>   static void *vgetrandom_alloc(unsigned int *num, unsigned int *size_per_each)
>   {
>     *size_per_each = 0; /* Must be zero on input. */
>     return (void *)syscall(__NR_vgetrandom_alloc, &num, &size_per_each,
>                            0 /* reserved @addr */, 0 /* reserved @flags */);
>   }
>
>   static struct {
>     pthread_mutex_t lock;
>     void **states;
>     size_t len, cap, size_per_each;
>   } grnd_allocator = {
>     .lock = PTHREAD_MUTEX_INITIALIZER
>   };
>
>   static void *vgetrandom_get_state(void)
>   {
>     void *state = NULL;
>
>     pthread_mutex_lock(&grnd_allocator.lock);
>     if (!grnd_allocator.len) {
>       size_t new_cap;
>       size_t page_size = getpagesize();
>       unsigned int num = sysconf(_SC_NPROCESSORS_ONLN); /* Could be arbitrary, just a hint. */
>       unsigned int size_per_each;
>       void *new_block = vgetrandom_alloc(&num, &size_per_each);
>       void *new_states;
>
>       if (new_block == MAP_FAILED)
>         goto out;
>       if (grnd_allocator.size_per_each && grnd_allocator.size_per_each != size_per_each)
>         goto unmap;
>       grnd_allocator.size_per_each = size_per_each;
>       new_cap = grnd_allocator.cap + num;
>       new_states = reallocarray(grnd_allocator.states, new_cap, sizeof(*grnd_allocator.states));
>       if (!new_states)
>         goto unmap;
>       grnd_allocator.cap = new_cap;
>       grnd_allocator.states = new_states;
>
>       for (size_t i = 0; i < num; ++i) {
>         grnd_allocator.states[i] = new_block;
>         if (((uintptr_t)new_block & (page_size - 1)) + size_per_each > page_size)
>           new_block = (void *)(((uintptr_t)new_block + page_size) & (page_size - 1));
>         else
>           new_block += size_per_each;
>       }
>       grnd_allocator.len = num;
>       goto success;
>
>     unmap:
>       munmap(new_block, DIV_ROUND_UP(num, page_size / size_per_each) * page_size);
>       goto out;
>     }
>   success:
>     state = grnd_allocator.states[--grnd_allocator.len];
>
>   out:
>     pthread_mutex_unlock(&grnd_allocator.lock);
>     return state;
>   }
>
>   static void vgetrandom_put_state(void *state)
>   {
>     if (!state)
>       return;
>     pthread_mutex_lock(&grnd_allocator.lock);
>     grnd_allocator.states[grnd_allocator.len++] = state;
>     pthread_mutex_unlock(&grnd_allocator.lock);
>   }
>
> Signed-off-by: Jason A. Donenfeld <Jason@xxxxxxxxx>
> ---
>  MAINTAINERS              |   1 +
>  drivers/char/random.c    | 135 ++++++++++++++++++++++++++++++++++++++-
>  include/linux/syscalls.h |   3 +
>  include/vdso/getrandom.h |  16 +++++
>  kernel/sys_ni.c          |   3 +
>  lib/vdso/Kconfig         |   6 ++
>  6 files changed, 163 insertions(+), 1 deletion(-)
>  create mode 100644 include/vdso/getrandom.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 8aa17e515ef3..8480c4c39915 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -18747,6 +18747,7 @@ T:	git https://git.kernel.org/pub/scm/linux/kernel/git/crng/random.git
>  F:	Documentation/devicetree/bindings/rng/microsoft,vmgenid.yaml
>  F:	drivers/char/random.c
>  F:	drivers/virt/vmgenid.c
> +F:	include/vdso/getrandom.h
>
>  RAPIDIO SUBSYSTEM
>  M:	Matt Porter <mporter@xxxxxxxxxxxxxxxxxxx>
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index 2597cb43f438..ccb35f390c85 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -1,6 +1,6 @@
>  // SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
>  /*
> - * Copyright (C) 2017-2022 Jason A. Donenfeld <Jason@xxxxxxxxx>. All Rights Reserved.
> + * Copyright (C) 2017-2024 Jason A. Donenfeld <Jason@xxxxxxxxx>. All Rights Reserved.
>   * Copyright Matt Mackall <mpm@xxxxxxxxxxx>, 2003, 2004, 2005
>   * Copyright Theodore Ts'o, 1994, 1995, 1996, 1997, 1998, 1999. All rights reserved.
>   *
> @@ -8,6 +8,7 @@
>   * into roughly six sections, each with a section header:
>   *
>   * - Initialization and readiness waiting.
> + * - vDSO support helpers.
>   * - Fast key erasure RNG, the "crng".
>   * - Entropy accumulation and extraction routines.
>   * - Entropy collection routines.
> @@ -39,6 +40,7 @@
>  #include <linux/blkdev.h>
>  #include <linux/interrupt.h>
>  #include <linux/mm.h>
> +#include <linux/mman.h>
>  #include <linux/nodemask.h>
>  #include <linux/spinlock.h>
>  #include <linux/kthread.h>
> @@ -56,6 +58,9 @@
>  #include <linux/sched/isolation.h>
>  #include <crypto/chacha.h>
>  #include <crypto/blake2s.h>
> +#ifdef CONFIG_VDSO_GETRANDOM
> +#include <vdso/getrandom.h>
> +#endif
>  #include <asm/archrandom.h>
>  #include <asm/processor.h>
>  #include <asm/irq.h>
> @@ -169,6 +174,134 @@ int __cold execute_with_initialized_rng(struct notifier_block *nb)
>  				__func__, (void *)_RET_IP_, crng_init)
>
>
> +
> +/********************************************************************
> + *
> + * vDSO support helpers.
> + *
> + * The actual vDSO function is defined over in lib/vdso/getrandom.c,
> + * but this section contains the kernel-mode helpers to support that.
> + *
> + ********************************************************************/
> +
> +#ifdef CONFIG_VDSO_GETRANDOM
> +/**
> + * sys_vgetrandom_alloc - Allocate opaque states for use with vDSO getrandom().
> + *
> + * @num:	   On input, a pointer to a suggested hint of how many states to
> + * 		   allocate, and on return the number of states actually allocated.
> + *
> + * @size_per_each: On input, must be zero. On return, the size of each state allocated,
> + * 		   so that the caller can split up the returned allocation into
> + * 		   individual states.
> + *
> + * @addr:	   Reserved, must be zero.
> + *
> + * @flags:	   Reserved, must be zero.
> + *
> + * The getrandom() vDSO function in userspace requires an opaque state, which
> + * this function allocates by mapping a certain number of special pages into
> + * the calling process. It takes a hint as to the number of opaque states
> + * desired, and provides the caller with the number of opaque states actually
> + * allocated, the size of each one in bytes, and the address of the first
> + * state, which may be split up into @num states of @size_per_each bytes each,
> + * by adding @size_per_each to the returned first state @num times, while
> + * ensuring that no single state straddles a page boundary.
> + *
> + * Returns the address of the first state in the allocation on success, or a
> + * negative error value on failure.
> + *
> + * The returned address of the first state may be passed to munmap(2) with a
> + * length of `DIV_ROUND_UP(num, PAGE_SIZE / size_per_each) * PAGE_SIZE`, in
> + * order to deallocate the memory, after which it is invalid to pass it to vDSO
> + * getrandom().
> + *
> + * States allocated by this function must not be dereferenced, written, read,
> + * or otherwise manipulated. The *only* supported operations are:
> + *   - Splitting up the states in intervals of @size_per_each, no more than
> + *     @num times from the first state, while ensuring that no single state
> + *     straddles a page boundary.
> + *   - Passing a state to the getrandom() vDSO function's @opaque_state
> + *     parameter, but not passing the same state at the same time to two such
> + *     calls.
> + *   - Passing the first state and the total length to munmap(2), as described
> + *     above.
> + * All other uses are undefined behavior, which is subject to change or removal.
> + */
> +SYSCALL_DEFINE4(vgetrandom_alloc, unsigned int __user *, num,
> +		unsigned int __user *, size_per_each, unsigned long, addr,
> +		unsigned int, flags)
> +{
> +	size_t state_size, alloc_size, num_states;
> +	unsigned long pages_addr, populate;
> +	unsigned int num_hint;
> +	vm_flags_t vm_flags;
> +	int ret;
> +
> +	/*
> +	 * @flags and @addr are currently unused, so in order to reserve them
> +	 * for the future, force them to be set to zero by current callers.
> +	 */
> +	if (flags || addr)
> +		return -EINVAL;
> +
> +	/*
> +	 * Also enforce that *size_per_each is zero on input, in case this becomes
> +	 * useful later on.
> +	 */
> +	if (get_user(num_hint, size_per_each))
> +		return -EFAULT;
> +	if (num_hint)
> +		return -EINVAL;
> +
> +	if (get_user(num_hint, num))
> +		return -EFAULT;
> +
> +	state_size = sizeof(struct vgetrandom_state);
> +	num_states = clamp_t(size_t, num_hint, 1, (SIZE_MAX & PAGE_MASK) / state_size);
> +	alloc_size = PAGE_ALIGN(num_states * state_size);
> +	/*
> +	 * States cannot straddle page boundaries, so calculate the number of
> +	 * states that can fit inside of a page without being split, and then
> +	 * multiply that out by the number of pages allocated.
> +	 */
> +	num_states = (PAGE_SIZE / state_size) * (alloc_size / PAGE_SIZE);
> +
> +	vm_flags =
> +		/*
> +		 * Don't allow state to be written to swap, to preserve forward secrecy.
> +		 * But also don't mlock it or pre-reserve it, and allow it to
> +		 * be discarded under memory pressure. If no memory is available, returns
> +		 * zeros rather than segfaulting.
> +		 */
> +		VM_DROPPABLE | VM_NORESERVE |
> +
> +		/* Don't allow the state to survive forks, to prevent random number re-use. */
> +		VM_WIPEONFORK |
> +
> +		/* Don't write random state into coredumps. */
> +		VM_DONTDUMP;
> +
> +	if (mmap_write_lock_killable(current->mm))
> +		return -EINTR;
> +	pages_addr = do_mmap(NULL, 0, alloc_size, PROT_READ | PROT_WRITE,
> +			     MAP_PRIVATE | MAP_ANONYMOUS, vm_flags, 0, &populate, NULL);
> +	mmap_write_unlock(current->mm);
> +	if (IS_ERR_VALUE(pages_addr))
> +		return pages_addr;
> +
> +	ret = -EFAULT;
> +	if (put_user(num_states, num) || put_user(state_size, size_per_each))
> +		goto err_unmap;
> +
> +	return pages_addr;
> +
> +err_unmap:
> +	vm_munmap(pages_addr, alloc_size);
> +	return ret;
> +}
> +#endif
> +
>  /*********************************************************************
>   *
>   * Fast key erasure RNG, the "crng".
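
As a side note on the arithmetic used above, the following is a small
userspace sketch of the "no state straddles a page" rule and of the
munmap(2) length formula from the kernel-doc comment. The helper names and
the sample state size of 144 bytes are assumptions for illustration only
(the real size is whatever the kernel returns in `size_per_each`). One
thing to watch in the commit-message example: rounding up to a page start
appears to need the complement mask `~(page_size - 1)`, which is what this
sketch uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Number of whole states that fit in `pages` pages when none may straddle. */
static size_t states_in_pages(size_t pages, size_t page_size,
			      size_t size_per_each)
{
	return (page_size / size_per_each) * pages;
}

/* Length to hand to munmap(2) for an allocation holding `num` states. */
static size_t unmap_len(size_t num, size_t page_size, size_t size_per_each)
{
	return DIV_ROUND_UP(num, page_size / size_per_each) * page_size;
}

/*
 * Address of the state following `state`. If the next state would cross a
 * page boundary, skip ahead to the start of the next page instead.
 */
static uintptr_t next_state(uintptr_t state, size_t page_size,
			    size_t size_per_each)
{
	uintptr_t next = state + size_per_each;

	assert(size_per_each > 0 && size_per_each <= page_size);

	if ((next & (page_size - 1)) + size_per_each > page_size)
		next = (next + page_size - 1) & ~(uintptr_t)(page_size - 1);
	return next;
}
```

With 4096-byte pages and the assumed 144-byte states, each page holds 28
states and the remaining 64 bytes at the end of the page are skipped.
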
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 9104952d323d..56368ea4f510 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -906,6 +906,9 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
>  					void __user *uargs);
>  asmlinkage long sys_getrandom(char __user *buf, size_t count,
>  			      unsigned int flags);
> +asmlinkage long sys_vgetrandom_alloc(unsigned int __user *num,
> +				     unsigned int __user *size_per_each,
> +				     unsigned long addr, unsigned int flags);
>  asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
>  asmlinkage long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size);
>  asmlinkage long sys_execveat(int dfd, const char __user *filename,
> diff --git a/include/vdso/getrandom.h b/include/vdso/getrandom.h
> new file mode 100644
> index 000000000000..69037519d20b
> --- /dev/null
> +++ b/include/vdso/getrandom.h
> @@ -0,0 +1,16 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2022-2024 Jason A. Donenfeld <Jason@xxxxxxxxx>. All Rights Reserved.
> + */
> +
> +#ifndef _VDSO_GETRANDOM_H
> +#define _VDSO_GETRANDOM_H
> +
> +/**
> + * struct vgetrandom_state - State used by vDSO getrandom() and allocated by vgetrandom_alloc().
> + *
> + * Currently empty, as the vDSO getrandom() function has not yet been implemented.
> + */
> +struct vgetrandom_state { int placeholder; };
> +
> +#endif /* _VDSO_GETRANDOM_H */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index d7eee421d4bc..6b17fadb0f59 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -272,6 +272,9 @@ COND_SYSCALL(pkey_free);
>  /* memfd_secret */
>  COND_SYSCALL(memfd_secret);
>
> +/* random */
> +COND_SYSCALL(vgetrandom_alloc);
> +
>  /*
>   * Architecture specific weak syscall entries.
>   */
> diff --git a/lib/vdso/Kconfig b/lib/vdso/Kconfig
> index c46c2300517c..99661b731834 100644
> --- a/lib/vdso/Kconfig
> +++ b/lib/vdso/Kconfig
> @@ -38,3 +38,9 @@ config GENERIC_VDSO_OVERFLOW_PROTECT
>  	  in the hotpath.
>
>  endif
> +
> +config VDSO_GETRANDOM
> +	bool
> +	select NEED_VM_DROPPABLE
> +	help
> +	  Selected by architectures that support vDSO getrandom().
> --
> 2.45.2
>
>

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>