> On May 28, 2024, at 5:25 AM, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote: > > Provide a generic C vDSO getrandom() implementation, which operates on > an opaque state returned by vgetrandom_alloc() and produces random bytes > the same way as getrandom(). This has a the API signature: > > ssize_t vgetrandom(void *buffer, size_t len, unsigned int flags, void *opaque_state); > +/** > + * type vdso_kernel_ulong - unsigned long type that matches kernel's unsigned long > + * > + * Data shared between userspace and the kernel must operate the same way in both 64-bit code and in > + * 32-bit compat code, over the same potentially 64-bit kernel. This type represents the size of an > + * unsigned long as used by kernel code. This isn't necessarily the same as an unsigned long as used > + * by userspace, however. Why is this better than using plain u64? It’s certainly more complicated. It also rather fundamentally breaks CRIU on 32-bit userspace (although CRIU may well be unable to keep vgetrandom working after a restore onto a different kernel anyway). Admittedly 32-bit userspace is a slowly dying breed, but still. > + * > + * +-------------------+-------------------+------------------+-------------------+ > + * | 32-bit userspace | 32-bit userspace | 64-bit userspace | 64-bit userspace | > + * | unsigned long | vdso_kernel_ulong | unsigned long | vdso_kernel_ulong | > + * +---------------+-------------------+-------------------+------------------+-------------------+ > + * | 32-bit kernel | ✓ same size | ✓ same size | > + * | unsigned long | | | > + * +---------------+-------------------+-------------------+------------------+-------------------+ > + * | 64-bit kernel | ✘ different size! | ✓ same size | ✓ same size | ✓ same size | > + * | unsigned long | | | | | > + * +---------------+-------------------+-------------------+------------------+-------------------+ > + */ > +#ifdef CONFIG_64BIT > +typedef u64 vdso_kernel_ulong; > +#else > +typedef u32 vdso_kernel_ulong; > +#endif > + > +#endif /* __VDSO_TYPES_H */ > diff --git a/lib/vdso/getrandom.c b/lib/vdso/getrandom.c > new file mode 100644 > index 000000000000..4d9bb59985f8 > --- /dev/null > +++ b/lib/vdso/getrandom.c > @@ -0,0 +1,226 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Copyright (C) 2022 Jason A. Donenfeld <Jason@xxxxxxxxx>. All Rights Reserved. > + */ > + > +#include <linux/cache.h> > +#include <linux/kernel.h> > +#include <linux/time64.h> > +#include <vdso/datapage.h> > +#include <vdso/getrandom.h> > +#include <asm/vdso/getrandom.h> > +#include <asm/vdso/vsyscall.h> > + > +#define MEMCPY_AND_ZERO_SRC(type, dst, src, len) do { \ > + while (len >= sizeof(type)) { \ > + __put_unaligned_t(type, __get_unaligned_t(type, src), dst); \ > + __put_unaligned_t(type, 0, src); \ > + dst += sizeof(type); \ > + src += sizeof(type); \ > + len -= sizeof(type); \ > + } \ > +} while (0) > + > +static void memcpy_and_zero_src(void *dst, void *src, size_t len) > +{ > + if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) { > + if (IS_ENABLED(CONFIG_64BIT)) > + MEMCPY_AND_ZERO_SRC(u64, dst, src, len); > + MEMCPY_AND_ZERO_SRC(u32, dst, src, len); > + MEMCPY_AND_ZERO_SRC(u16, dst, src, len); > + } > + MEMCPY_AND_ZERO_SRC(u8, dst, src, len); > +} > + > +/** > + * __cvdso_getrandom_data - Generic vDSO implementation of getrandom() syscall. > + * @rng_info: Describes state of kernel RNG, memory shared with kernel. > + * @buffer: Destination buffer to fill with random bytes. > + * @len: Size of @buffer in bytes. > + * @flags: Zero or more GRND_* flags. > + * @opaque_state: Pointer to an opaque state area. > + * > + * This implements a "fast key erasure" RNG using ChaCha20, in the same way that the kernel's > + * getrandom() syscall does. It periodically reseeds its key from the kernel's RNG, at the same > + * schedule that the kernel's RNG is reseeded. If the kernel's RNG is not ready, then this always > + * calls into the syscall. > + * > + * @opaque_state *must* be allocated using the vgetrandom_alloc() syscall. Unless external locking > + * is used, one state must be allocated per thread, as it is not safe to call this function > + * concurrently with the same @opaque_state. However, it is safe to call this using the same > + * @opaque_state that is shared between main code and signal handling code, within the same thread. > + * > + * Returns the number of random bytes written to @buffer, or a negative value indicating an error. > + */ > +static __always_inline ssize_t > +__cvdso_getrandom_data(const struct vdso_rng_data *rng_info, void *buffer, size_t len, > + unsigned int flags, void *opaque_state) I don’t love this function signature. I generally think that, if you’re going to have user code pass a pointer to kernel code, either make the buffer have a well defined, constant size or pass a length. As it stands, one cannot locally prove that user code that calls it is memory-safe. In fact, any caller that has the misfortune of running under CRIU is *not* memory safe if CRIU allows the vDSO to be preserved. Ouch. (CRIU has some special code for this. I'm not 100% clear on all the details.) One could maybe sort of get away with treating the provided opaque_state as a completely opaque value and not a pointer, but then the mechanism for allocating these states should be adjusted accordingly. One thing that occurs to me is that, if this thing were to be made CRIU-safe, the buffer could have a magic number that changes any time the data structure changes, and the vDSO could check, at the beginning and end of the call, that the magic number is correct. Doing this would require using a special VMA type instead of just a wipe-on-fork mapping, which could plausibly be a good thing anyway. (Hmm, we don't just want WIPEONFORK. We should probably also wipe on swap-out and, more importantly, we should absolutely wipe on any sort of CRIU-style checkpointing. Perhaps a special VMA would be a good thing for multiple reasons. > +{ > + ssize_t ret = min_t(size_t, INT_MAX & PAGE_MASK /* = MAX_RW_COUNT */, len); > + struct vgetrandom_state *state = opaque_state; > + size_t batch_len, nblocks, orig_len = len; > + unsigned long current_generation; > + void *orig_buffer = buffer; > + u32 counter[2] = { 0 }; > + bool in_use, have_retried = false; > + > + /* The state must not straddle a page, since pages can be zeroed at any time. */ > + if (unlikely(((unsigned long)opaque_state & ~PAGE_MASK) + sizeof(*state) > PAGE_SIZE)) > + goto fallback_syscall; This is weird. Either the provided pointer is valid or it isn’t. Reasonable outcomes are a segfault if the pointer is bad or success (or fallback if needed for some reason) if the pointer is good. Why is there specific code to catch a specific sort of pointer screwup here?