Hi Florian, On Wed, Nov 30, 2022 at 11:51:59AM +0100, Florian Weimer wrote: > * Jason A. Donenfeld: > > > +#ifdef CONFIG_VGETRANDOM_ALLOC_SYSCALL > > +/** > > + * vgetrandom_alloc - allocate opaque states for use with vDSO getrandom(). > > + * > > + * @num: on input, a pointer to a suggested hint of how many states to > > + * allocate, and on output the number of states actually allocated. > > Should userspace call this system call again if it needs more states? > The interface description doesn't make this clear. Yes. And indeed that's what Adhemerval's patch does. > > > + * @size_per_each: the size of each state allocated, so that the caller can > > + * split up the returned allocation into individual states. > > + * > > + * @flags: currently always zero. > > + * > > + * The getrandom() vDSO function in userspace requires an opaque state, which > > + * this function allocates by mapping a certain number of special pages into > > + * the calling process. It takes a hint as to the number of opaque states > > + * desired, and provides the caller with the number of opaque states actually > > + * allocated, the size of each one in bytes, and the address of the first > > + * state. > > + > > + * Returns a pointer to the first state in the allocation. > > + * > > + */ > > How do we deallocate this memory? Must it remain permanently allocated? It can be deallocated with munmap. > Can userspace use the memory for something else if it's not passed to > getrandom? I suspect the documentation answer here is, "no", even if technically it might happen to work on this kernel or that kernel. I suppose this could even be quasi-enforced by xoring the top bits with some vdso compile-time constant, so you can't rely on being able to dereference it yourself. > The separate system call strongly suggests that the > allocation is completely owned by the kernel, but there isn't > documentation here how the allocation life-cycle is supposed to look > like. In particular, it is not clear if vgetrandom_alloc or getrandom > could retain a reference to the allocation in a future implementation of > these interfaces. > > Some users might want to zap the memory for extra hardening after use, > and it's not clear if that's allowed, either. I don't think zapping that memory is supported, or even a sensible thing to do. In the first place, I don't think we should suggest that the user can dereference that pointer, at all. In that sense, maybe it's best to call it a "handle" or something similar (a "HANDLE"! a "HWND"? a "HRNG"? just kidding). In the second place, the fast erasure aspect of this means that such hardening would have no effect -- the key is overwritten after using for forward secrecy, anyway, and batched bytes are zeroed. (There is a corner case that might make it interesting to wipe in the parent, not just the child, on fork, but that's sort of a separate matter and would ideally be handled by kernel space anyway.) > If there's no registration of the allocation, it's not clear why we need > a separate system call for this. From a documentation perspective, it > may be easier to describe proper use of the getrandom vDSO call if > ownership resides with userspace. No, absolutely not, for the multiple reasons already listed in the commit messages and cover letter and previous emails. But you seem aware of this: > But it will constrain future > evolution of the implementation because you can't add registration > (retaining a reference to the passed-in area in getrandom) after the > fact. But I'm not sure if this is possible with the current interface, > either. Userspace has to make some assumptions about the life-cycle to > avoid a memory leak on thread exit. It sounds like this is sort of a different angle on Rasmus' earlier comment about how munmap leaks implementation details. Maybe there's something to that after all? Or not? I see two approaches: 1) Keep munmap as the allocation function. If later on we do fancy registration and in-kernel state tracking, or add fancy protection flags, or whatever else, munmap should be able to identify these pages and carry out whatever special treatment is necessary. 2) Convert vgetrandom_alloc() into a clone3-style syscall, as Christian suggested earlier, which might allow for a bit more overloading capability. That would be a struct that looks like: struct vgetrandom_alloc_args { __aligned_u64 flags; __aligned_u64 states; __aligned_u64 num; __aligned_u64 size_of_each; } - If flags is VGRA_ALLOCATE, states and size_of_each must be zero on input, while num is the hint, as is the case now. On output, states, size_of_each, and num are filled in. - If flags is VGRA_DEALLOCATE, states, size_of_each, and num must be as they were originally, and then it deallocates. I suppose (2) would alleviate your concerns entirely, without future uncertainty over what it'd be like to add special cases to munmap(). And it'd add a bit more future proofing to the syscall, depending on what we do. So maybe I'm warming up to that approach a bit. > > + num_states = clamp_t(size_t, num_hint, 1, (SIZE_MAX & PAGE_MASK) / state_size); > > + alloc_size = PAGE_ALIGN(num_states * state_size); > > Doesn't this waste space for one state if state_size happens to be a > power of 2? Why do this SIZE_MAX & PAGE_MASK thing at all? Shouldn't > it be PAGE_SIZE / state_size? The first line is a clamp. That fixes num_hint between 1 and the largest number that when multiplied and rounded up won't overflow. So, if state_size is a power of two, let's say 256, and there's only one state, here's what that looks like: num_states = clamp(1, 1, (0xffffffff & (~(4096 - 1))) / 256 = 16777200) = 1 alloc_size = PAGE_ALIGN(1 * 256) = 4096 So that seems like it's working as intended, right? Or if not, maybe it'd help to write out the digits you're concerned about? > > + if (put_user(alloc_size / state_size, num) || put_user(state_size, size_per_each)) > > + return -EFAULT; > > + > > + pages_addr = vm_mmap(NULL, 0, alloc_size, PROT_READ | PROT_WRITE, > > + MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, 0); > > I think Rasmus has already raised questions about MAP_LOCKED. > > I think the kernel cannot rely on it because userspace could call > munlock on the allocation. Then they're caught holding the bag? This doesn't seem much different from userspace shooting themselves in general, like writing garbage into the allocated states and then trying to use them. If this is something you really, really are concerned about, then maybe my cheesy dumb xor thing mentioned above would be a low effort mitigation here. Jason