Hey Florian,

On Mon, Aug 01, 2022 at 10:48:01AM +0200, Florian Weimer wrote:
> * Jason A. Donenfeld:
>
> > API-wise, vDSO getrandom has a pair of functions:
> >
> >   ssize_t getrandom(void *state, void *buffer, size_t len, unsigned int flags);
> >   void *getrandom_alloc([inout] size_t *num, [out] size_t *size_per_each);
> >
> > In the first function, the return value and the latter 3 arguments are
> > the same as ordinary getrandom(), while the first argument is a pointer
> > to some state allocated with getrandom_alloc(). getrandom_alloc() takes
> > the desired number of states, and returns an array of states, the number
> > actually allocated, and the size in bytes of each one, enabling a libc
> > to use one per thread. We very intentionally do *not* leave state
> > allocation up to the caller. There are too many weird things that can go
> > wrong, and it's important that vDSO does not provide too generic of a
> > mechanism. It's not going to store its state in just any old memory
> > address. It'll do it only in ones it allocates.
>
> I still don't see why this couldn't be per-thread state (if you handle
> fork generations somehow).

That actually *is* the intent of this v2. Specifically, you call
getrandom_alloc() and you get an *array* of states, which you can then
pass off to various threads. Since we have to allocate in page sizes, we
can't do this piecemeal, so this is a mechanism for giving out chunks of
them (~28 at a time), which you'd then give to threads as they're
created, making more as needed.

> I also think it makes sense to introduce batching for the system call
> implementation first, and tie that to the vDSO acceleration. I expect a
> large part of the benefit comes from the batching, not the system call
> avoidance.

What I understand you to mean is that *instead of* doing vDSO, we could
just batch in the kernel, and reap most of the performance benefits. If
that turns out to be true, and we then don't even need this vDSO stuff,
I'd be really happy.
So I'll give this a try. One question is where to store that batch. On
the surface, per-cpu seems appealing, like what we do for
get_random_u32() and such for kernel callers. But per-cpu means
disabling preemption, which then becomes a problem when copying into
userspace, where the copies can fault. So maybe something more sensible
is, like above, just doing this per-task. I'll give it a stab and will
let you know what it looks like.

Jason