Re: [PATCH v2] arc4random: simplify design for better safety

"Jason A. Donenfeld" <Jason@xxxxxxxxx> · Tue, 26 Jul 2022 13:04:25 +0200

Hi Florian,

On Tue, Jul 26, 2022 at 11:55:23AM +0200, Florian Weimer wrote:
> * Jason A. Donenfeld:
> 
> > +      pfd.fd = TEMP_FAILURE_RETRY (
> > +	  __open64_nocancel ("/dev/random", O_RDONLY | O_CLOEXEC | O_NOCTTY));
> > +      if (pfd.fd < 0)
> > +	arc4random_getrandom_failure ();
> > +      if (__poll (&pfd, 1, -1) < 0)
> > +	arc4random_getrandom_failure ();
> > +      if (__close_nocancel (pfd.fd) < 0)
> > +	arc4random_getrandom_failure ();
> 
> What happens if /dev/random is actually /dev/urandom?  Will the poll
> call fail?

Yes. I'm unsure if you're asking this because it'd be a nice
simplification to only have to open one fd, or because you're worried
about confusion. I don't think the confusion problem is one we should
take too seriously, but if you're concerned, we can always fstat and
check the maj/min. Seems a bit much, though.

> I think we need a no-cancel variant of poll here, and we also need to
> handle EINTR gracefully.

Thanks for the note about poll nocancel. I'll try to add this. I don't
totally know how to manage that pluming, but I'll give it my best shot.

> Performance-wise, my 1000 element shuffle benchmark runs about 14 times
> slower without userspace buffering.  (For comparison, just removing
> ChaCha20 while keeping a 256-byte buffer makes it run roughly 25% slower
> than current master.)  Our random() implementation is quite slow, so
> arc4random() as a replacement call is competitive.  The unbuffered
> version, not so much.

Yes, as mentioned, this is slower. But let's get something down first
that's *correct*, and then after we can start optimizing it. Let's not
prematurely optimize and create a problematic function that nobody
should use.

> Running the benchmark, I see 40% of the time spent in chacha_permute in
> the kernel, that is really quite odd.  Why doesn't the system call
> overhead dominate?

Huh, that is interesting. I guess if you're reading 4 bytes for an
integer, it winds up computing a whole chacha block each time, with half
of it doing fast key erasure and half of it being returnable to the
caller. When we later figure out a safer way to buffer, ostensibly this
will go away. But for now, we really should not prematurely optimize.

I'll have v3 out shortly with your suggested fixes.

Jason