On Thu, Mar 19, 2020 at 12:05:22PM +0000, George Spelvin wrote:
> The separate "rand" and "rand_count" variables could get out of
> sync with bad results.
>
> In the worst case, two threads would see rand_count=1 and both
> decrement it, resulting in rand_count=255 and rand being filled
> with zeros for the next 255 calls.
>
> Instead, pack them both into a single, atomically updateable,
> variable. This makes it a lot easier to reason about race
> conditions. They are still there - the code deliberately eschews
> locking - but basically harmless on the rare occasions that
> they happen.
>
> Second, use READ_ONCE and WRITE_ONCE. Because the random bit
> buffer is accessed by multiple threads concurrently without
> locking, omitting those puts us deep in the land of nasal demons.
> The compiler would be free to spill to the static variable in
> arbitrarily perverse ways and create hard-to-find bugs.
>
> (I'm torn between this and just declaring the buffer "volatile".
> Linux tends to prefer marking accesses rather than variables,
> but in this case, every access to the buffer is volatile.
> It makes no difference to the generated code.)
>
> Third, use long rather than u64. This not only keeps the state
> atomically updateable, it also speeds up the fast path on 32-bit
> machines. Saving at least three instructions on the fast path (one
> load, one add-with-carry, and one store) is worth a second call
> to get_random_u*() per 64 bits. The fast path of get_random_u*
> is less than the 3*64 = 192 instructions saved, and the slow path
> happens every 64 bytes so isn't affected by the change.
>
> Fourth, make the function inline. It's small, and there's only
> one caller (in mm/page_alloc.c:__free_one_page()), so avoid the
> function call overhead.
>
> Fifth, use the msbits of the buffer first (left shift) rather
> than the lsbits (right shift). Testing the sign bit produces
> slightly smaller/faster code than testing the lsbit.
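
[Editor's note: the packed-word scheme described above can be sketched in userspace C. The names (random_bool, rand_bits), the refills instrumentation counter, and the unsigned-long-only READ_ONCE/WRITE_ONCE stand-ins are illustrative assumptions for this sketch, not the kernel's actual code; the real function is shuffle_pick_tail() in shuffle.h.]

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace stand-ins: the kernel's READ_ONCE/WRITE_ONCE are
 * type-generic; these unsigned-long-only versions suffice here. */
#define READ_ONCE_UL(x)     (*(volatile unsigned long *)&(x))
#define WRITE_ONCE_UL(x, v) (*(volatile unsigned long *)&(x) = (v))

/* Stand-in for the kernel's get_random_long(). */
static unsigned long get_random_long(void)
{
	return ((unsigned long)rand() << 31) ^ (unsigned long)rand();
}

static unsigned long rand_bits;  /* packed buffer + implicit count */
static int refills;              /* instrumentation, not in the kernel */

/*
 * One word carries both the random bits and the remaining count:
 * a guard bit is ORed into the lsb on refill, so the word reads
 * zero exactly when every payload bit has been shifted out.
 * Left-shifting consumes the msbit first, as the commit describes.
 */
static int random_bool(void)
{
	unsigned long r = READ_ONCE_UL(rand_bits);
	unsigned long rshift = r << 1;

	if (rshift == 0) {              /* exhausted: refill slow path */
		r = get_random_long();
		rshift = (r << 1) | 1;  /* plant the guard bit */
		refills++;
	}
	WRITE_ONCE_UL(rand_bits, rshift);

	return (int)(r >> (sizeof(r) * 8 - 1));  /* msbit just consumed */
}
```

The guard bit is what makes a separate rand_count unnecessary: the count is encoded in the position of the lowest set bit, so the whole state fits in one atomically storeable word.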
> I've tried shifting both ways, and copying the desired bit to a
> boolean before shifting rather than keeping separate full-width
> r and rshift variables, but both produce larger code:
>
> 		x86-64 text size
> Msbit		42236
> Lsbit		42242 (+6)
> Lsbit+bool	42258 (+22)
> Msbit+bool	42284 (+52)
>
> (Since this is straight-line code, size is a good proxy for number
> of instructions and execution time. Using READ/WRITE_ONCE instead of
> volatile makes no difference.)
>
> In a perfect world, on x86-64 the fast path would be:
>
> 	shlq	rand(%rip)
> 	jz	refill
> refill_complete:
> 	jc	add_to_tail
>
> but I don't see how to get gcc to generate that, and this
> function isn't worth an arch-specific implementation.
>
> Signed-off-by: George Spelvin <lkml@xxxxxxx>
> Acked-by: Kees Cook <keescook@xxxxxxxxxxxx>
> Acked-by: Dan Williams <dan.j.williams@xxxxxxxxx>
> Cc: Alexander Duyck <alexander.duyck@xxxxxxxxx>
> Cc: Randy Dunlap <rdunlap@xxxxxxxxxxxxx>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: linux-mm@xxxxxxxxx
> ---
> v2: Rewrote commit message to explain existing races better.
>     Made local variables unsigned to avoid (technically undefined)
>     signed overflow.
> v3: Typos fixed, Acked-by, expanded commit message.
> v4: Rebase against -next; function has changed from
>     add_to_free_area_random() to shuffle_pick_tail(). Move to
>     inline function in shuffle.h.
>     Not sure if it's okay to keep Acked-by: after such a
>     significant change.

Good by me. Thanks!

-Kees

-- 
Kees Cook
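
[Editor's note: the races the commit message calls "basically harmless" can be made concrete with a deterministic, single-threaded replay of the classic lost-update interleaving on the packed word. The worst case is that both threads hand out the same bit and one shift is lost - there is no counter to underflow, unlike the old rand_count. The names below are illustrative assumptions, not the kernel's.]

```c
#include <assert.h>

static unsigned long rand_bits;  /* the packed buffer + count word */

/* One fast-path step, split into load/compute so an unlocked
 * interleaving can be replayed: returns the consumed msbit and
 * leaves the shifted word in *next instead of storing it. */
static int step(unsigned long r, unsigned long *next)
{
	*next = r << 1;
	return (int)(r >> (sizeof(r) * 8 - 1));
}

/* Both "threads" load before either stores; B's store clobbers A's.
 * Returns 1 when both observed the identical bit. */
static int replay_lost_update(unsigned long start)
{
	unsigned long a_next, b_next;
	int a_bit, b_bit;

	rand_bits = start;
	a_bit = step(rand_bits, &a_next);  /* A: load + shift */
	b_bit = step(rand_bits, &b_next);  /* B: loads the stale word */
	rand_bits = a_next;                /* A stores */
	rand_bits = b_next;                /* B stores: A's update lost */

	return a_bit == b_bit;
}
```

With the old separate-variable scheme, the same interleaving on rand_count could wrap the u8 counter to 255 as described in the commit message; here the damage is bounded to one reused random bit.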