Hello!

We have an interesting issue involving interactions between RCU, memory allocation, and "raw atomic" contexts.  The most attractive solution to this issue requires adding a new GFP_ flag.  Perhaps this is a big ask, but on the other hand, the benefit is a large reduction in linked-list-induced cache misses when invoking RCU callbacks.  For more details, please read on!

Examples of raw atomic contexts include disabled hardware interrupts (that is, a hardware irq handler rather than a threaded irq handler), code holding a raw_spinlock_t, and code with preemption disabled (but only in cases where -rt cannot safely map it to disabled migration).  It turns out that call_rcu() is already invoked from raw atomic contexts, and we therefore anticipate that kfree_rcu() will also be at some point.

This matters due to recent work to fix a weakness in both call_rcu() and kfree_rcu() that was pointed out long ago by Christoph Lameter, among others.  The weakness is that RCU traverses linked callback lists when invoking those callbacks.  Because the just-ended grace period will have rendered these lists cache-cold, this results in an expensive cache miss on each and every callback invocation.

Uladzislau Rezki (CCed) has recently produced patches for kfree_rcu() that instead store pointers to callbacks in arrays, so that callback invocation can step through the array using the kfree_bulk() interface.  This greatly reduces the number of cache misses.  The benefits are not subtle:

https://lore.kernel.org/lkml/20191231122241.5702-1-urezki@xxxxxxxxx/

Of course, the arrays have to come from somewhere, and that somewhere is the memory allocator.  Yes, memory allocation can fail, but in that rare case, kfree_rcu() just falls back to the old approach, taking a few extra cache misses, but making good (if expensive) forward progress.  This works well until someone invokes kfree_rcu() with a raw spinlock held.
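To make the array-plus-fallback idea concrete, here is a minimal userspace C model of the pattern.  All names (queue_free(), drain(), struct dead_obj, and so on) are hypothetical; the real kernel code is Uladzislau's kfree_rcu() work using kfree_bulk(), not this sketch.  The point is only the shape: batch pointers into an array when one is available, and chain through the dead objects themselves when it is not.

```c
/* Userspace model of batched deferred freeing with a linked-list
 * fallback.  All identifiers here are hypothetical illustrations. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BATCH 8

struct dead_obj {              /* fallback: chain through the object itself */
	struct dead_obj *next;
};

struct batch {
	void *ptrs[BATCH];     /* cache-friendly array of dead pointers */
	int nr;
};

/* Queue an object for deferred freeing.  Prefer the array (so the
 * later bulk free walks a dense, cache-warm array); fall back to a
 * linked list when no slot is available, taking the extra cache
 * misses but still making forward progress. */
static int queue_free(struct batch *b, struct dead_obj **fallback, void *obj)
{
	if (b && b->nr < BATCH) {
		b->ptrs[b->nr++] = obj;
		return 1;      /* batched */
	}
	struct dead_obj *d = obj;

	d->next = *fallback;
	*fallback = d;
	return 0;              /* fell back to the list */
}

/* Free everything queued so far; returns the number of objects freed.
 * The array loop stands in for kfree_bulk(); the list walk stands in
 * for the traditional cache-cold callback-list traversal. */
static int drain(struct batch *b, struct dead_obj **fallback)
{
	int n = 0;

	if (b) {
		for (int i = 0; i < b->nr; i++, n++)
			free(b->ptrs[i]);
		b->nr = 0;
	}
	while (*fallback) {
		struct dead_obj *d = *fallback;

		*fallback = d->next;
		free(d);
		n++;
	}
	return n;
}
```

The failure mode discussed below enters at the point where struct batch itself must be allocated: that allocation is what can require the allocator's non-raw spinlock.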
Even that works fine unless the memory allocator has exhausted its caches, at which point it will acquire a normal spinlock.  In kernels built with CONFIG_PROVE_RAW_LOCK_NESTING=y this will result in a lockdep splat.  Worse yet, in -rt kernels, this can result in scheduling while atomic.

So, may we add a GFP_ flag that will cause kmalloc() and friends to return NULL when they would otherwise need to acquire their non-raw spinlock?  This avoids adding any overhead to the slab-allocator fastpaths, but allows callback invocation to reduce cache misses without having to restructure some existing callers of call_rcu() and potential future callers of kfree_rcu().

Thoughts?

							Thanx, Paul

PS.  Other avenues investigated:

o	Just don't invoke kmalloc() when kfree_rcu() is invoked from
	raw atomic contexts.  The problem with this is that there is
	no way to detect raw atomic contexts in production kernels
	built with CONFIG_PREEMPT=n.  Adding means to detect this would
	increase overhead on numerous fastpaths.

o	Just say "no" to invoking call_rcu() and kfree_rcu() from raw
	atomic contexts.  This would require that the affected call_rcu()
	and kfree_rcu() invocations be deferred.  This is in theory
	simple, but can get quite messy, and often requires fallbacks
	such as timers that can degrade energy efficiency and realtime
	response.

o	Provide a different non-allocating API such as kfree_rcu_raw()
	and call_rcu_raw() that are used from raw atomic contexts and
	also on memory-allocation failure from kfree_rcu() and
	call_rcu().  This results in unconditional callback-invocation
	cache misses for calls from raw contexts, including for common
	code that is only occasionally invoked from raw atomic contexts.
	This approach also forces developers to worry about two more
	RCU API members.

o	Move the memory allocator's spinlocks to raw_spinlock_t.
	This would be bad for realtime response, and would likely
	require even more conversions when the allocator invokes other
	subsystems that also use non-raw spinlocks.
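For clarity, here is a userspace C model of the semantics the proposed flag would have.  The flag name MODEL_GFP_NO_LOCKS and every helper below are hypothetical; this is not the kernel's slab code, just a sketch of "serve from the lockless fastpath if possible, otherwise fail instead of taking a non-raw spinlock."

```c
/* Userspace model of the proposed allocation-flag semantics.
 * All names here are hypothetical illustrations, not kernel APIs. */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_GFP_NO_LOCKS 0x1  /* proposed: never take a non-raw lock */

#define CACHE_SLOTS 4

static void *cache[CACHE_SLOTS]; /* stand-in for a lockless per-CPU cache */
static int cached;

/* In the kernel, this path is the one that acquires a normal (non-raw)
 * spinlock, and is therefore unsafe from raw atomic contexts. */
static void *slowpath_alloc(size_t size)
{
	return malloc(size);
}

static void *model_kmalloc(size_t size, unsigned int flags)
{
	if (cached)
		return cache[--cached]; /* lockless fastpath: always safe */
	if (flags & MODEL_GFP_NO_LOCKS)
		return NULL;            /* caches exhausted: fail rather
					 * than acquire the lock */
	return slowpath_alloc(size);
}

/* Refill the cache, e.g. from a context where locking is permitted. */
static void model_refill(void *p)
{
	if (cached < CACHE_SLOTS)
		cache[cached++] = p;
	else
		free(p);
}
```

Note that the fastpath check costs nothing extra when the flag is clear, which is the property claimed above: no overhead on the slab-allocator fastpaths, with the NULL return feeding kfree_rcu()'s existing linked-list fallback.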