"Paul E. McKenney" <paulmck@xxxxxxxxxx> writes: > Hence Ulad's work on kfree_rcu(). The approach is to allocate a > page-sized array to hold all the pointers, then fill in the rest of these > pointers on each subsequent kfree_rcu() call. These arrays of pointers > also allows use of kfree_bulk() instead of kfree(), which can improve > performance yet more. It is no big deal if kfree_rcu()'s allocation > attempts fail occasionally because it can simply fall back to the old > linked-list approach. And given that the various lockless caches in > the memory allocator are almost never empty, in theory life is good. Of course, it's always the damned reality which ruins the fun. > But in practice, mainline now has CONFIG_PROVE_RAW_LOCK_NESTING, > and for good reason -- this Kconfig option makes it at least a > little bit harder for mainline developers to mess up RT. But with > CONFIG_PROVE_RAW_LOCK_NESTING=y and lockdep enabled, mainline will now > sometimes complain if you invoke kfree_rcu() while holding a raw spinlock. > This happens when kfree_rcu() needs to invoke the memory allocator and > the memory allocator's caches are empty, thus resulting in the memory > allocator attempting to acquire a non-raw spinlock. Right. > Because kfree_rcu() has a fallback available (just using the old linked > list), kfree_rcu() would work well given a way to tell the memory > allocator to return NULL instead of acquiring a non-raw spinlock. > Which is exactly what Ulad's recent patches are intended to do. That much I understood, but I somehow failed to figure the why out despite the elaborate changelog. 2 weeks of 30+C seem to have cooked my brain :) > Since then, this thread has discussed various other approaches, > including using existing combinations of GFP_ flags, converting > the allocator's zone lock to a raw spinlock, and so on. > > Does that help, or am I missing the point of your question? Yes, that helps so far that I understand what the actual problem is. 
It does not really help to make me more happy. :)

That said, we can support atomic allocations on RT up to the point
where zone->lock comes into play. We don't know yet exactly where the
zone->lock induced damage happens. Presumably it's inside
free_pcppages_bulk() - at least that's where I have faint bad memories
from 15+ years ago. Aside from that, I seriously doubt that it can be
made to work within a reasonable time frame.

But what makes me really unhappy is that my defense line against
allocations from truly atomic contexts (from the RT POV), which was
enforced on RT, gets a real big gap shot into it. It becomes pretty
hard to argue why atomic allocations via kmalloc() or
kmem_cache_alloc() should be treated any differently. Technically they
can work similarly to the page allocations up to the point where
regular spinlocks come into play or the slab cache is exhausted. Where
to draw the line? It's also unclear for the page allocator case
whether we can and should put a limit on the number of pages and/or
the page order.

Myself and others spent a considerable amount of time killing off
these kinds of allocations from various interesting places, including
the guts of send-IPI, the affinity setting path and others, where
people just slapped allocations in because the stack checker warned or
because they happened to copy the code from some other place. RT was
pretty much a quick crap detector whenever new incarnations of this
got added, and to some extent continuous education about these issues
made them less prominent over the years. Using atomic allocations
should always have a really good rationale, not only in the cases
where they collide with RT.

I can understand your rationale and what you are trying to solve.
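To spell out the contract I have in mind, here is a toy userspace model
of "fail early instead of taking the lock". The flag name MODEL_NOLOCK
and the whole allocator are invented for illustration and do not
correspond to any existing kernel API; they only model the semantics
discussed here:

```c
#include <stdlib.h>

#define MODEL_NOLOCK 0x1   /* invented flag: "return NULL rather than lock" */
#define CACHE_SLOTS  4

static void *lockless_cache[CACHE_SLOTS];  /* models a per-CPU pcp list */
static int cache_top;                      /* slots currently populated */
static int slow_lock_taken;                /* counts zone->lock style trips */

/* Toy allocator.  With MODEL_NOLOCK set, an empty lockless cache makes
 * the allocation fail instead of entering the locked slow path -- the
 * behaviour a caller with a fallback (like kfree_rcu()) wants from
 * truly atomic context on RT. */
static void *model_alloc(size_t size, int flags)
{
	if (cache_top > 0)
		return lockless_cache[--cache_top];  /* fast path, lock-free */

	if (flags & MODEL_NOLOCK)
		return NULL;	/* caller promised it can cope with failure */

	slow_lock_taken++;	/* models acquiring the sleeping zone->lock */
	return malloc(size);
}
```

The interesting property is that a MODEL_NOLOCK caller can never end up
in the locked slow path, which is exactly what makes such allocations
auditable on RT.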
So, if we can actually have a distinct GFP variant:

   GFP_I_ABSOLUTELY_HAVE_TO_DO_THAT_AND_I_KNOW_IT_CAN_FAIL_EARLY

which is easy to grep for, then having the page allocator go down to
the point where the zone lock gets involved is not the end of the
world for RT, in theory - unless that damned reality tells otherwise. :)

The page allocator allocations should also have a limit on the number
of pages and eventually also on the page order (I need to stare at the
code or let Michal educate me that the order does not matter).

To make it consistent, the same GFP_ variant should allow the slab
allocator to go down to the point where the slab cache is exhausted.

Having a distinct and clearly defined GFP_ variant is really key to
chasing down offenders and making reviewers double-check upfront why
this is absolutely required.

Thanks,

        tglx