On Sat, 2021-07-24 at 00:39 +0200, Vlastimil Babka wrote:
> On 7/21/21 11:33 AM, Mike Galbraith wrote:
> > On Wed, 2021-07-21 at 10:44 +0200, Vlastimil Babka wrote:
> > >
> > > So this doesn't look like our put_cpu_partial() preempted a
> > > __slab_alloc() on the same cpu, right?
> >
> > No, likely it was the one preempted by someone long gone, but we'll
> > never know without setting a trap.
> >
> > > BTW did my ugly patch work?
> >
> > Nope. I guess you missed my reporting it to have been a -ENOBOOT, and
>
> Indeed, I misunderstood it as you talking about your patch.
>
> > that cutting it in half, ie snagging only __slab_free() does boot, and
> > seems to cure all of the RT fireworks.
>
> OK, so depending on drain=1 makes this apply only to put_cpu_partial()
> called from __slab_free and not get_partial_node(). One notable
> difference is that in __slab_free we don't have n->list_lock locked and
> in get_partial_node() we do. I guess in case your list_lock is made raw
> again by another patch, that explains a local_lock can't nest under it.
> If not, then I would expect this to work (I don't think they ever nest
> in the opposite order, also lockdep should tell us instead of
> -ENOBOOT?), but might be missing something...

RT used to convert list_lock to raw_spinlock_t, but no longer does.

Whatever is going on, box does not emit a single sign of life with the
full patch.

> I'd rather not nest those locks in any case. I just need to convince
> myself that the scenario the half-fix fixes is indeed the only one
> that's needed and we're not leaving there other races that are just
> harder to trigger...

Yup. I can only state with confidence that the trouble I was able to
easily reproduce was fixed up by serializing __slab_free(). Hopefully
you'll find that's the only hole in need of plugging.

	-Mike