On Fri, Sep 16, 2022 at 02:46:39AM -0700, Kees Cook wrote:
> On Fri, Sep 16, 2022 at 09:38:33AM +0100, Matthew Wilcox wrote:
> > On Thu, Sep 15, 2022 at 05:59:56PM -0600, Yu Zhao wrote:
> > > I think this is a manifestation of the lockdep warning I reported
> > > a couple of weeks ago:
> > > https://lore.kernel.org/r/CAOUHufaPshtKrTWOz7T7QFYUNVGFm0JBjvM700Nhf9qEL9b3EQ@xxxxxxxxxxxxxx/
> >
> > That would certainly match the symptoms.
> >
> > Turning vmap_lock into an NMI-safe lock would be bad.  I don't even
> > know if we have primitives for that (it's not like you can disable
> > an NMI ...)
> >
> > I don't quite have time to write a patch right now.  Perhaps
> > something like:
> >
> > struct vmap_area *find_vmap_area_nmi(unsigned long addr)
> > {
> >         struct vmap_area *va;
> >
> >         if (!spin_trylock(&vmap_area_lock))
> >                 return NULL;
> >         va = __find_vmap_area(addr, &vmap_area_root);
> >         spin_unlock(&vmap_area_lock);
> >
> >         return va;
> > }
> >
> > and then call find_vmap_area_nmi() in check_heap_object().  I may
> > have the polarity of the return value of spin_trylock() incorrect.
>
> I think we'll need something slightly tweaked, since this would
> return NULL under any contention (and a NULL return is fatal in
> check_heap_object()).  It seems like we need to explicitly check for
> being in NMI context in check_heap_object() to deal with it?  Like
> this (only build tested):

Right, and Ulad is right about it being callable from any context.
I think the long-term solution is to make the vmap_area_root tree
walkable under RCU protection.  For now, let's have a distinct return
code (ERR_PTR(-EAGAIN), perhaps?) to indicate that we've hit
contention; there's a rough sketch at the end of this mail.  It
generally won't matter if we hit contention in process context,
because hardening doesn't have to be 100% reliable to be useful.

Erm ... so what prevents this race:

CPU 0                                   CPU 1
copy_to_user()
  check_heap_object()
    area = find_vmap_area(addr)
                                        __purge_vmap_area_lazy()
                                          merge_or_add_vmap_area_augment()
                                            __merge_or_add_vmap_area()
                                              kmem_cache_free(vmap_area_cachep, va);
    if (n > area->va_end - addr) {

Yes, it's a race in the code that allocated this memory: two threads
are simultaneously calling copy_to_user() and __vunmap() on it.
We'll catch that bad behaviour sooner rather than later, but
sometimes, in trying to catch the bug, we'll get caught by the bug
and go splat ourselves.  I don't know that we need to go through
heroics to be sure we don't get caught by it.  The free path already
has to run a workqueue to do the actual freeing.  We could delay it
even further with RCU or something (see the last sketch below), but
we'd only be trading off one kind of badness for another.
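
Since Kees's patch itself isn't quoted above, here is just the shape
of the in_nmi() idea as I read it.  This is a hypothetical sketch,
not his actual diff; it assumes the find_vmap_area_nmi() helper from
the sketch above:

        /* In check_heap_object(), for the vmalloc case. */
        if (is_vmalloc_addr(ptr)) {
                struct vmap_area *area;

                if (in_nmi()) {
                        /*
                         * Taking vmap_area_lock here could deadlock
                         * against the context we interrupted, so only
                         * try it; on contention, express no opinion.
                         */
                        area = find_vmap_area_nmi(addr);
                        if (!area)
                                return;
                } else {
                        area = find_vmap_area(addr);
                        if (!area)
                                usercopy_abort("vmalloc", "no area",
                                               to_user, 0, n);
                }

                if (n > area->va_end - addr)
                        usercopy_abort("vmalloc", NULL, to_user,
                                       addr - area->va_start, n);
                return;
        }

The wart is visible right there: in NMI context, a NULL from the
trylock variant conflates "lock contended" with "not a vmap area at
all", which is what a distinct return code would fix.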
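
Concretely, the ERR_PTR(-EAGAIN) version would be something like this
(untested sketch; find_vmap_area_try() is a made-up name, not an
existing function):

        struct vmap_area *find_vmap_area_try(unsigned long addr)
        {
                struct vmap_area *va;

                /*
                 * Distinguish "addr is not a vmap area" (NULL) from
                 * "couldn't take the lock" (ERR_PTR(-EAGAIN)) so that
                 * callers can treat contention as "no opinion" rather
                 * than as a fatal failure.
                 */
                if (!spin_trylock(&vmap_area_lock))
                        return ERR_PTR(-EAGAIN);
                va = __find_vmap_area(addr, &vmap_area_root);
                spin_unlock(&vmap_area_lock);

                return va;
        }

and the caller in check_heap_object() becomes:

                area = find_vmap_area_try(addr);
                /* Contended: stay silent rather than falsely abort. */
                if (area == ERR_PTR(-EAGAIN))
                        return;
                if (!area)
                        usercopy_abort("vmalloc", "no area", to_user,
                                       0, n);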
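
And for completeness, "delay it even further with RCU" would look
roughly like this.  Again only a sketch under stated assumptions:
struct vmap_area grows an rcu_head, and the lookup side would still
need rcu_read_lock() plus a tree that is actually safe to walk
locklessly, which is the hard part:

        /* Assumes a new member in struct vmap_area: */
        struct rcu_head rcu;

        static void vmap_area_free_rcu(struct rcu_head *head)
        {
                struct vmap_area *va =
                        container_of(head, struct vmap_area, rcu);

                kmem_cache_free(vmap_area_cachep, va);
        }

        /*
         * ... and at the free sites, instead of calling
         * kmem_cache_free(vmap_area_cachep, va) directly:
         */
        call_rcu(&va->rcu, vmap_area_free_rcu);

Note that this only keeps the struct vmap_area itself from being
reused under a racing reader; it does nothing for the pages behind
the mapping, so it narrows the window without fixing the caller's
bug, which is the trade-off I mean above.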