On Fri, Sep 16, 2022 at 02:46:39AM -0700, Kees Cook wrote:
> On Fri, Sep 16, 2022 at 09:38:33AM +0100, Matthew Wilcox wrote:
> > On Thu, Sep 15, 2022 at 05:59:56PM -0600, Yu Zhao wrote:
> > > I think this is a manifestation of the lockdep warning I reported
> > > a couple of weeks ago:
> > > https://lore.kernel.org/r/CAOUHufaPshtKrTWOz7T7QFYUNVGFm0JBjvM700Nhf9qEL9b3EQ@xxxxxxxxxxxxxx/
> >
> > That would certainly match the symptoms.
> >
> > Turning vmap_lock into an NMI-safe lock would be bad.  I don't even
> > know if we have primitives for that (it's not like you can disable
> > an NMI ...)
> >
> > I don't quite have time to write a patch right now.  Perhaps
> > something like:
> >
> > struct vmap_area *find_vmap_area_nmi(unsigned long addr)
> > {
> >         struct vmap_area *va;
> >
> >         if (!spin_trylock(&vmap_area_lock))
> >                 return NULL;
> >         va = __find_vmap_area(addr, &vmap_area_root);
> >         spin_unlock(&vmap_area_lock);
> >
> >         return va;
> > }
> >
> > and then call find_vmap_area_nmi() in check_heap_object().  I may
> > have the polarity of the return value of spin_trylock() incorrect.
>
> I think we'll need something slightly tweaked, since this would
> return NULL under any contention (and a NULL return is fatal in
> check_heap_object()).  It seems like we need to explicitly check for
> being in NMI context in check_heap_object() to deal with it?  Like
> this (only build tested):

Right, and Ulad is right about it being callable from any context.
I think the long-term solution is to make the vmap_area_root tree
walkable under RCU protection.  For now, let's have a distinct return
code (ERR_PTR(-EAGAIN), perhaps?) to indicate that we've hit
contention; there's a rough sketch at the end of this mail.  It
generally won't matter if we hit contention in process context,
because hardening doesn't have to be 100% reliable to be useful.

Erm ... so what prevents this race:

CPU 0                                   CPU 1
copy_to_user()
  check_heap_object()
    area = find_vmap_area(addr)
                                        __purge_vmap_area_lazy()
                                          merge_or_add_vmap_area_augment()
                                            __merge_or_add_vmap_area()
                                              kmem_cache_free(vmap_area_cachep, va);
    if (n > area->va_end - addr) {

Yes, it's a race in the code that allocated this memory: two threads
are simultaneously calling copy_to_user() and __vunmap() on it.
We'll catch that bad behaviour sooner rather than later, but
sometimes, in trying to catch the bug, we'll get caught by the bug
and go splat ourselves.  I don't know that we need to go through
heroics to be sure we don't get caught by it.  The free path already
has to run a workqueue to do the actual freeing.  We could delay it
even further with RCU or something (see the last sketch below), but
we'd only be trading off one kind of badness for another.
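
Since Kees's patch itself isn't quoted above, here is just the shape
of the in_nmi() idea as I read it.  This is a hypothetical sketch,
not his actual diff; it assumes the find_vmap_area_nmi() helper from
the sketch above:

        /* In check_heap_object(), for the vmalloc case. */
        if (is_vmalloc_addr(ptr)) {
                struct vmap_area *area;

                if (in_nmi()) {
                        /*
                         * Taking vmap_area_lock here could deadlock
                         * against the context we interrupted, so only
                         * try it; on contention, express no opinion.
                         */
                        area = find_vmap_area_nmi(addr);
                        if (!area)
                                return;
                } else {
                        area = find_vmap_area(addr);
                        if (!area)
                                usercopy_abort("vmalloc", "no area",
                                               to_user, 0, n);
                }

                if (n > area->va_end - addr)
                        usercopy_abort("vmalloc", NULL, to_user,
                                       addr - area->va_start, n);
                return;
        }

The wart is visible right there: in NMI context, a NULL from the
trylock variant conflates "lock contended" with "not a vmap area at
all", which is what a distinct return code would fix.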
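
Concretely, the ERR_PTR(-EAGAIN) version would be something like this
(untested sketch; find_vmap_area_try() is a made-up name, not an
existing function):

        struct vmap_area *find_vmap_area_try(unsigned long addr)
        {
                struct vmap_area *va;

                /*
                 * Distinguish "addr is not a vmap area" (NULL) from
                 * "couldn't take the lock" (ERR_PTR(-EAGAIN)) so that
                 * callers can treat contention as "no opinion" rather
                 * than as a fatal failure.
                 */
                if (!spin_trylock(&vmap_area_lock))
                        return ERR_PTR(-EAGAIN);
                va = __find_vmap_area(addr, &vmap_area_root);
                spin_unlock(&vmap_area_lock);

                return va;
        }

and the caller in check_heap_object() becomes:

                area = find_vmap_area_try(addr);
                /* Contended: stay silent rather than falsely abort. */
                if (area == ERR_PTR(-EAGAIN))
                        return;
                if (!area)
                        usercopy_abort("vmalloc", "no area", to_user,
                                       0, n);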
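
And for completeness, "delay it even further with RCU" would look
roughly like this.  Again only a sketch under stated assumptions:
struct vmap_area grows an rcu_head, and the lookup side would still
need rcu_read_lock() plus a tree that is actually safe to walk
locklessly, which is the hard part:

        /* Assumes a new member in struct vmap_area: */
        struct rcu_head rcu;

        static void vmap_area_free_rcu(struct rcu_head *head)
        {
                struct vmap_area *va =
                        container_of(head, struct vmap_area, rcu);

                kmem_cache_free(vmap_area_cachep, va);
        }

        /*
         * ... and at the free sites, instead of calling
         * kmem_cache_free(vmap_area_cachep, va) directly:
         */
        call_rcu(&va->rcu, vmap_area_free_rcu);

Note that this only keeps the struct vmap_area itself from being
reused under a racing reader; it does nothing for the pages behind
the mapping, so it narrows the window without fixing the caller's
bug, which is the trade-off I mean above.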