On Mon, May 21, 2018 at 4:32 PM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
> On 05/21/2018 04:16 PM, Daniel Colascione wrote:
> > On Mon, May 21, 2018 at 4:02 PM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
> >
> >> On 05/21/2018 03:54 PM, Daniel Colascione wrote:
> >>>> There are also certainly denial-of-service concerns if you allow
> >>>> arbitrary numbers of VMAs. The rbtree, for instance, is O(log(n)), but
> >>>> I'd be willing to bet there are plenty of things that fall over if you
> >>>> let the ~65k limit get 10x or 100x larger.
> >>> Sure. I'm receptive to the idea of having *some* VMA limit. I just think
> >>> it's unacceptable to let deallocation routines fail.
> >> If you have a resource limit and deallocation consumes resources, you
> >> *eventually* have to fail a deallocation. Right?
> > That's why robust software sets aside at allocation time whatever resources
> > are needed to make forward progress at deallocation time.
> I think there's still a potential dead-end here. "Deallocation" does
> not always free resources.

Sure, but the general principle applies: reserve resources where you *can* fail so that you don't fail where you can't fail.

> > That's what I'm trying to propose here, essentially: if we specify
> > the VMA limit in terms of pages and not the number of VMAs, we've
> > effectively "budgeted" for the worst case of VMA splitting, since in
> > the worst case, you end up with one page per VMA.
> Not a bad idea, but it's not really how we allocate VMAs today. You
> would somehow need per-process (mm?) slabs. Such a scheme would
> probably, on average, waste half of a page per mm.
> > Done this way, we still prevent runaway VMA tree growth, but we can also
> > make sure that anyone who's successfully called mmap can successfully call
> > munmap.
> I'd be curious how this works out, but I bet you end up reserving a lot
> more resources than people want.

I'm not sure. We're talking about two separate goals, I think. Goal #1 is preventing the VMA tree from becoming so large that we effectively DoS the system. Goal #2 is ensuring that the munmap path can't fail. Right now, the system only achieves goal #1.

All we have to do to continue to achieve goal #1 is impose *some* sanity limit on the VMA count, right? It doesn't really matter whether the limit is specified in pages or in number of VMAs, so long as it's larger than most applications will need but smaller than the DoS threshold. The resource we're allocating at mmap time isn't really bytes of struct-vm_area_struct-backing-storage, but a sort of virtual anti-DoS credit. Right now, these anti-DoS credits are denominated in number of VMAs, but if we changed the denomination to page counts instead, we'd still achieve goal #1 while avoiding the munmap-failing-with-ENOMEM weirdness.

Granted, if we make only this change, munmap's internal allocations can *still* fail if the actual vm_area_struct allocation fails, but I think the default kernel OOM-killer strategy suffices for handling that kind of global extreme-memory-pressure situation. All we have to do is change the *limit check* during VMA creation, not the actual allocation strategy.

Another way of looking at it: Linux is usually configured to overcommit with respect to *commit charge*. That behavior is well known and widely understood. What the VMA limit does is effectively overcommit with respect to *address space*, which is weird and surprising, because we normally think of address space as being strictly accounted.
If we can easily and cheaply make address space actually strictly accounted, why not give it a shot?

Goal #2 is interesting as well, and I think it's what your slab-allocation proposal would help address. If we literally set aside memory for all possible VMAs, we'd ensure that internal allocations on the munmap path could never fail. In the abstract, I'd like that (I'm a fan of strict commit accounting generally), but I don't think it's necessary for fixing the problem that motivated this thread.
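
To make the "change only the limit check" part concrete, here's roughly the kind of thing I have in mind. It's a strawman, not a patch: sysctl_max_map_pages is a made-up knob, and I'm leaning on mm->total_vm only because it's already maintained in units of pages. The spot it would stand in for is the existing map_count comparison in the mmap path:

        /*
         * Strawman: charge the anti-DoS credit in pages at mmap time,
         * roughly where today we do
         *
         *      if (mm->map_count > sysctl_max_map_count)
         *              return -ENOMEM;
         *
         * sysctl_max_map_pages is hypothetical.  Because the worst case
         * of VMA splitting is one VMA per page, a later munmap that
         * splits a VMA can never need credit we didn't already charge
         * for here.  (len is page-aligned by this point, so shifting by
         * PAGE_SHIFT gives the page count being added.)
         */
        if (mm->total_vm + (len >> PAGE_SHIFT) > sysctl_max_map_pages)
                return -ENOMEM;

The vm_area_struct allocation itself would be unchanged; only the denomination of the sanity limit moves from VMA count to pages, which is why the munmap path would never have to fail for exceeding the limit.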