On Mon, May 21, 2018 at 4:32 PM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
> On 05/21/2018 04:16 PM, Daniel Colascione wrote:
> > On Mon, May 21, 2018 at 4:02 PM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
> >
> >> On 05/21/2018 03:54 PM, Daniel Colascione wrote:
> >>>> There are also certainly denial-of-service concerns if you allow
> >>>> arbitrary numbers of VMAs. The rbtree, for instance, is O(log(n)), but
> >>>> I'd be willing to bet there are plenty of things that fall over if you
> >>>> let the ~65k limit get 10x or 100x larger.
> >>> Sure. I'm receptive to the idea of having *some* VMA limit. I just think
> >>> it's unacceptable to let deallocation routines fail.
> >> If you have a resource limit and deallocation consumes resources, you
> >> *eventually* have to fail a deallocation. Right?
> > That's why robust software sets aside at allocation time whatever resources
> > are needed to make forward progress at deallocation time.
> I think there's still a potential dead-end here. "Deallocation" does
> not always free resources.

Sure, but the general principle applies: reserve resources where you *can* fail so that you don't fail where you can't fail.

> > That's what I'm trying to propose here, essentially: if we specify
> > the VMA limit in terms of pages and not the number of VMAs, we've
> > effectively "budgeted" for the worst case of VMA splitting, since in
> > the worst case, you end up with one page per VMA.
> Not a bad idea, but it's not really how we allocate VMAs today. You
> would somehow need per-process (mm?) slabs. Such a scheme would
> probably, on average, waste half of a page per mm.
> > Done this way, we still prevent runaway VMA tree growth, but we can also
> > make sure that anyone who's successfully called mmap can successfully call
> > munmap.
> I'd be curious how this works out, but I bet you end up reserving a lot
> more resources than people want.

I'm not sure. We're talking about two separate goals, I think. Goal #1 is preventing the VMA tree from becoming so large that we effectively DoS the system. Goal #2 is ensuring that the munmap path can't fail. Right now, the system only achieves goal #1.

All we have to do to continue to achieve goal #1 is impose *some* sanity limit on the VMA count, right? It doesn't really matter whether the limit is specified in pages or in number of VMAs, so long as it's larger than most applications will need but smaller than the DoS threshold. The resource we're allocating at mmap time isn't really bytes of struct-vm_area_struct-backing-storage, but a sort of virtual anti-DoS credit. Right now, these anti-DoS credits are denominated in number of VMAs, but if we changed the denomination to page counts instead, we'd still achieve goal #1 while avoiding the munmap-failing-with-ENOMEM weirdness.

Granted, if we make only this change, munmap's internal allocations can *still* fail if the actual vm_area_struct allocation fails, but I think the default kernel OOM-killer strategy suffices for handling that kind of global extreme-memory-pressure situation. All we have to do is change the *limit check* during VMA creation, not the actual allocation strategy.

Another way of looking at it: Linux is usually configured to overcommit with respect to *commit charge*. That behavior is well known and widely understood. What the VMA limit does is effectively overcommit with respect to *address space*, which is weird and surprising, because we normally think of address space as being strictly accounted.
If we can easily and cheaply make address space actually strictly accounted, why not give it a shot?

Goal #2 is interesting as well, and I think it's what your slab-allocation proposal would help address. If we literally set aside memory for all possible VMAs, we'd ensure that internal allocations on the munmap path could never fail. In the abstract, I'd like that (I'm a fan of strict commit accounting generally), but I don't think it's necessary for fixing the problem that motivated this thread.
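
To make the "change only the limit check" part concrete, here's roughly the kind of thing I have in mind. It's a strawman, not a patch: sysctl_max_map_pages is a made-up knob, and I'm leaning on mm->total_vm only because it's already maintained in units of pages. The spot it would stand in for is the existing map_count comparison in the mmap path:

        /*
         * Strawman: charge the anti-DoS credit in pages at mmap time,
         * roughly where today we do
         *
         *      if (mm->map_count > sysctl_max_map_count)
         *              return -ENOMEM;
         *
         * sysctl_max_map_pages is hypothetical.  Because the worst case
         * of VMA splitting is one VMA per page, a later munmap that
         * splits a VMA can never need credit we didn't already charge
         * for here.  (len is page-aligned by this point, so shifting by
         * PAGE_SHIFT gives the page count being added.)
         */
        if (mm->total_vm + (len >> PAGE_SHIFT) > sysctl_max_map_pages)
                return -ENOMEM;

The vm_area_struct allocation itself would be unchanged; only the denomination of the sanity limit moves from VMA count to pages, which is why the munmap path would never have to fail for exceeding the limit.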