Re: [PATCH 00 of 41] Transparent Hugepage Support #17

Nick Piggin <npiggin@xxxxxxx> · Mon, 12 Apr 2010 16:48:51 +1000

On Mon, Apr 12, 2010 at 09:18:56AM +0300, Pekka Enberg wrote:
> On Mon, Apr 12, 2010 at 9:09 AM, Nick Piggin <npiggin@xxxxxxx> wrote:
> >> I think Andrea and Mel and you demonstrated that while defrag is futile in
> >> theory (we can always fill up all of RAM with dentries and there's no 2MB
> >> allocation possible), it seems rather usable in practice.
> >
> > One problem is that you need to keep a lot more memory free in order
> > for it to be reasonably effective. Another thing is that the problem
> > of fragmentation breakdown is not just a one-shot event that fills
> > memory with pinned objects. It is a slow degradation.
> >
> > Especially when you use something like SLUB as the memory allocator
> > which requires higher order allocations for objects which are pinned
> > in kernel memory.
> 
> I guess we'd need to merge the SLUB defragmentation patches to fix that?

No, that's a different problem. And SLUB 'defragmentation' isn't really
defragmentation, it is just selective reclaim.

Reclaimable slab memory allocations are not the problem. The problem is
the ones that you can't reclaim. It goes like this:

- Memory gets fragmented by allocation of pinned pages within larger
  ranges so that we cannot allocate that large range.

- Anti-frag improves this by putting pinned pages in some ranges and
  unpinned pages in others. So the ranges of unpinned pages can get
  reclaimed and used as a larger range.

- However there is still an underlying problem of pinned pages causing
  fragmentation within their ranges.

- Especially if you require higher order allocations for pinned pages,
  you will end up with your pinned ranges becoming fragmented and
  unable to satisfy the higher order allocation. So you must expand your
  pinned ranges into unpinned ones.
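
To make the first two points concrete, here is a toy model, not kernel
code. It assumes 4K pages and 2MB ranges (512 pages per range), and the
function names are made up for illustration. It counts how many ranges
could be freed whole by reclaiming every unpinned page, with and without
grouping the pinned pages together:

```python
RANGE_PAGES = 512  # assumption: 4K pages per 2MB range


def reclaimable_ranges(pinned_pages, num_ranges):
    """Count ranges that hold no pinned page, i.e. ranges that could be
    freed whole by reclaiming every unpinned page in them."""
    dirty = {page // RANGE_PAGES for page in pinned_pages}
    return num_ranges - len(dirty)


def scattered(num_pinned, num_ranges):
    """No grouping: pinned pages land in each range in turn."""
    return [(i % num_ranges) * RANGE_PAGES + i // num_ranges
            for i in range(num_pinned)]


def grouped(num_pinned):
    """Anti-frag style grouping: pinned pages packed into the lowest ranges."""
    return list(range(num_pinned))


NUM_RANGES = 64
NUM_PINNED = 64  # only 64 pinned pages out of 64 * 512

without_grouping = reclaimable_ranges(scattered(NUM_PINNED, NUM_RANGES), NUM_RANGES)
with_grouping = reclaimable_ranges(grouped(NUM_PINNED), NUM_RANGES)
```

With the same 64 pinned pages, the scattered layout leaves no range
that can be freed whole, while the grouped layout leaves all but one.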

If you only do 4K slab allocations, then things get better. However, it
can of course still break down if the pinned allocation requirement
grows large. It's really hard to control this because it includes
anything from open files to radix tree nodes to page tables and anything
that any driver or subsystem allocates with kmalloc.
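
A rough sketch of why order-0 allocations cope better inside an already
fragmented pinned range. Again this is a toy model with hypothetical
names, assuming a buddy-style allocator that needs a naturally aligned
run of 2**order free pages:

```python
def can_allocate(free, order):
    """True if `free` (one bool per page, True meaning free) contains a
    naturally aligned run of 2**order free pages."""
    size = 1 << order
    return any(all(free[start:start + size])
               for start in range(0, len(free) - size + 1, size))


# A pinned range where every other page is still held by a long-lived
# object. Half of the range is free, but no two free pages are adjacent.
pinned_range = [page % 2 == 1 for page in range(512)]

free_pages = sum(pinned_range)
order0_ok = can_allocate(pinned_range, 0)
order1_ok = can_allocate(pinned_range, 1)
order3_ok = can_allocate(pinned_range, 3)
```

In this layout the order-0 request still succeeds inside the pinned
range, while any higher order request fails there despite half the
pages being free, so it would have to be satisfied from an unpinned
range instead.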

Basically, if you were going to add another level of indirection to
solve that, you may as well just go ahead and do nonlinear mappings of
the kernel memory with page tables, so you'd only have to fix up places
that require translated addresses rather than everything that touches
KVA. This would still be a big headache.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxxx  For more info on Linux MM,
see: http://www.linux-mm.org/ .