Re: [PATCH 00 of 41] Transparent Hugepage Support #17

Nick Piggin <npiggin@xxxxxxx> · Mon, 12 Apr 2010 19:26:15 +1000

On Mon, Apr 12, 2010 at 12:03:18PM +0300, Avi Kivity wrote:
> On 04/12/2010 11:28 AM, Nick Piggin wrote:
> >
> >>We use the "try" tactic extensively.  So long as there's a
> >>reasonable chance of success, and a reasonable fallback on failure,
> >>it's fine.
> >>
> >>Do you think we won't have reasonable success rates?  Why?
> >After the memory is fragmented? It's more or less irriversable. So
> >success rates (to fill a specific number of huges pages) will be fine
> >up to a point. Then it will be a continual failure.
> 
> So we get just a part of the win, not all of it.

It can degrade over time. This is the difference. Two idencial workloads
may have performance X and Y depending on whether uptime is 1 day or 20
days.

> >Sure, some workloads simply won't trigger fragmentation problems.
> >Others will.
> 
> Some workloads benefit from readahead.  Some don't.  In fact,
> readahead has a higher potential to reduce performance.
> 
> Same as with many other optimizations.

Do you see any difference with your examples and this issue?

> >>Why?  If you can isolate all the pointers into the dentry, allocate
> >>the new dentry, make the old one point into the new one, hash it,
> >>move the pointers, drop the old dentry.
> >>
> >>Difficult, yes, but insane?
> >Yes.
> 
> Well, I'll accept what you say since I'm nowhere near as familiar
> with the code.  But maybe someone insane will come along and do it.

And it'll get nacked :) And it's not only dcache that can cause a
problem. This is part of the whole reason it is insane. It is insane
to only fix the dcache, because if you accept the dcache is a problem
that needs such complexity to fix, then you must accept the same for
the inode caches, the buffer head caches, vmas, radix tree nodes, files
etc. no?

> >>Caches have statistical performance.  In the long run they average
> >>out.  In the short run they can behave badly.  Same thing with large
> >>pages, except the runs are longer and the wins are smaller.
> >You don't understand. Caches don't suddenly or slowly stop working.
> >For a particular pattern of workload, they statistically pretty much
> >work the same all the time.
> 
> Yet your effective cache size can be reduced by unhappy aliasing of
> physical pages in your working set.  It's unlikely but it can
> happen.
> 
> For a statistical mix of workloads, huge pages will also work just
> fine.  Perhaps not all of them, but most (those that don't fill
> _all_ of memory with dentries).

Like I said, you don't need to fill all memory with dentries, you
just need to be allocating higher order kernel memory and end up
fragmenting your reclaimable pools.

And it's not a statistical mix that is the problem. The problem is
that the workloads that do cause fragmentation problems will run well
for 1 day or 5 days and then degrade. And it is impossible to know
what will degrade and what won't and by how much.

I'm not saying this is a showstopper, but it does really suck.

> >>Database are the easiest case, they allocate memory up front and
> >>don't give it up.  We'll coalesce their memory immediately and
> >>they'll run happily ever after.
> >Again, you're thinking about a benchmark setup. If you've got various
> >admin things, backups, scripts running, probably web servers,
> >application servers etc. Then it's not all that simple.
> 
> These are all anonymous/pagecache loads, which we deal with well.

Huh? They also involve sockets, files, and involve all of the above
data structures I listed and many more.

> >And yes, Linux works pretty well for a multi-workload platform. You
> >might be thinking too much about virtualization where you put things
> >in sterile little boxes and take the performance hit.
> >
> 
> People do it for a reason.

The reasoning is not always sound though. And also people do other
things. Including increasingly better containers and workload
management in the single kernel.

> >>Virtualization will fragment on overcommit, but the load is all
> >>anonymous memory, so it's easy to defragment.  Very little dcache on
> >>the host.
> >If virtualization is the main worry (which it seems that it is
> >seeing as your TLB misses cost like 6 times more cachelines),
> 
> (just 2x)
> 
> >then complexity should be pushed into the hypervisor, not the
> >core kernel.
> 
> The whole point behind kvm is to reuse the Linux core.  If we have
> to reimplement Linux memory management and scheduling, then it's a
> failure.

And if you need to add complexity to the Linux core for it, it's
also a failure.

I'm not saying to reimplement things, but if you had a little bit
more support perhaps. Anyway it's just ideas, I'm not saying that
transparent hugepages is wrong simply because KVM is a big user and it
could be implemented in another way.

But if it is possible for KVM to use libhugetlb with just a bit of
support from the kernel, then it goes some way to reducing the
need for transparent hugepages.

> >>Well, I'm not against it, but that would be a much more intrusive
> >>change than what this thread is about.  Also, you'd need 4K dentries
> >>etc, no?
> >No. You'd just be defragmenting 4K worth of dentries at a time.
> >Dentries (and anything that doesn't care about untranslated KVA)
> >are trivial. Zero change for users of the code.
> 
> I see.
> 
> >This is going off-topic though, I don't want to hijack the thread
> >with talk of nonlinear kernel.
> 
> Too bad, it's interesting.

It sure is, we can start another thread.

> >>Mostly we need a way of identifying pointers into a data structure,
> >>like rmap (after all that's what makes transparent hugepages work).
> >And that involves auditing and rewriting anything that allocates
> >and pins kernel memory. It's not only dentries.
> 
> Not everything, just the major users that can scale with the amount
> of memory in the machine.

Well you need to audit, to determine if it is going to be a problem or
not, and it is more than only dentries. (but even dentries would be a
nightmare considering how widely they're used and how much they're
passed around the vfs and filesystems).

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxxx  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>