On Sat, Apr 10, 2010 at 09:47:51PM +0200, Ingo Molnar wrote:
> 
> * Avi Kivity <avi@xxxxxxxxxx> wrote:
> 
> > > I think what would be needed is some non-virtualization speedup example of
> > > a 'non-special' workload, running on the native/host kernel. 'sort' is an
> > > interesting usecase - could it be patched to use hugepages if it has to
> > > sort through lots of data?
> > 
> > In fact it works well unpatched, the 6% I measured was with the system sort.
> 
> Yes - but you intentionally sorted something large - the question is, how big
> is the slowdown with small sizes (if there's a slowdown), where is the
> break-even point (if any)?

The only chance of a slowdown is if try_to_compact_pages or
try_to_free_pages takes longer and runs more frequently for order 9
allocations than try_to_free_pages would for an order 0 allocation. That
is only a problem for short-lived, frequent allocations, and only if
memory compaction fails to provide some hugepage (as it'll run multiple
times even when not needed, which is what the future exponential backoff
logic is about). This is why I recommended running any "real life DB"
benchmark with transparent_hugepage/defrag set to both "always" and
"never". "never" will practically make any slowdown impossible to
measure.

The only other case with a potential for a minor slowdown compared to 4k
pages is COW: the 2M copy will trash the cache and we need it to use
non-temporal stores, but even that will be offset by the boost in TLB
terms, saving memory accesses in the ptes. Which is my reason for
avoiding any optimistic prefault and only going huge when we get the TLB
benefit in return (not just the pagefault speedup; the pagefault speedup
is a double-edged sword, it trashes more caches, so you need more than
that for it to be worth it).

> Would be nice to try because there's a lot of transformations within Gimp -
> and Gimp can be scripted. It's also a test for negatives: if there is an
> across-the-board _lack_ of speedups, it shows that it's not really general
> purpose but more specialistic.
> 
> If the optimization is specialistic, then that's somewhat of an argument
> against automatic/transparent handling. (even though even if the beneficiaries
> turn out to be only special workloads then transparency still has advantages.)
> 
> > I thought ray tracers with large scenes should show a nice speedup, but
> > setting this up is beyond my capabilities.
> 
> Oh, this tickled some memories: x264 compressed encoding can be very cache and
> TLB intense. Something like the encoding of a 350 MB video file:
> 
> wget http://media.xiph.org/video/derf/y4m/soccer_4cif.y4m # NOTE: 350 MB!
> x264 --crf 20 --quiet soccer_4cif.y4m -o /dev/null --threads 4
> 
> would be another thing worth trying with transparent-hugetlb enabled.
> 
> (i've Cc:-ed x264 benchmarking experts - in case i missed something)

It's definitely worth trying... nice idea. But we need glibc to increase
vm_end in 2M aligned chunks, otherwise we'll have to work around it in
the kernel, for short-lived allocations like gcc's to take advantage of
this. I managed to get 200M (of ~500M total) of gcc building translate.o
into hugepages with two glibc params, but I want it all in transhuge
before I measure it. I'm running it on the workstation that has had a
day and a half of uptime; it's still building more packages as I write
this and running large vfs loads in /usr and maildir.
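
For reference, glibc's malloc can already be nudged in that direction
from the environment; something along these lines is the kind of tuning
I have in mind (only a sketch using the standard MALLOC_* overrides, not
necessarily the exact two params or values of my test, and the gcc
command line is just illustrative):

  # keep big allocations on the brk heap instead of going to private mmaps
  export MALLOC_MMAP_THRESHOLD_=$((512*1024*1024))
  # grow and trim the heap in 2M steps so vm_end tends to stay 2M aligned
  export MALLOC_TOP_PAD_=$((2*1024*1024))
  export MALLOC_TRIM_THRESHOLD_=$((2*1024*1024))

  # then rebuild and check how much anon memory actually went huge
  gcc -c translate.c -o translate.o
  grep AnonHugePages /proc/meminfo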
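
And to spell out the defrag comparison I recommended above (again only a
sketch: it assumes the sysfs layout of the transparent hugepage patchset
under /sys/kernel/mm/transparent_hugepage/, the AnonHugePages counter
from the same patchset, and a made-up benchmark command):

  # fault-time defrag on: compaction/reclaim may run for every order 9 fault
  echo always > /sys/kernel/mm/transparent_hugepage/enabled
  echo always > /sys/kernel/mm/transparent_hugepage/defrag
  ./run-db-benchmark            # placeholder for the real DB workload

  # fault-time defrag off: hugepages are used only when trivially available,
  # so any allocation-time slowdown should become unmeasurable
  echo never > /sys/kernel/mm/transparent_hugepage/defrag
  ./run-db-benchmark

  # check how much anon memory ended up in hugepages during the run
  grep AnonHugePages /proc/meminfo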