On Mon, Apr 12, 2010 at 05:21:44PM +1000, Nick Piggin wrote:
> On Mon, Apr 12, 2010 at 09:08:11AM +0200, Andrea Arcangeli wrote:
> > On Mon, Apr 12, 2010 at 04:09:31PM +1000, Nick Piggin wrote:
> > > One problem is that you need to keep a lot more memory free in order
> > > for it to be reasonably effective. Another thing is that the problem
> > > of fragmentation breakdown is not just a one-shot event that fills
> > > memory with pinned objects. It is a slow degradation.
> >
> > set_recommended_min_free_kbytes seems to not be in function of ram
> > size, 60MB aren't such a big deal.
> >
> > > Especially when you use something like SLUB as the memory allocator
> > > which requires higher order allocations for objects which are pinned
> > > in kernel memory.
> > >
> > > Just running a few minutes of testing with a kernel compile in the
> > > background does not show the full picture. You really need a box that
> > > has been up for days running a proper workload before you are likely
> > > to see any breakdown.
> > >
> > > I'm sure it's horrible for planning if the RDBMS or VM boxes gradually
> > > get slower after X days of uptime. It's better to have consistent
> > > performance really, for anything except pure benchmark setups.
> >
> > All data I provided is very real, in addition to building a ton of
> > packages and running emerge on /usr/portage I've been running all my
> > real loads. Only problem I only run it for 1 day and half, but the
> > load I kept it under was significant (surely a lot bigger inode/dentry
> > load than any hypervisor usage would ever generate).
>
> OK, but as a solution for some kind of very specific and highly
> optimized application already like RDBMS, HPC, hypervisor or JVM,
> they could just be using hugepages themselves, couldn't they?
>
> It seems more interesting as a more general speedup for applications
> that can't afford such optimizations? (eg. the common case for
> most people)

The reality is that very few are using hugetlbfs. I guess maybe 0.1% of KVM instances on Phenom/Nehalem chips are running on hugetlbfs, for example (the hugetlbfs boot reservation doesn't fit the cloud, where you'd need all RAM available in hugetlbfs while still needing 100% of the unused RAM as host pagecache for VDI), even though it would provide a >=6% boost to every VM no matter what's running in the guest. Same goes for the JVM: maybe 0.1% of those run on hugetlbfs. The commercial DBMSs are the exception, and they're probably closer to 99% running on hugetlbfs (and they'll have to keep using hugetlbfs until we move transparent hugepages into tmpfs).

So there's a ton of wasted energy in my view. Like Ingo said, the faster they make the chips and the cheaper the RAM becomes, the more energy gets wasted as a result of not using hugetlbfs. The difference between cache sizes and RAM sizes keeps growing, and so does the difference between cache speeds and RAM speeds. I don't see this trend ending, and I'm not sure what better CPU could ever make hugetlbfs worthless and unselectable at kernel configure time on the x86 arch (if you build without generic CPU support). And I don't think it's feasible to ship a distro where 99% of the apps that can benefit from hugepages run with LD_PRELOAD=libhugetlbfs.so. It has to be transparent if we want to stop the waste.

The main reason I've always been skeptical about transparent hugepages, before I started working on this, is the mess they generate across the whole kernel. So my priority of course has been to keep the code as self-contained as possible.
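To make the explicit-vs-transparent contrast above concrete, this is roughly all the difference amounts to from userland (just an illustrative sketch I'm adding here, untested and not part of the patchset; the MAP_HUGETLB fallback define is only there in case the libc headers don't carry it):

#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000	/* x86 value, in case the libc headers lack it */
#endif

#define SZ_2M (2UL * 1024 * 1024)

int main(void)
{
	/* explicit route: fails unless huge pages were reserved by the admin */
	void *huge = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (huge == MAP_FAILED)
		perror("mmap(MAP_HUGETLB)");	/* the common case on an unconfigured host */
	else
		munmap(huge, SZ_2M);

	/* transparent route: a plain anonymous mapping, no application change;
	 * with enabled=always the kernel is free to back it with 2M pages */
	void *anon = mmap(NULL, SZ_2M, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (anon == MAP_FAILED)
		return 1;
	((char *)anon)[0] = 1;	/* touch it so memory actually gets faulted in */
	munmap(anon, SZ_2M);
	return 0;
}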
The hugepage support kept spilling over and over until I managed to confine it to anonymous pages and fix whole mm/*.c files with just a one-liner (even the hugepage-aware implementation that Johannes did still takes advantage of split_huge_page_pmd if the mprotect start/end isn't naturally 2M aligned, just to show how complex it would be to do it all at once). This will allow us to reach a solid base, and then later move to tmpfs and maybe later to pagecache and swapcache too. Expecting the whole kernel to become hugepage aware at once is a total mess: gup would need to return only head pages, for example, and just that change would break hundreds of drivers. The compound_lock can be removed after all those hundreds of drivers and subsystems using gup are fixed... No big deal to remove it later, much like the big kernel lock is being removed these days, 14 years after it was introduced.

Plus I did all I could to keep it as black and white as possible. I think other OSes are more gray in their approach: my priority has been to pay in RAM anywhere I could when you set enabled=always, and to reduce as much as possible any risk of performance regressions in any workload. These days we can afford to lose 1G without much worry if it speeds up the workload by 8%, so I think the other designs are better suited to old, RAM-constrained hardware and aren't very relevant anymore. On embedded, with my patchset, one should set enabled=madvise. Ingo suggested a per-process tweak to enable it selectively on certain apps; that is feasible too in the future (so people won't be forced to modify binaries to add madvise if they can't leave enabled=always).

> Yes we do have the option to reserve pages and as far as I know it
> should work, although I can't remember whether it deals with mlock.

I think that is the right route for whoever needs the math guarantees, and for many products enforcing the math guarantee won't even be noticeable. It's like overcommit: some prefer the overcommit_memory=2 mode and maybe they don't even notice it allows them to allocate less memory. Others prefer to be able to allocate RAM without accounting for the unused virtual regions, despite the bigger chance of running into the OOM killer (and I'm in the latter camp for both the overcommit sysctl and kernelcore= ;).
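And going back to the enabled=madvise point above, just to show how little an app has to do to opt a mapping in, something along these lines is enough (again only an untested sketch; MADV_HUGEPAGE is the hint added by the patchset, so the define stays until the libc headers pick it up):

#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* hint added by the transparent hugepage patchset */
#endif

#define SZ_4M (4UL * 1024 * 1024)

int main(void)
{
	void *p = mmap(NULL, SZ_4M, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	/* ask for hugepages on this range; only the naturally 2M aligned
	 * portions inside it can actually be backed by a huge pmd */
	if (madvise(p, SZ_4M, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");

	((char *)p)[0] = 1;	/* fault something in */
	munmap(p, SZ_4M);
	return 0;
}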