On Mon, Apr 12, 2010 at 10:45:16AM +0300, Avi Kivity wrote:
> On 04/12/2010 10:15 AM, Nick Piggin wrote:
> >
> >>>Another thing is that the problem of fragmentation breakdown is not
> >>>just a one-shot event that fills memory with pinned objects. It is a
> >>>slow degradation.
> >>>
> >>>Especially when you use something like SLUB as the memory allocator,
> >>>which requires higher-order allocations for objects which are pinned
> >>>in kernel memory.
> >>Won't the usual antifrag tactics apply? Try to allocate those
> >>objects from the same block.
> >"try" is the key point.
>
> We use the "try" tactic extensively. So long as there's a
> reasonable chance of success, and a reasonable fallback on failure,
> it's fine.
>
> Do you think we won't have reasonable success rates? Why?

After the memory is fragmented? It's more or less irreversible. So
success rates (to fill a specific number of huge pages) will be fine up
to a point. Then it will be a continual failure.

Sure, some workloads simply won't trigger fragmentation problems.
Others will.

> >>>Just running a few minutes of testing with a kernel compile in the
> >>>background does not show the full picture. You really need a box that
> >>>has been up for days running a proper workload before you are likely
> >>>to see any breakdown.
> >>I'm sure we'll be able to generate worst-case scenarios. I'm also
> >>reasonably sure we'll be able to deal with them. I hope we won't
> >>need to, but it's even possible to move dentries around.
> >Pinned dentries? (which are the problem) That would be insane.
>
> Why? If you can isolate all the pointers into the dentry, allocate
> the new dentry, make the old one point into the new one, hash it,
> move the pointers, drop the old dentry.
>
> Difficult, yes, but insane?

Yes.

> >>>I'm sure it's horrible for planning if the RDBMS or VM boxes gradually
> >>>get slower after X days of uptime. It's better to have consistent
> >>>performance really, for anything except pure benchmark setups.
> >>If that were the case we'd disable caches everywhere. General
> >No we wouldn't. You can have consistent, predictable performance with
> >caches.
>
> Caches have statistical performance. In the long run they average
> out. In the short run they can behave badly. Same thing with large
> pages, except the runs are longer and the wins are smaller.

You don't understand. Caches don't suddenly or slowly stop working.
For a particular pattern of workload, they statistically pretty much
work the same all the time.

> >>purpose computing is a best effort thing, we try to be fast on the
> >>common case but we'll be slow on the uncommon case. Access to a bit
> >Sure. And the common case for production systems like VM or database
> >servers that are up for hundreds of days is when they are running
> >with a lot of uptime. Common case is not a fresh reboot into a 3 hour
> >benchmark setup.
>
> Databases are the easiest case: they allocate memory up front and
> don't give it up. We'll coalesce their memory immediately and
> they'll run happily ever after.

Again, you're thinking about a benchmark setup. If you've got various
admin things, backups and scripts running, probably web servers,
application servers etc., then it's not all that simple.

And yes, Linux works pretty well as a multi-workload platform. You
might be thinking too much about virtualization, where you put things
in sterile little boxes and take the performance hit.

> Virtualization will fragment on overcommit, but the load is all
> anonymous memory, so it's easy to defragment. Very little dcache on
> the host.

If virtualization is the main worry (which it seems that it is, seeing
as your TLB misses cost something like 6 times more cachelines), then
the complexity should be pushed into the hypervisor, not the core
kernel.

> >>Non-linear kernel mapping moves the small page problem from
> >>userspace back to the kernel, a really unhappy solution.
> >Not unhappy for userspace intensive workloads. And user working sets
> >I'm sure are growing faster than kernel working sets. Also there
> >would be nothing against compacting and merging kernel memory into
> >larger pages.
>
> Well, I'm not against it, but that would be a much more intrusive
> change than what this thread is about. Also, you'd need 4K dentries
> etc, no?

No. You'd just be defragmenting 4K worth of dentries at a time.
Dentries (and anything that doesn't care about untranslated KVA) are
trivial. Zero change for users of the code.

This is going off-topic though; I don't want to hijack the thread with
talk of the nonlinear kernel.

> >>Very large (object count, not object size) kernel caches can be
> >>addressed by compacting them, but I hope we won't need to do that.
> >You can't say that fragmentation is not a fundamental problem. And
> >adding things like indirect pointers or weird crap that adds
> >complexity to code that deals with KVA is IMO not acceptable. So you
> >can't just assert that you can "address" the problem.
>
> Mostly we need a way of identifying pointers into a data structure,
> like rmap (after all that's what makes transparent hugepages work).

And that involves auditing and rewriting anything that allocates and
pins kernel memory. It's not only dentries.
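
To make concrete what "identifying pointers into a data structure" and
moving a pinned object would involve, here is a rough user-space sketch
of the bookkeeping: every external pointer into an object is registered
in an rmap-like backref list, and migration means allocating a copy,
rehashing it, retargeting every registered pointer, then dropping the
original. All the names (obj, backref, migrate, etc.) are made up for
illustration; this is not dcache code, and it ignores the locking
against concurrent lookups, which is where the actual difficulty lies.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct obj {
	char name[32];
	struct obj *hash_next;		/* toy hash-chain link */
};

#define NBUCKETS 16
static struct obj *hashtab[NBUCKETS];

/* rmap-style record of one external location that points at an obj */
struct backref {
	struct obj **slot;
	struct backref *next;
};

static unsigned int hashfn(const char *name)
{
	unsigned int h = 0;

	while (*name)
		h = h * 31 + (unsigned char)*name++;
	return h % NBUCKETS;
}

static void hash_insert(struct obj *o)
{
	unsigned int b = hashfn(o->name);

	o->hash_next = hashtab[b];
	hashtab[b] = o;
}

static void hash_remove(struct obj *o)
{
	struct obj **pp = &hashtab[hashfn(o->name)];

	while (*pp && *pp != o)
		pp = &(*pp)->hash_next;
	if (*pp)
		*pp = o->hash_next;
}

/*
 * Move a pinned object: copy it, rehash the copy, retarget every
 * registered pointer, then free the original.  A real version would
 * have to lock out concurrent lookups for the duration.
 */
static struct obj *migrate(struct obj *old, struct backref *refs)
{
	struct obj *new = malloc(sizeof(*new));
	struct backref *r;

	if (!new)
		return old;	/* can't move it; it stays pinned where it is */
	memcpy(new, old, sizeof(*new));

	hash_remove(old);
	hash_insert(new);

	for (r = refs; r; r = r->next)
		if (*r->slot == old)
			*r->slot = new;

	free(old);
	return new;
}

int main(void)
{
	struct obj *d = malloc(sizeof(*d));
	struct obj *holder;		/* stands in for an external pinning reference */
	struct backref ref;

	strcpy(d->name, "pinned-entry");
	hash_insert(d);

	holder = d;
	ref.slot = &holder;
	ref.next = NULL;

	d = migrate(d, &ref);
	printf("reference now follows the object: %s\n", holder->name);
	return 0;
}

Even in this toy form, the point stands that every place which stores a
long-lived pointer into the object has to be found and registered,
which is exactly the auditing burden described above.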