On Mon, Apr 12, 2010 at 12:03:18PM +0300, Avi Kivity wrote: > On 04/12/2010 11:28 AM, Nick Piggin wrote: > > > >>We use the "try" tactic extensively. So long as there's a > >>reasonable chance of success, and a reasonable fallback on failure, > >>it's fine. > >> > >>Do you think we won't have reasonable success rates? Why? > >After the memory is fragmented? It's more or less irriversable. So > >success rates (to fill a specific number of huges pages) will be fine > >up to a point. Then it will be a continual failure. > > So we get just a part of the win, not all of it. It can degrade over time. This is the difference. Two idencial workloads may have performance X and Y depending on whether uptime is 1 day or 20 days. > >Sure, some workloads simply won't trigger fragmentation problems. > >Others will. > > Some workloads benefit from readahead. Some don't. In fact, > readahead has a higher potential to reduce performance. > > Same as with many other optimizations. Do you see any difference with your examples and this issue? > >>Why? If you can isolate all the pointers into the dentry, allocate > >>the new dentry, make the old one point into the new one, hash it, > >>move the pointers, drop the old dentry. > >> > >>Difficult, yes, but insane? > >Yes. > > Well, I'll accept what you say since I'm nowhere near as familiar > with the code. But maybe someone insane will come along and do it. And it'll get nacked :) And it's not only dcache that can cause a problem. This is part of the whole reason it is insane. It is insane to only fix the dcache, because if you accept the dcache is a problem that needs such complexity to fix, then you must accept the same for the inode caches, the buffer head caches, vmas, radix tree nodes, files etc. no? > >>Caches have statistical performance. In the long run they average > >>out. In the short run they can behave badly. Same thing with large > >>pages, except the runs are longer and the wins are smaller. > >You don't understand. Caches don't suddenly or slowly stop working. > >For a particular pattern of workload, they statistically pretty much > >work the same all the time. > > Yet your effective cache size can be reduced by unhappy aliasing of > physical pages in your working set. It's unlikely but it can > happen. > > For a statistical mix of workloads, huge pages will also work just > fine. Perhaps not all of them, but most (those that don't fill > _all_ of memory with dentries). Like I said, you don't need to fill all memory with dentries, you just need to be allocating higher order kernel memory and end up fragmenting your reclaimable pools. And it's not a statistical mix that is the problem. The problem is that the workloads that do cause fragmentation problems will run well for 1 day or 5 days and then degrade. And it is impossible to know what will degrade and what won't and by how much. I'm not saying this is a showstopper, but it does really suck. > >>Database are the easiest case, they allocate memory up front and > >>don't give it up. We'll coalesce their memory immediately and > >>they'll run happily ever after. > >Again, you're thinking about a benchmark setup. If you've got various > >admin things, backups, scripts running, probably web servers, > >application servers etc. Then it's not all that simple. > > These are all anonymous/pagecache loads, which we deal with well. Huh? They also involve sockets, files, and involve all of the above data structures I listed and many more. > >And yes, Linux works pretty well for a multi-workload platform. You > >might be thinking too much about virtualization where you put things > >in sterile little boxes and take the performance hit. > > > > People do it for a reason. The reasoning is not always sound though. And also people do other things. Including increasingly better containers and workload management in the single kernel. > >>Virtualization will fragment on overcommit, but the load is all > >>anonymous memory, so it's easy to defragment. Very little dcache on > >>the host. > >If virtualization is the main worry (which it seems that it is > >seeing as your TLB misses cost like 6 times more cachelines), > > (just 2x) > > >then complexity should be pushed into the hypervisor, not the > >core kernel. > > The whole point behind kvm is to reuse the Linux core. If we have > to reimplement Linux memory management and scheduling, then it's a > failure. And if you need to add complexity to the Linux core for it, it's also a failure. I'm not saying to reimplement things, but if you had a little bit more support perhaps. Anyway it's just ideas, I'm not saying that transparent hugepages is wrong simply because KVM is a big user and it could be implemented in another way. But if it is possible for KVM to use libhugetlb with just a bit of support from the kernel, then it goes some way to reducing the need for transparent hugepages. > >>Well, I'm not against it, but that would be a much more intrusive > >>change than what this thread is about. Also, you'd need 4K dentries > >>etc, no? > >No. You'd just be defragmenting 4K worth of dentries at a time. > >Dentries (and anything that doesn't care about untranslated KVA) > >are trivial. Zero change for users of the code. > > I see. > > >This is going off-topic though, I don't want to hijack the thread > >with talk of nonlinear kernel. > > Too bad, it's interesting. It sure is, we can start another thread. > >>Mostly we need a way of identifying pointers into a data structure, > >>like rmap (after all that's what makes transparent hugepages work). > >And that involves auditing and rewriting anything that allocates > >and pins kernel memory. It's not only dentries. > > Not everything, just the major users that can scale with the amount > of memory in the machine. Well you need to audit, to determine if it is going to be a problem or not, and it is more than only dentries. (but even dentries would be a nightmare considering how widely they're used and how much they're passed around the vfs and filesystems). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxxx For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>