On Mon, Jan 18, 2016 at 01:10:42AM -0800, Hugh Dickins wrote: > Puhleese. Of course we have a bug with respect to KSM pages in > migrate_pages(): I already said as much, though I used the example > of mbind(), that being the one you had mentioned. Just do something > like mremap() does, a temporary ksm_madvise(,,,MADV_UNMERGEABLE,). > > Or are you suggesting that when MPOL_MF_MOVE_ALL meets a KSM page, > it is actually correct to move every page that shares the same data > to the node of this process's choice? So you think it's wrong that the migrate_pages() syscall can move KSM pages too? If yes, why don't you simply add a check for PageKSM in migrate_pages? Forbidding all KSM pages to be migrated would be a simple and black and white fix for it. I mean either we allow all KSM pages to be migrated, or none. I don't like that we allow some to be migrated if it looks like it will take less than 60seconds to do so... rmap_walks always were intended to be atomic. Even assuming migrate_pages should be fixed to prevent KSM pages to be migrated, for me that was not a bug but a feature. It allowed me to code the simplest reproducer to show how long it takes to do a rmap_walk with a KSM page with high sharing. Before I wrote this testcase it wasn't trivial at all to measure or even reproduce the hang no matter what you throw it at it. In fact the hang takes normally days or weeks to reproduce, but when they hit systems go down. The testcase I posted simplifies the reproduction of the VM badness tremendously. Page migration ultimately was my concern. Just as things stands today even the migrate_pages hammer didn't look entirely safe (no matter if the user is an artificial test like mine that calls migrate_pages for whatever reason, or simply a benchmark) but that's certainly not the primary concern. > MPOL_MF_MOVE_ALL was introduced some while before KSM: it makes sense > for anon pages, and for file pages, but it was mere oversight that we > did not stop it from behaving this way on KSM pages. > > Even sillier now that we have Petr's merge_across_nodes settable to 0. The merge_across_nodes set to 0 must not break either when migrate_pages moves a KSM page to a different node. Nothing wrong with that, in fact it's a good stress test. > Forget that above program, it's easily fixed (or rather, the kernel > is easily fixed for it): please argue with a better example. I don't think I can write a simpler or smaller testcase that shows exactly how bad things can get with the rmap walks. If I abused a "feature" in migrate_pages to achieve it, that still made my life easier at reproducing the VM badness that has to be fixed somehow. > Fair enough. I'm not saying it's a big worry, just that the design > would have been more elegant and appealing without this side to it. > I wonder who's going to tune stable_node_chains_prune_millisecs :) stable_node_chains_prune_millisecs can be set to any value and it won't make much difference actually, it's a tradeoff between KSM CPU usage and a slight delay in freeing the KSM metadata, but it's not like it has a big impact on anything. This isn't metadata that amounts to significant amounts of memory nor that it will grow fast. But that being a fixed number, I made it sysfs configurable. I've a much easier time to set the max sharing limit and a deep garbage collection event every couple of millisecs, than dynamically deciding what's the magic breakpoint page_mapcount number where page_migration, try_to_unmap and page_referenced should bail out or system hangs... but if they happen to bail out too soon the VM will be in a DoS. > The VM gets slower and more complicated: the CPUs get faster and smarter. Eheh, luckily it's not that bad ;), your page_migration TLB flush batching should provide a significant boost that may allow to increase the max sharing limit already compared to the current value I set in the patch. Complexity increases yes, but VM gets faster as your patch shows. The testcase I posted is precisely the program that you should run to test the effectiveness of your page_migration improvement. Perhaps you can add a bit of printf so you won't have to use strace -tt :). > I'll think about it more. Back this being an HW issue, as far as the VM is concerned, all rmap_walks were intended to simulate having the accessed or dirty bit in the page structure (kind of s390?) not in the plenty of pagetables that maps the page. We collapse all pagetable bits in a single per-physical page bit. This is why they're atomic and we don't break the loop, and my patch just allows this model to be retained. If we start breaking the loop and in turn break the atomicity of the rmap_walks, we just increase false positive OOM risk in some unpredictable way. Thanks, Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>