Hello Mel,

On Mon, Jan 18, 2016 at 11:01:47AM +0000, Mel Gorman wrote:
> I didn't read too much into the history of this patch or the
> implementation so take anything I say with a grain of salt.
>
> In the page-migration case, the obvious beneficiary of reduced IPI counts
> is memory compaction and automatic NUMA balancing. For automatic NUMA
> balancing, if a page is heavily shared then there is a possibility that
> there is no optimal home node for the data and that no balancing should
> take place. For compaction, it depends on whether it's for a THP allocation
> or not -- heavy amounts of migration work is unlikely to be offset by THP
> usage although this is a matter of opinion.
>
> In the case of memory compaction, it would make sense to limit the length
> of the walk and abort if the mapcount is too high, particularly if the
> compaction is for a THP allocation. In the case of THP, the cost of a long
> rmap is not necessarily going to be offset by THP usage.

Agreed that it wouldn't be worth migrating a KSM page with massive
sharing just to allocate one more THP page, and that limiting the
length of the rmap_walk would avoid the hangs for compaction.

The real problem is what happens after you stop migrating KSM pages.
If you limit the length of the rmap_walk with a "magical" value (that
I would have no clue how to set), you end up with compaction being
entirely disabled, and wasting CPU as well, as soon as you have a bit
too many KSM pages in the system with a mapcount above the "magical"
value.

If compaction can be disabled, all your precious memblock work becomes
useless. After the memblocks become useless, the movable zone also
becomes useless (as all anonymous movable pages in the system can
suddenly become non-movable and go above the "magical" rmap_walk
limit). Then memory offlining stops working as well (it could take
months in the worst case?). CMA stops working too, so certain drivers
will stop working. How can it be ok if you enable KSM and CMA entirely
breaks? Because this is what happens if you stop migrating KSM pages
whenever the rmap_walk would exceed the "magical" limit.

Adding a sharing limit to KSM instead allows KSM to stay turned on at
all times at 100% CPU load, and it guarantees that nothing KSM does
could ever interfere with the other VM features. Everything is just
guaranteed to work at all times.

In fact, if we were not to add the sharing limit to KSM, I think we
would be better off considering KSM pages unmovable like slab caches
and dropping KSM migration entirely, so that KSM pages are restricted
to the unmovable memblocks. At least compaction/memory-offlining/CMA
would still have a slight chance to keep working despite KSM being
enabled.

> That said, it occurs to me that a full rmap walk is not necessary in all
> cases and may be a more obvious starting point. For page reclaim, it only
> matters if it's mapped once. page_referenced() does not register a ->done
> handler and the callers only care about a positive count. One optimisation
> for reclaim would be to return when referenced > 0 and be done with it. That
> would limit the length of the walk in *some* cases and still activate the
> page as expected.

Limiting the page_referenced() rmap walk length was my first idea as
well. We could also try to rotate the rmap_items list so the next loop
starts with a new batch of pagetables, and by retrying we would have a
chance to clear all accessed bits for all pagetables eventually.
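To make that concrete, here is a minimal userspace sketch (not kernel
code; the struct layout, the rotating cursor and the WALK_LIMIT value
are made up for illustration) of a capped page_referenced()-style walk
that returns as soon as a referenced mapping is found and remembers
where it stopped, so repeated walks rotate through all the mappings:

/*
 * Toy userspace model of the two ideas above:
 *  - stop the page_referenced()-style walk as soon as referenced > 0,
 *    since reclaim only cares about a positive count, and
 *  - remember where the walk stopped and resume there next time, so
 *    repeated walks eventually visit (and clear) every mapping.
 * Struct names and WALK_LIMIT are invented for this example.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_MAPPINGS 8   /* pretend this shared page is mapped 8 times */
#define WALK_LIMIT  3   /* the "magical" per-walk cap discussed above */

struct mapping {
        bool accessed;  /* models the pte accessed (young) bit */
};

/* All mappings of one shared page, plus a rotating cursor. */
struct shared_page {
        struct mapping map[NR_MAPPINGS];
        int cursor;     /* where the next walk starts */
};

/*
 * Walk at most WALK_LIMIT mappings, clearing accessed bits, and return
 * a positive count as soon as one referenced mapping is seen.
 */
static int page_referenced_capped(struct shared_page *page)
{
        int referenced = 0;

        for (int i = 0; i < WALK_LIMIT; i++) {
                struct mapping *m = &page->map[page->cursor];

                page->cursor = (page->cursor + 1) % NR_MAPPINGS;
                if (m->accessed) {
                        m->accessed = false;
                        referenced++;
                        break;  /* positive count is enough for reclaim */
                }
        }
        return referenced;
}

int main(void)
{
        struct shared_page page = { .cursor = 0 };

        /* Mark every mapping young, as if the page were in heavy use. */
        for (int i = 0; i < NR_MAPPINGS; i++)
                page.map[i].accessed = true;

        /* The rotating cursor keeps the capped walks making progress. */
        for (int pass = 1; pass <= 10; pass++)
                printf("pass %2d: referenced=%d\n", pass,
                       page_referenced_capped(&page));
        return 0;
}

Running it shows each capped walk clears at most one young bit per
pass, so it takes NR_MAPPINGS passes before the page can look idle,
which is where the concerns below come in.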
The rmap_item->hlist is an hlist however, so rotation isn't possible at
this time, and we'd waste 8 bytes of metadata per stable node if we
were to add it. Still, it would be trivial to add the rotation.

However, even the most benign change described above can result in
false positive OOM. By the time you move on to the next page, the
accessed bits are quickly set again for all the other pages you're not
scanning. If just a bit too many pages are like that, you won't be
clearing young bits or establishing swap entries fast enough, and
you'll trigger the OOM killer early despite lots of swap being
available. And that's in addition to the compaction/memblock/offlining/
CMA issues that also would have no solution.

> There still would be cases where a full walk is necessary so this idea does
> not fix all problems of concern but it would help reclaim in the case where
> there are many KSM pages that are in use in a relatively obvious manner. More
> importantly, such a modification would benefit cases other than KSM.

What I'm afraid of is that limiting the rmap_walk length would risk
destabilizing the VM. It would exponentially increase the chance that
page_referenced() returns true at all times, in turn not really solving
much if too many KSM pages are shared above the "magical" rmap walk
length. I'm not sure it would benefit other cases either, as they would
end up in the same corner case.

If instead we enforce a KSM sharing limit, there's no point in taking
any more risk, and sticking to the current model of atomic rmap_walks
that simulate a physical page accessed bit certainly looks much safer
to me. I evaluated those possible VM modifications before adding the
sharing limit to KSM and they didn't look definitive to me: there was
no way I could guarantee the VM would not end up in weird corner cases
unless I enforced the KSM sharing limit in the first place.

Something I instead definitely like is the reduction of the IPIs in the
rmap_walks, I mean your work and Hugh's patch that extends it to
page_migration. That looks just great and it has zero risk of
destabilizing the VM, but it's also orthogonal to the KSM sharing
limit.

Not clearing all accessed bits in page_referenced, or failing KSM
migration, depending on a magic rmap_walk limit is something I don't
like very much. Those changes can certainly hide the problem in some
cases: they would work absolutely great if there's just 1 KSM page in
the system over the "magical" limit, but they have drawbacks and it
looks flaky if too many KSM pages are above the "magical" limit. The
KSM sharing limit value is way less magical than that: if it's set too
low you can only lose some memory, you won't risk VM malfunction.

Thanks,
Andrea