On Mon, 31 Mar 2014, Kelley Nielsen wrote:

> The function try_to_unuse() is of quadratic complexity, with a lot of
> wasted effort. It unuses swap entries one by one, potentially iterating
> over all the page tables for all the processes in the system for each
> one.
>
> This new proposed implementation of try_to_unuse simplifies its
> complexity to linear. It iterates over the system's mms once, unusing
> all the affected entries as it walks each set of page tables. It also
> makes similar changes to shmem_unuse.

I have now given this v5 a try, and it seems to be working well for the
normal case, but not when under global memory pressure: swapoff gets
stuck going round and round in the find_next_to_unuse() loop.

The reason is that the loop thinks it's just cleaning up the orphaned
swapcache pages, but in fact there can be swap entries reinserted back
into process space since the mmlist loop passed over them: on those, the
find_next_to_unuse() loop keeps on reading in from swap, then deleting
from swapcache, but does nothing actually to free up the swap.

Which draws attention to a feature of the old per-page approach that you
need to incorporate into yours. To make forward progress under global
memory pressure, a page should be mapped into all its mms and the swap
then freed as soon as possible. Any delay makes it more likely that page
reclaim will reinsert a swap entry; and until the page is removed from
swapcache, reclaim cannot assign new swap to it.

So, don't just add allowance for still-mapped pages into that final
find_next_to_unuse() loop - that's necessary (a goto back to the start,
or perhaps to after the shmem_unuse - I'm not certain), but you also
need to delete_from_swap_cache() each page in turn as soon as its
swapcount goes down to 0, to minimize the chance of reusing the swap.

Given the way copy_one_pte() adds dst_mm to mmlist after src_mm, mms
sharing the same page should be clustered closely in the mmlist. For
this patch it will probably be good enough not to complicate your
current structure, but just add the delete_from_swap_cache() when the
swapcount reaches 0 into unuse_pte() or nearby. (And don't be
disheartened if, once you've put it in, I suggest you move it around:
this code has been stable for a long while, and there's a lot to
consider in reorganizing it, not obvious to either of us.)

I think a further patch, which does an rmap walk when the first pte has
been found (if swapcount indicates more entries to update) while still
holding the page lock, will probably be a good addition on top. (That
ksm_might_need_to_copy(): hmm, yes, KSM complicates this.)

It's tempting to think that with such an rmap walk under page lock in
place, there would no longer be a need for the find_next_to_unuse() loop
to terminate and go back to mmlist. Tempting, but I wouldn't bet on it:
we shall probably find odd corners which still make retry necessary
(KSM? mremap move? others?).

While I remember, two other things you can address in separate patches
(no need to worry about them until this is stable). Don't forget to
delete __locate() and radix_tree_locate_item() from lib/radix-tree.c and
include/linux/radix-tree.h: that will improve your diffstat. And I think
it would help everyone if you changed the ancient strange "unuse"
throughout to "swapoff" (particularly now that s390 are adding a
pte_unused() which has nothing to do with all this).

Hugh
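
To make the delete_from_swap_cache()-when-swapcount-hits-0 suggestion
above concrete, here is a minimal sketch (not from the mail itself).
swap_free(), page_swapcount(), PageSwapCache(), PageWriteback() and
delete_from_swap_cache() are existing mm helpers of that era; the helper
name unuse_pte_free_swap() and its exact placement relative to
unuse_pte() are assumptions for illustration only, not the final patch.

    #include <linux/mm.h>
    #include <linux/swap.h>
    #include <linux/pagemap.h>

    /*
     * Sketch only: after unuse_pte() has restored the pte, drop this
     * pte's reference on the swap entry, and if that was the last swap
     * reference, remove the page from swap cache right away, so that
     * reclaim cannot reinsert the same entry (or assign new swap to the
     * page) before swapoff has finished with it.
     *
     * Assumes the caller still holds the page lock, as unuse_pte() does;
     * delete_from_swap_cache() requires the page to be locked.
     */
    static void unuse_pte_free_swap(struct page *page, swp_entry_t entry)
    {
            swap_free(entry);       /* drop this pte's swap reference */

            if (PageSwapCache(page) && !page_swapcount(page) &&
                !PageWriteback(page))
                    delete_from_swap_cache(page);
    }

Whether something like this lives in unuse_pte() itself or just after it
in the caller is exactly the "move it around" question left open above;
the point is only that the swap slot is released and the swapcache entry
dropped while the page lock is still held.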