Re: [PATCH v3 2/2] mm: rid swapoff of quadratic complexity

Thanks a lot for the fixes and detailed explanation Hugh! I shall fold in all the changes from you and Huang in the next iteration.

Thanks for all the suggestions and comments as well. I am looking into them and will include the changes in the next version. I will follow up over mail if anything needs clarification.

Thanks again!
~Vineeth

On Mon, Dec 31, 2018 at 7:44 PM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
On Mon, 3 Dec 2018, Vineeth Remanan Pillai wrote:

> This patch was initially posted by Kelley (kelleynnn@xxxxxxxxx).
> Reposting the patch with all review comments addressed and with minor
> modifications and optimizations. Tests were rerun and commit message
> updated with new results.
>
> The function try_to_unuse() is of quadratic complexity, with a lot of
> wasted effort. It unuses swap entries one by one, potentially iterating
> over all the page tables for all the processes in the system for each
> one.
>
> This new implementation of try_to_unuse() reduces its complexity to
> linear. It iterates over the system's mms once, unusing all the
> affected entries as it walks each set of page tables. It also
> makes similar changes to shmem_unuse.

Hi Vineeth, please fold in fixes below before reposting your
"mm,swap: rid swapoff of quadratic complexity" patch -
or ask for more detail if unclear.  I could split it up,
of course, but since they should all (except perhaps one)
just be merged into the base patch before going any further,
it'll save me time to keep them together here and just explain:-

shmem_unuse_swap_entries():
If a user fault races with swapoff, it's very normal for
shmem_swapin_page() to return -EEXIST, and the old code was
careful not to pass back any error but -ENOMEM; whereas on mmotm,
/usr/sbin/swapoff often failed silently because it got that -EEXIST.
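
The shape of the fix (a sketch only: the loop bounds and the
shmem_swapin_page() arguments here are illustrative, not the
patch's exact code - see the first shmem.c hunk below):

        for (i = 0; i < nr_entries; i++) {
                error = shmem_swapin_page(...); /* may race with a user fault */
                if (error == -ENOMEM)
                        break;          /* genuine failure: stop and report it */
                error = 0;              /* e.g. -EEXIST is benign: keep going */
        }
        return error;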

shmem_unuse():
A couple of crashing bugs there: a list_del_init() without holding the
mutex, and too much faith in the "safe" of list_for_each_entry_safe():
it assumes the mutex is held throughout; you (very nicely!) drop it,
but then "next" must be re-evaluated once the mutex is retaken.
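
For clarity, the loop body once fixed (this is just the second hunk
below, shown with its context joined up):

        mutex_unlock(&shmem_swaplist_mutex);
        if (prev_inode)
                iput(prev_inode);
        prev_inode = inode;

        error = shmem_unuse_inode(inode, type);
        cond_resched();

        mutex_lock(&shmem_swaplist_mutex);
        next = list_next_entry(info, swaplist); /* re-evaluate under the mutex */
        if (!info->swapped)
                list_del_init(&info->swaplist); /* safe now: mutex is held */
        if (error)
                break;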

shmem_writepage():
Not a bug fix, this is the "except perhaps one": a minor optimization
that could be left out. But if shmem_unuse() is going through the list
in the forward direction, and may completely unswap a file and delete
it from the list, then pages from that file can be swapped out to
*other* swap areas after that, and the file reinserted in the list:
better to reinsert it behind shmem_unuse()'s cursor than in front of
it, which would entail a second pointless pass over that file.
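
That is the one-line change in the shmem_writepage() hunk below:

        /* Insert at the head, i.e. behind shmem_unuse()'s forward cursor */
        if (list_empty(&info->swaplist))
                list_add(&info->swaplist, &shmem_swaplist);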

try_to_unuse():
Moved up the assignment of "oldi = i" (and changed the test to
"i <= oldi"), so as not to get trapped in that find_next_to_unuse()
loop when find_get_page() does not find the page.
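
In the patched loop below, that ordering is (a sketch, with the retry
accounting elided):

        while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
                if (i <= oldi)
                        goto retry;     /* wrapped: another mmlist pass */
                oldi = i;               /* moved up: advance even when the
                                         * swap-cache lookup finds no page */
                ...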

try_to_unuse():
But the main problem was passing entry.val to find_get_page() there:
that used to be correct, but since f6ab1f7f6b2d we need to pass just
the offset - as it stood, it could only find the pages when swapping
off area 0 (a similar issue was fixed in shmem_replace_page() recently).
That (together with the late oldi assignment) was why my swapoffs were
hanging on SWAP_HAS_CACHE swap_map entries.
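
That is, since f6ab1f7f6b2d ("mm, swap: use offset of swap entry as
key of swap cache") the lookup has to be

        entry = swp_entry(type, i);
        page = find_get_page(swap_address_space(entry), swp_offset(entry));

where swp_offset(entry) == i here, which is what the fix below passes.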

With those changes, it all seems to work rather well, and is a nice
simplification of the source, in addition to removing the quadratic
complexity. To my great surprise, the KSM pages are already handled
fairly well - the ksm_might_need_to_copy() that has long been in
unuse_pte() turns out to do (almost) a good enough job already,
so most users of KSM and swapoff would never see any problem.
And I'd been afraid of swapin readahead causing spurious -ENOMEMs,
but have seen nothing of that in practice (though something else
in mmotm does appear to use up more memory than before).

My remaining criticisms would be:

As Huang Ying pointed out in other mail, there is a danger of
livelock (or rather, hitting the MAX_RETRIES limit) when a multiply
mapped page (most especially a KSM page, whose mappings are not
likely to be nearby in the mmlist) is swapped out then partially
swapped off then some ptes swapped back out.  As indeed the
"Under global memory pressure" comment admits.

I have hit the MAX_RETRIES 3 limit several times in load testing,
not investigated but I presume due to such a multiply mapped page,
so at present we do have a regression there.  A very simple answer
would be to remove the retries limit - perhaps you just added
that to get around the find_get_page() failure before it was
understood?  That does then tend towards the livelock alternative,
but you've kept the signal_pending() check, so there's still the
same way out as the old technique had (but a greater likelihood of
needing it with the new technique).  The right fix will be to do
an rmap walk to unuse all the swap entries of a single anon_vma
while holding the page lock (with KSM needing that page force-deleted
from swap cache before moving on); but none of us have written that
code yet, so maybe just removing the retries limit is good enough.
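
Dropping the limit would be as simple as (illustrative, against the
patched loop below):

        if (i <= oldi) {
                /*
                 * Wrapped around: some entries have gone back out to
                 * swap.  Just make another pass over the mmlist; the
                 * signal_pending() check remains the way to bail out.
                 */
                goto retry;
        }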

Two dislikes on the code structure, probably one solution: the
"goto retry", up two levels from inside the lower loop, is easy to
misunderstand; and the oldi business is ugly - find_next_to_unuse()
was written to wrap around continuously to suit the old loop, but
now it's left with its "++i >= max" code to achieve that, then your
"i <= oldi" code to detect when it did, to undo that again: please
delete code from both ends to make that all simpler.
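
One way to delete from both ends would be to stop find_next_to_unuse()
wrapping, so the caller's oldi bookkeeping can go too.  A sketch
(keeping the SWAP_MAP_BAD and frontswap checks as they are today):

        static unsigned int find_next_to_unuse(struct swap_info_struct *si,
                                               unsigned int prev, bool frontswap)
        {
                unsigned int i;
                unsigned char count;

                for (i = prev + 1; i < si->max; i++) {
                        count = READ_ONCE(si->swap_map[i]);
                        if (count && swap_count(count) != SWAP_MAP_BAD &&
                            (!frontswap || frontswap_test(si, i)))
                                return i;
                }
                return 0;       /* no wraparound: 0 simply means "done" */
        }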

I'd expect to see checks on inuse_pages in some places, to terminate
the scans as soon as possible (swapoff of an unused swapfile should
be very quick, shouldn't it? not requiring any scans at all); but it
looks like the old code did not have those either - was inuse_pages
unreliable once upon a time? is it unreliable now?
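
The sort of check I mean, at the start of try_to_unuse() and wherever
a scan completes (a sketch, assuming inuse_pages can be trusted):

        if (!READ_ONCE(si->inuse_pages))
                goto out;       /* nothing left in use: swapoff is done */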

Hugh

---

 mm/shmem.c    |   12 ++++++++----
 mm/swapfile.c |    8 ++++----
 2 files changed, 12 insertions(+), 8 deletions(-)

--- mmotm/mm/shmem.c    2018-12-22 13:32:51.339584848 -0800
+++ linux/mm/shmem.c    2018-12-31 12:30:55.822407154 -0800
@@ -1149,6 +1149,7 @@ static int shmem_unuse_swap_entries(stru
                }
                if (error == -ENOMEM)
                        break;
+               error = 0;
        }
        return error;
 }
@@ -1216,12 +1217,15 @@ int shmem_unuse(unsigned int type)
                mutex_unlock(&shmem_swaplist_mutex);
                if (prev_inode)
                        iput(prev_inode);
+               prev_inode = inode;
+
                error = shmem_unuse_inode(inode, type);
-               if (!info->swapped)
-                       list_del_init(&info->swaplist);
                cond_resched();
-               prev_inode = inode;
+
                mutex_lock(&shmem_swaplist_mutex);
+               next = list_next_entry(info, swaplist);
+               if (!info->swapped)
+                       list_del_init(&info->swaplist);
                if (error)
                        break;
        }
@@ -1313,7 +1317,7 @@ static int shmem_writepage(struct page *
         */
        mutex_lock(&shmem_swaplist_mutex);
        if (list_empty(&info->swaplist))
-               list_add_tail(&info->swaplist, &shmem_swaplist);
+               list_add(&info->swaplist, &shmem_swaplist);

        if (add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
                spin_lock_irq(&info->lock);
diff -purN mmotm/mm/swapfile.c linux/mm/swapfile.c
--- mmotm/mm/swapfile.c 2018-12-22 13:32:51.347584880 -0800
+++ linux/mm/swapfile.c 2018-12-31 12:30:55.822407154 -0800
@@ -2156,7 +2156,7 @@ retry:

        while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
                /*
-                * under global memory pressure, swap entries
+                * Under global memory pressure, swap entries
                 * can be reinserted back into process space
                 * after the mmlist loop above passes over them.
                 * This loop will then repeat fruitlessly,
@@ -2164,7 +2164,7 @@ retry:
                 * but doing nothing to actually free up the swap.
                 * In this case, go over the mmlist loop again.
                 */
-               if (i < oldi) {
+               if (i <= oldi) {
                        retries++;
                        if (retries > MAX_RETRIES) {
                                retval = -EBUSY;
@@ -2172,8 +2172,9 @@ retry:
                        }
                        goto retry;
                }
+               oldi = i;
                entry = swp_entry(type, i);
-               page = find_get_page(swap_address_space(entry), entry.val);
+               page = find_get_page(swap_address_space(entry), i);
                if (!page)
                        continue;

@@ -2188,7 +2189,6 @@ retry:
                try_to_free_swap(page);
                unlock_page(page);
                put_page(page);
-               oldi = i;
        }
 out:
        return retval;
