Hi Jason, Thank you for your comments. My replies below. On Thu, Dec 17, 2020 at 3:50 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote: > > On Thu, Dec 17, 2020 at 01:52:41PM -0500, Pavel Tatashin wrote: > > +/* > > + * Verify that there are no unpinnable (movable) pages, if so return true. > > + * Otherwise an unpinnable pages is found return false, and unpin all pages. > > + */ > > +static bool check_and_unpin_pages(unsigned long nr_pages, struct page **pages, > > + unsigned int gup_flags) > > +{ > > + unsigned long i, step; > > + > > + for (i = 0; i < nr_pages; i += step) { > > + struct page *head = compound_head(pages[i]); > > + > > + step = compound_nr(head) - (pages[i] - head); > > You can't assume that all of a compound head is in the pages array, > this assumption would only work inside the page walkers if the page > was found in a PMD or something. I am not sure I understand your comment. The compound head is not taken from the pages array, and not assumed to be in it. It is exactly the same logic as that we currently have: https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1565 > > > + if (gup_flags & FOLL_PIN) { > > + unpin_user_pages(pages, nr_pages); > > So we throw everything away? Why? That isn't how the old algorithm worked It is exactly like the old algorithm worked: if there are pages to be migrated (not pinnable pages) we unpinned everything. See here: https://soleen.com/source/xref/linux/mm/gup.c?r=a00cda3f#1603 If cma_pages_list is not empty unpin everything. The list is not empty if we isolated some pages, we isolated some pages if there are some pages which are not pinnable. Now, we do exactly the same thing, but cleaner, and handle errors. We must unpin everything because if we fail, no pages should stay pinned, and also if we migrated some pages, the pages array must be updated, so we need to call __get_user_pages_locked() pin and repopulated pages array. > > > @@ -1654,22 +1664,55 @@ static long __gup_longterm_locked(struct mm_struct *mm, > > struct vm_area_struct **vmas, > > unsigned int gup_flags) > > { > > - unsigned long flags = 0; > > + int migrate_retry = 0; > > + int isolate_retry = 0; > > + unsigned int flags; > > long rc; > > > > - if (gup_flags & FOLL_LONGTERM) > > - flags = memalloc_pin_save(); > > + if (!(gup_flags & FOLL_LONGTERM)) > > + return __get_user_pages_locked(mm, start, nr_pages, pages, vmas, > > + NULL, gup_flags); > > > > - rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL, > > - gup_flags); > > + /* > > + * Without FOLL_WRITE fault handler may return zero page, which can > > + * be in a movable zone, and also will fail to isolate during migration, > > + * thus the longterm pin will fail. > > + */ > > + gup_flags &= FOLL_WRITE; > > Is &= what you mean here? |= right? Right, I meant |=. > > Seems like we've ended up in a weird place if FOLL_LONGTERM always > includes FOLL_WRITE. Putting the zero page in ZONE_MOVABLE seems like > a bad idea, no? I am not sure, I just found this problem during testing, and this is the solution I am proposing. I am worried about limiting the zero page to a non movable zone, but let's see what others think about this. > > > + /* > > + * Migration may fail, we retry before giving up. Also, because after > > + * migration pages[] becomes outdated, we unpin and repin all pages > > + * in the range, so pages array is repopulated with new values. > > + * Also, because of this we cannot retry migration failures in a loop > > + * without pinning/unpinnig pages. > > + */ > > The old algorithm made continuous forward progress and only went back > to the first migration point. That is not right, the old code went back to the beginning. Making continuous progress is possible, but we won't see any performance betnefit from it, because migration failures is already exception scenarios where machine is under memory stress. The truth is if we fail to migrate it is unlikely will succeed if we retry right away, so giving some time between retries may be even beneficial. Also with continious progress we need to take care of some corner cases where we need to unpin already succeeded pages in case if forward progress is not possible. Also, adjust pages array, start address etc. > > > + for (; ; ) { > > while (true)? Hm, the same thing? :) > > > + rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, > > + NULL, gup_flags); > > > + /* Return if error or if all pages are pinnable */ > > + if (rc <= 0 || check_and_unpin_pages(rc, pages, gup_flags)) > > + break; > > So we sweep the pages list twice now? Yes, but O(N) is the same. No new operation is added. Before we had something like this: while (npages) check if pinnable isolate while (npages) unpin migrate while (npages) pin Now: while(npages) check if pinnable while(npages) unpin while(npages) isolate migrate pin > > > + /* Some pages are not pinnable, migrate them */ > > + rc = migrate_movable_pages(rc, pages); > > + > > + /* > > + * If there is an error, and we tried maximum number of times > > + * bail out. Notice: we return an error code, and all pages are > > + * unpinned > > + */ > > + if (rc < 0 && migrate_retry++ >= PINNABLE_MIGRATE_MAX) { > > + break; > > + } else if (rc > 0 && isolate_retry++ >= PINNABLE_ISOLATE_MAX) { > > + rc = -EBUSY; > > I don't like this at all. It shouldn't be so flakey > > Can you do migration without the LRU? I do not think it is possible, we must isolate pages before migration. Pasha