On Fri, Sep 15, 2023 at 9:09 AM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Thu, Sep 14, 2023 at 9:26 PM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> >
> > On Thu, 14 Sep 2023, Suren Baghdasaryan wrote:
> > > On Thu, Sep 14, 2023 at 9:24 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > > On Thu, Sep 14, 2023 at 08:53:59PM +0000, Suren Baghdasaryan wrote:
> > > > > On Thu, Sep 14, 2023 at 8:00 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > > > > > On Thu, Sep 14, 2023 at 7:09 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Thu, Sep 14, 2023 at 06:20:56PM +0000, Suren Baghdasaryan wrote:
> > > > > > > > I think I found the problem and the explanation is much simpler. While
> > > > > > > > walking the page range, queue_folios_pte_range() encounters an
> > > > > > > > unmovable page and queue_folios_pte_range() returns 1. That causes a
> > > > > > > > break from the loop inside walk_page_range() and no more VMAs get
> > > > > > > > locked. After that the loop calling mbind_range() walks over all VMAs,
> > > > > > > > even the ones which were skipped by queue_folios_pte_range() and that
> > > > > > > > causes this BUG assertion.
> > > > > > > >
> > > > > > > > Thinking what's the right way to handle this situation (what's the
> > > > > > > > expected behavior here)...
> > > > > > > > I think the safest way would be to modify walk_page_range() and make
> > > > > > > > it continue calling process_vma_walk_lock() for all VMAs in the range
> > > > > > > > even when __walk_page_range() returns a positive err. Any objection or
> > > > > > > > alternative suggestions?
> > > > > > >
> > > > > > > So we only return 1 here if MPOL_MF_MOVE* & MPOL_MF_STRICT were
> > > > > > > specified. That means we're going to return an error, no matter what,
> > > > > > > and there's no point in calling mbind_range(). Right?
> > > > > > >
> > > > > > > +++ b/mm/mempolicy.c
> > > > > > > @@ -1334,6 +1334,8 @@ static long do_mbind(unsigned long start, unsigned long len,
> > > > > > >         ret = queue_pages_range(mm, start, end, nmask,
> > > > > > >                           flags | MPOL_MF_INVERT, &pagelist, true);
> > > > > > >
> > > > > > > +       if (ret == 1)
> > > > > > > +               ret = -EIO;
> > > > > > >         if (ret < 0) {
> > > > > > >                 err = ret;
> > > > > > >                 goto up_out;
> > > > > > >
> > > > > > > (I don't really understand this code, so it can't be this simple, can
> > > > > > > it? Why don't we just return -EIO from queue_folios_pte_range() if
> > > > > > > this is the right answer?)
> > > > > >
> > > > > > Yeah, I'm trying to understand the expected behavior of this function
> > > > > > to make sure we are not missing anything. I tried a simple fix that I
> > > > > > suggested in my previous email and it works but I want to understand a
> > > > > > bit more about this function's logic before posting the fix.
> > > > >
> > > > > So, current functionality is that after queue_pages_range() encounters
> > > > > an unmovable page, terminates the loop and returns 1, mbind_range()
> > > > > will still be called for the whole range
> > > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1345),
> > > > > all pages in the pagelist will be migrated
> > > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1355)
> > > > > and only after that the -EIO code will be returned
> > > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1362).
> > > > > So, if we follow Matthew's suggestion we will be altering the current
> > > > > behavior which I assume is not what we want to do.
> > > >
> > > > Right, I'm intentionally changing the behaviour. My thinking is
> > > > that mbind(MPOL_MF_MOVE | MPOL_MF_STRICT) is going to fail. Should
> > > > such a failure actually move the movable pages before reporting that
> > > > it failed? I don't know.
> > > > >
> > > > > The simple fix I was thinking about that would not alter this behavior
> > > > > is smth like this:
> > > >
> > > > I don't like it, but can we run it past syzbot to be sure it solves the
> > > > issue and we're not chasing a ghost here?
> > >
> > > Yes, I just finished running the reproducer on both upstream and
> > > linux-next builds listed in
> > > https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 and the
> > > problem does not happen anymore.
> > > I'm fine with your suggestion too, just wanted to point out it would
> > > introduce change in the behavior. Let me know how you want to proceed.
> >
> > Well done, identifying the mysterious cause of this problem:
> > I'm glad to hear that you've now verified that hypothesis.
> >
> > You're right, it would be a regression to follow Matthew's suggestion.
> >
> > Traditionally, modulo bugs and inconsistencies, the queue_pages_range()
> > phase of do_mbind() has done the best it can, gathering all the pages it
> > can that need migration, even if some were missed; and proceeds to do the
> > mbind_range() phase if there was nothing "seriously" wrong (a gap causing
> > -EFAULT). Then at the end, if MPOL_MF_STRICT was set, and not all the
> > pages could be migrated (or MOVE was not specified and not all pages
> > were well placed), it returns -EIO rather than 0 to inform the caller
> > that not all could be done.
> >
> > There have been numerous tweaks, but I think most importantly
> > 5.3's d883544515aa ("mm: mempolicy: make the behavior consistent when
> > MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") added those "return 1"s
> > which stop the pagewalk early. In my opinion, not an improvement - makes
> > it harder to get mbind() to do the best job it can (or is it justified as
> > what you're asking for if you say STRICT?).
> >
> > But whatever, it would be a further regression for mbind() not to have
> > done the mbind_range(), even though it goes on to return -EIO.
> >
> > I had a bad first reaction to your walk_page_range() patch (was expecting
> > to see vma_start_write()s in mbind_range()), but perhaps your patch is
> > exactly what process_mm_walk_lock() does now demand.
> >
> > [Why is Hugh responding on this? Because I have some long-standing
> > mm/mempolicy.c patches to submit next week, but in reviewing what I
> > could or could not afford to get into at this time, had decided I'd
> > better stay out of queue_pages_range() for now - beyond the trivial
> > preferring an MPOL_MF_WRLOCK flag to your bool lock_vma.]
>
> Thanks for the feedback, Hugh!
> Yeah, this positive err handling is kinda weird. If this behavior (do
> as much as possible even if we fail eventually) is specific to mbind()
> then we could keep walk_page_range() as is and lock the VMAs inside
> the loop that calls mbind_range() with a condition that ret is
> positive. That would be the simplest solution IMHO. But if we expect
> walk_page_range() to always apply requested page_walk_lock policy to
> all VMAs even if some mm_walk_ops returns a positive error somewhere
> in the middle of the walk then my fix would work for that.
> So, to me
> the important question is how we want walk_page_range() to behave in
> these conditions. I think we should answer that first and document
> that. Then the fix will be easy.

I looked at all the cases where we perform page walk while locking VMAs
and mbind() seems to be the only one that would require walk_page_range()
to lock all VMAs even for a failed walk.
So, I suggest this fix instead and I can also document that if
walk_page_range() fails it might not apply page_walk_lock policy to the
VMAs.

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 42b5567e3773..cbc584e9b6ca 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1342,6 +1342,9 @@ static long do_mbind(unsigned long start, unsigned long len,
         vma_iter_init(&vmi, mm, start);
         prev = vma_prev(&vmi);
         for_each_vma_range(vmi, vma, end) {
+                /* If queue_pages_range failed then not all VMAs might be locked */
+                if (ret)
+                        vma_start_write(vma);
                 err = mbind_range(&vmi, vma, &prev, start, end, new);
                 if (err)
                         break;

If this looks good I'll post the patch. Matthew, Hugh, anyone else?

> >
> >
> > Hugh
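
A minimal userspace sketch of the mbind() contract discussed above may help
readers follow the argument. It is hypothetical (not taken from the thread)
and assumes a NUMA-capable kernel plus the libnuma development headers; the
file name, mapping size and target node 0 are arbitrary examples. With
MPOL_MF_MOVE | MPOL_MF_STRICT, the expectation defended by Hugh is that
mbind() applies the policy and migrates every page it can, and only then
reports EIO if some pages could not be placed on the requested node.

/*
 * mbind-demo.c - hypothetical example, not from the thread.
 * Build: gcc mbind-demo.c -o mbind-demo -lnuma
 */
#include <errno.h>
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 16 * 4096;
        unsigned long nodemask = 1UL << 0;      /* example target: node 0 */
        char *p;

        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(p, 0, len);      /* fault the pages in so there is something to migrate */

        if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
                  MPOL_MF_MOVE | MPOL_MF_STRICT) == -1) {
                /*
                 * EIO means some pages could not be moved to node 0; per the
                 * behavior described above, the policy has still been applied
                 * and the movable pages have already been migrated.
                 */
                fprintf(stderr, "mbind: %s\n", strerror(errno));
                munmap(p, len);
                return 1;
        }
        printf("all pages placed on node 0\n");
        munmap(p, len);
        return 0;
}

On a single-node machine the call trivially succeeds; the EIO path only
becomes interesting when some of the faulted pages are unmovable or the
target node cannot accept them.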