On 3/11/22 09:54, Hugh Dickins wrote: > On Wed, 9 Mar 2022, Vlastimil Babka wrote: >> On 3/8/22 22:32, Hugh Dickins wrote: >> > On Tue, 8 Mar 2022, Liam Howlett wrote: >> >> >> >> I must be missing something. If mpol_equal() isn't sufficient to ensure >> >> we don't need to set_policy(), then why are the other vma_merge() cases >> >> okay - such as madvise_update_vma() and mlock_fixup()? Won't the mem >> >> policy change in the same way in these cases? >> > >> > mlock provides a good example to compare. >> > >> > Mlocking pages is the business of mlock(), and mlock_fixup() needs to >> > attend to mm->locked_vm, and calling something to mark as PageMlocked >> > those pages already in the area now covered by mlock. But it doesn't >> > need to worry about set_policy(), that's not its business, and is >> > unaffected by mlock changes (though merging of vmas needs mpol_equal() >> > to check that policy is the same, and merging and splitting of vmas >> > need to maintain the refcount of the shared policy if any). >> > >> > Whereas NUMA mempolicy is the business of mbind(), and mbind_range() >> > needs to attend to vma->vm_policy, and if it's a mapping of something >> > supporting a shared set_policy(), call that to establish the new range >> > on the object mapped. But it doesn't need to worry about mm->locked_vm >> > or whether pages are Mlocked, that's not its business, and is unaffected >> > by mbind changes (though merging of vmas needs to check VM_LOCKED among >> > other flags to check that they are the same before it can merge). >> >> So if I understand correctly, we have case 8 of vma_merge(): >> >> AAAA >> PPPPNNNNXXXX >> becomes >> PPPPXXXXXXXX 8 >> >> N is vma with some old policy different from new_pol >> A is the range where we change to new policy new_pol, which happens to be >> the same as existing policy of X >> Thus vma_merge() extends vma X to include range A - the vma N >> vma_merge() succeeds because it's passed new_pol to do the compatibility >> checks (although N still has the previous policy) > > I *think* you have it the wrong way round there: my reading is that > this vma_merge() case 8 was correctly handled before, because in its > case !mpol_equal(vma_policy(vma), new_pol): I think case 8 was being > handled correctly, but the other cases were not. > > Or was the comment even correct to reference case 8 especially? I think it wasn't, but... > I'm afraid bringing it all back to mind is a bit of an effort: I won't > stake my life on it, perhaps I'm the one who has it the wrong way round. ... same here. Importantly I believe your patch is the correct solution. >> >> Before Hugh's patch we would then realize "oh X already has new_pol, nothing >> to do". Note that this AFAICS doesn't affect actual pages migration between >> nodes, because that happens outside of mbind_range(). But it causes us to >> skip vma_replace_policy(), which causes us to skip vm_ops->set_policy, where >> tmpfs does something important (we could maybe argue that Hugh didn't >> specify the user visible effects of this exactly enough :) what is "leaving >> the new mbind unenforced" - are pages not migrated in this case?). > > Went back to check the original (internal) report: > mbind MPOL_BIND on tmpfs can result in allocations on the wrong node. > And it was a genuine practical case, though the finder was kind enough > to distil it down to a minimal sequence (and correctly suggest the fix). > > The user visible effect was that the pages got allocated on the local node > (happened to be 0), after the mbind() caller had specifically asked for > them to be allocated on node 1. There was not any page migration involved > in the case reported: the pages simply got allocated on the wrong node. That's useful, thanks. > And yes, on this patch I should have asked for a > Cc: <stable@xxxxxxxxxxxxxxx> Agree. Andrew can add it, and also the user visible effects above? Thanks, Vlastimil >> >> HTH (if I'm right), >> Vlastimil