On Sat, Mar 25, 2023 at 9:33 AM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote: > > I actually didn't follow what you meant by "mutually PMD-aligned". Could you > provide some example address numbers to explain? Sure, let me make this more clear with a couple of concrete examples. Let's say that we have a range '[old, old+len]' and we want to remap it to '[new, new+len]'. Furthermore, we'll say that overlapping is fine, but because we're always moving pages from "low address to high", we only allow overlapping when we're moving things down (ie 'new < old'). And yes, I know that the overlapping case cannot actually happen with mremap() itself. So in practice the overlapping case only happens for the special "move the stack pages" around at execve() startup, but let's ignore that for now. So we'll talk about the generic "move pages around" case, not the more limited 'mremap()' case. I'll also simplify the thing to just assume that we have that CONFIG_HAVE_MOVE_PMD enabled, so I'll ignore some of the full grotty details. Ok? So let's take the simple case first: (a) everything is PMD aligned, so all of old, new, and len are multiples of the PMD size. This is the simple case, because we can just do the whole move by moving the PMD entries, and we'll never hit any issues. As we move the PMD entries in move_normal_pmd(), we always remove the entry we are moving down: /* Clear the pmd */ pmd = *old_pmd; pmd_clear(old_pmd); so as we move up in the destination, we will never hit a populated PMD entry as we walk the page tables. So (a) is fine today, everybody is happy, we have no issues. It's efficient and simple. The other simple case is (b) we're *not* PMD-aligned, and everything is done one page at a time. This is basically the exact same thing as (a), except rather than move the PMD entry around, we move a PTE entry, and we do the exact same "clear the entry as we move it" in move_ptes(), except (due to our locking rules being different), it now looks like this instead: pte = ptep_get_and_clear(mm, old_addr, old_pte); but otherwise it's all the same as (a), just one level lower. But then we have case (c): the "mutually aligned" case: > AFAIK, only 2MB aligned memory addresses can be directly addressed by a PMD. > Otherwise how will you index the bytes in the 2MB page? You need bits in the > address for that. The thing is, the problematic case is when 'old' and 'new' are not initially themselves PMD-aligned (so they have lower bits set), but they are *mutually* aligned, so they have the *same* lower bits set. So in this case (c), we start off with case (b), but as we walk the address one page at a time, at some point we suddenly hit case (a) in the middle. To make this really concrete, lets say that we have old = 0x1fff000 new = 0x0fff000 len = 0x201000 and we have PMD_SIZE being 0x200000. Notice how 'old' and 'new' are *not* PMD-aligned, but they are *mutually* aligned in the PMD size (ie "old & (PMD_SIZE-1)" and "new & (PMD_SIZE-1)" are the same). Look at what happens: we start with the unaligned case, and we move one single page, from address 0x1fff000 to 0x0fff000. But what happens after we moved that page? We'll update old/new/len, and now, we'll have old = 0x2000000 new = 0x1000000 len = 0x200000 and now evertthing *is* PMD-aligned, and we just move a single PMD entry from 0x2000000 to 0x1000000. And then we're done. And that's fine, and won't cause any problems, because the area wasn't overlapping at a PMD level, so we're all good. But it it *was* overlapping, and we started out with old = 0x1fff000 new = 1dff000 len = 0x201000 instead, we start out with the exact same thing, but after moving one page, we'll have old = 0x2000000 new = 0x1e00000 len = 0x200000 and now when we try to move the PMD entry from 0x2000000 to 0x1e00000, we'll hit that *old* (and empty) PMD entry that used to contain that one single page that we moved earlier. And notice how for this all to happen, the old/new addresses have to start out mutually aligned. So we can see the dangerous case up-front: even if old and new aren't PMD-aligned to start with, you can see the low bits are the same: (old ^ new) & (PMD_SIZE-1) == 0 because as we move pages around, we'll always be updating old/new together by the same amount. So what I'm saying is that *if* we start out with that situation, and we have that old = 0x1fff000 new = 1dff000 len = 0x201000 we could easily decode "let's just move the whole PMD", and expand the move to be old = 0x1e00000 new = 0x1c00000 len = 0x400000 instead. And then instead of moving PTE's around at first, we'd move PMD's around *all* the time, and turn this into that "simple case (a)". NOTE! For this to work, there must be no mapping right below 'old' or 'new', of course. But during the execve() startup, that should be trivially true. See what I'm saying? Linus