Re: [PATCH] x86/mm/pat: Support splitting of virtual memory areas

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Sorry for the late reply, I was a bit busy, and needed some time to digest
your email.

On Tue, Aug 27, 2024 at 12:01:28PM -0400, Liam R. Howlett wrote:
> * Nam Cao <namcao@xxxxxxxxxxxxx> [240827 03:59]:
> > On Mon, Aug 26, 2024 at 09:58:11AM -0400, Liam R. Howlett wrote:
> > > * Nam Cao <namcao@xxxxxxxxxxxxx> [240825 11:29]:
> > > > When a virtual memory area (VMA) gets splitted, memtype_rbroot's entries
> > > > are not updated. This causes confusion later on when the VMAs get
> > > > un-mapped, because the address ranges of the splitted VMAs do not match the
> > > > address range of the initial VMA.
> > > > 
> > > > For example, if user does:
> > > > 
> > > > 	fd = open("/some/pci/bar", O_RDWR);
> > > > 	addr = mmap(0, 8192, PROT_READ, MAP_SHARED, fd, 0);
> > > > 	mprotect(addr, 4096, PROT_READ | PROT_WRITE);
> > > > 	munmap(p, 8192);
> 
> What is p?  By the comments below, you mean addr here?
Yes, it should be addr. Sorry about that.

> 
> > > > 
> > > > with the physical address starting from 0xfd000000, the range
> > > > (0xfd000000-0xfd002000) would be tracked with the mmap() call.
> > > > 
> > > > After mprotect(), the initial range gets splitted into
> > > > (0xfd000000-0xfd001000) and (0xfd001000-0xfd002000).
> > > > 
> > > > Then, at munmap(), the first range does not match any entry in
> > > > memtype_rbroot, and a message is seen in dmesg:
> > > > 
> > > >     x86/PAT: test:177 freeing invalid memtype [mem 0xfd000000-0xfd000fff]
> > > > 
> > > > The second range still matches by accident, because matching only the end
> > > > address is acceptable (to handle shrinking VMA, added by 2039e6acaf94
> > > > (x86/mm/pat: Change free_memtype() to support shrinking case)).
> > > 
> > > Does this need a fixes tag?
> > 
> > Yes, it should have
> > 	Fixes: 2e5d9c857d4e ("x86: PAT infrastructure patch")
> > thanks for the reminder.
> 
> That commit is from 2008, is there a bug report on this issue?

Not that I am aware of. I'm not entirely sure why, but I would guess due to
the combination of:
- This is not an issue for pages in RAM
- This only happens if VMAs are splitted
- The only user-visible effect is merely a pr_info(), and people may miss it.

I only encountered this issue while "trying to be smart" with mprotect() on
a portion of mmap()-ed device memory, I guess probably not many people do
that.

> 
> > 
> > > 
> > > > 
> > > > Make sure VMA splitting is handled properly, by splitting the entries in
> > > > memtype_rbroot.
> > > > 
> > > > Signed-off-by: Nam Cao <namcao@xxxxxxxxxxxxx>
> > > > ---
> > > >  arch/x86/mm/pat/memtype.c          | 59 ++++++++++++++++++++++++++++++
> > > >  arch/x86/mm/pat/memtype.h          |  3 ++
> > > >  arch/x86/mm/pat/memtype_interval.c | 22 +++++++++++
> > > >  include/linux/pgtable.h            |  6 +++
> > > >  mm/mmap.c                          |  8 ++++
> > > >  5 files changed, 98 insertions(+)
> > > > 
> ...
> 
> > > 
> > > It is also a bit odd that you check VM_PFNMAP() here, then call a
> > > function to check another flag?
> > 
> > Right, this check is redundant, thanks for pointing it out.
> > 
> > I stole this "style" from unmap_single_vma(), but I think the check is
> > redundant there as well.
> 
> If you have identified a redundant check, can you please remove it with
> a separate patch?

Sure.

> 
> > 
> > > 
> > > > +		err = track_pfn_split(vma, addr);
> > > > +		if (err)
> > > > +			goto out_vma_unlink;
> > > > +	}
> > > > +
> > > 
> > > I don't think the __split_vma() location is the best place to put this.
> > > Can this be done through the vm_ops->may_split() that is called above?
> > 
> > I don't think ->may_split() is a suitable place. Its name gives me the
> > impression that it only checks whether it is okay to split the VMA, but not
> > really does any splitting work. Also that function pointer can be
> > overwritten by any driver.
> 
> It's a callback that takes the arguments you need and is called as long
> as it exists.  Your function would deny splitting if it failed, so it
> may not split in that case.
> 
> Also, any driver that overwrites it should do what is necessary for PAT
> then. I don't love the idea of using the vm_ops either, I just like it
> better than dropping in flag checks and arch-specific code.  I can see
> issue with using the callback and drivers that may have their own vma
> mapping that also use PAT, I guess.

Yeah I don't love this. You mentioned another approach below, which I
think would be the best (if it's possible). I will attempt that other
approach.

> 
> > > 
> > > This is arch independent code that now has an x86 specific check, and
> > > I'd like to keep __split_vma() out of the flag checking.
> > 
> > I think these track_pfn_*() functions are meant to be arch-independent,
> > it's just that only x86 implements it at the moment. For instance,
> > untrack_pfn() and track_pfn_remap() are called in mm/ code.
> > 
> 
> Arch-independent wrappers that are only used by one arch are not
> arch-independent.  PAT has been around for ages and only exists for x86
> and x86_64.
> 
> We just went through removing arch_unmap(), which was used just for ppc.
> They cause problems for general mm changes and just get in the way.  If
> we can avoid them, we should.
> 
> memtype_interval.c doesn't have any knowledge of the vmas, so you have
> this extraction layer in memtype.c that is being bypassed here for the
> memtype_erase(); ensuring the start-end match or at least the end
> matches.
> 
> So your comment about the second range still matching by accident is
> misleading - it's not matched at all because you are searching for the
> exact match or the end address being the same (which it isn't in your
> interval tree).

But the second range *does* match, because the end address match?
The second range is (0xfd001000-0xfd002000), which matches with
(0xfd000000-0xfd002000) in the interval tree.

Perhaps I should be clearer in the description..

> 
> Taking a step back here, you are splitting a range in an interval tree
> to match a vma split, but you aren't splitting the range based on PAT
> changing; you are splitting it based on the vma becoming two vmas.
> 
> Since VM_PFNMAP is in VM_SPECIAL, the splitting is never undone and will
> continue to fragment the interval tree, so even if flags change back to
> match each other there will always be two vams - and what changed may
> not even be the PAT.

Right, I did not consider this scenario.

> 
> So the interval split should occur when the PAT changes and needs to be
> tracked differently.  This does not happen when the vma is split - it
> happens when a vma is removed or when the PAT is changed.
> 
> And, indeed, for the mremap() shrinking case, you already support
> finding a range by just the end and have an abstraction layer.  The
> problem here is that you don't check by the start - but you could.  You
> could make the change to memtype_erase() to search for the exact, end,
> or start and do what is necessary to shrink off the front of a region as
> well.

I thought about this solution initially, but since the interval tree allow
overlapping ranges, it can be tricky to determine the "best match" out
of the overlapping ranges. But I agree that this approach (if possible)
would be better than the current patch.

Let me think about this some more, and I will come back later.

> 
> What I find very strange is that 2039e6acaf94 ("x86/mm/pat: Change
> free_memtype() to support shrinking case") enables shrinking of
> VM_PFNMAP, but doesn't allow shrinking the end address.  Why is one
> allowed and the other not allowed?

Not really sure what you mean. I think you are ultimately asking why that
commit only matches end address, and not start address? That's because
mremap() may shrink a VMA from [start, end] to [start, new_end] (with
new_end < end). In that case, the range [new_end, end] would be removed
from the interval tree, and that commit wants to match [new_end, end] to
[start, end].
And I don't think mremap() can shrink [start, end] to [new_start, end]?

Thanks for sharing your thoughts.

Best regards,
Nam




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux