Sorry for the late reply, I was a bit busy, and needed some time to digest your email. On Tue, Aug 27, 2024 at 12:01:28PM -0400, Liam R. Howlett wrote: > * Nam Cao <namcao@xxxxxxxxxxxxx> [240827 03:59]: > > On Mon, Aug 26, 2024 at 09:58:11AM -0400, Liam R. Howlett wrote: > > > * Nam Cao <namcao@xxxxxxxxxxxxx> [240825 11:29]: > > > > When a virtual memory area (VMA) gets splitted, memtype_rbroot's entries > > > > are not updated. This causes confusion later on when the VMAs get > > > > un-mapped, because the address ranges of the splitted VMAs do not match the > > > > address range of the initial VMA. > > > > > > > > For example, if user does: > > > > > > > > fd = open("/some/pci/bar", O_RDWR); > > > > addr = mmap(0, 8192, PROT_READ, MAP_SHARED, fd, 0); > > > > mprotect(addr, 4096, PROT_READ | PROT_WRITE); > > > > munmap(p, 8192); > > What is p? By the comments below, you mean addr here? Yes, it should be addr. Sorry about that. > > > > > > > > > with the physical address starting from 0xfd000000, the range > > > > (0xfd000000-0xfd002000) would be tracked with the mmap() call. > > > > > > > > After mprotect(), the initial range gets splitted into > > > > (0xfd000000-0xfd001000) and (0xfd001000-0xfd002000). > > > > > > > > Then, at munmap(), the first range does not match any entry in > > > > memtype_rbroot, and a message is seen in dmesg: > > > > > > > > x86/PAT: test:177 freeing invalid memtype [mem 0xfd000000-0xfd000fff] > > > > > > > > The second range still matches by accident, because matching only the end > > > > address is acceptable (to handle shrinking VMA, added by 2039e6acaf94 > > > > (x86/mm/pat: Change free_memtype() to support shrinking case)). > > > > > > Does this need a fixes tag? > > > > Yes, it should have > > Fixes: 2e5d9c857d4e ("x86: PAT infrastructure patch") > > thanks for the reminder. > > That commit is from 2008, is there a bug report on this issue? Not that I am aware of. I'm not entirely sure why, but I would guess due to the combination of: - This is not an issue for pages in RAM - This only happens if VMAs are splitted - The only user-visible effect is merely a pr_info(), and people may miss it. I only encountered this issue while "trying to be smart" with mprotect() on a portion of mmap()-ed device memory, I guess probably not many people do that. > > > > > > > > > > > > > > Make sure VMA splitting is handled properly, by splitting the entries in > > > > memtype_rbroot. > > > > > > > > Signed-off-by: Nam Cao <namcao@xxxxxxxxxxxxx> > > > > --- > > > > arch/x86/mm/pat/memtype.c | 59 ++++++++++++++++++++++++++++++ > > > > arch/x86/mm/pat/memtype.h | 3 ++ > > > > arch/x86/mm/pat/memtype_interval.c | 22 +++++++++++ > > > > include/linux/pgtable.h | 6 +++ > > > > mm/mmap.c | 8 ++++ > > > > 5 files changed, 98 insertions(+) > > > > > ... > > > > > > > It is also a bit odd that you check VM_PFNMAP() here, then call a > > > function to check another flag? > > > > Right, this check is redundant, thanks for pointing it out. > > > > I stole this "style" from unmap_single_vma(), but I think the check is > > redundant there as well. > > If you have identified a redundant check, can you please remove it with > a separate patch? Sure. > > > > > > > > > > + err = track_pfn_split(vma, addr); > > > > + if (err) > > > > + goto out_vma_unlink; > > > > + } > > > > + > > > > > > I don't think the __split_vma() location is the best place to put this. > > > Can this be done through the vm_ops->may_split() that is called above? > > > > I don't think ->may_split() is a suitable place. Its name gives me the > > impression that it only checks whether it is okay to split the VMA, but not > > really does any splitting work. Also that function pointer can be > > overwritten by any driver. > > It's a callback that takes the arguments you need and is called as long > as it exists. Your function would deny splitting if it failed, so it > may not split in that case. > > Also, any driver that overwrites it should do what is necessary for PAT > then. I don't love the idea of using the vm_ops either, I just like it > better than dropping in flag checks and arch-specific code. I can see > issue with using the callback and drivers that may have their own vma > mapping that also use PAT, I guess. Yeah I don't love this. You mentioned another approach below, which I think would be the best (if it's possible). I will attempt that other approach. > > > > > > > This is arch independent code that now has an x86 specific check, and > > > I'd like to keep __split_vma() out of the flag checking. > > > > I think these track_pfn_*() functions are meant to be arch-independent, > > it's just that only x86 implements it at the moment. For instance, > > untrack_pfn() and track_pfn_remap() are called in mm/ code. > > > > Arch-independent wrappers that are only used by one arch are not > arch-independent. PAT has been around for ages and only exists for x86 > and x86_64. > > We just went through removing arch_unmap(), which was used just for ppc. > They cause problems for general mm changes and just get in the way. If > we can avoid them, we should. > > memtype_interval.c doesn't have any knowledge of the vmas, so you have > this extraction layer in memtype.c that is being bypassed here for the > memtype_erase(); ensuring the start-end match or at least the end > matches. > > So your comment about the second range still matching by accident is > misleading - it's not matched at all because you are searching for the > exact match or the end address being the same (which it isn't in your > interval tree). But the second range *does* match, because the end address match? The second range is (0xfd001000-0xfd002000), which matches with (0xfd000000-0xfd002000) in the interval tree. Perhaps I should be clearer in the description.. > > Taking a step back here, you are splitting a range in an interval tree > to match a vma split, but you aren't splitting the range based on PAT > changing; you are splitting it based on the vma becoming two vmas. > > Since VM_PFNMAP is in VM_SPECIAL, the splitting is never undone and will > continue to fragment the interval tree, so even if flags change back to > match each other there will always be two vams - and what changed may > not even be the PAT. Right, I did not consider this scenario. > > So the interval split should occur when the PAT changes and needs to be > tracked differently. This does not happen when the vma is split - it > happens when a vma is removed or when the PAT is changed. > > And, indeed, for the mremap() shrinking case, you already support > finding a range by just the end and have an abstraction layer. The > problem here is that you don't check by the start - but you could. You > could make the change to memtype_erase() to search for the exact, end, > or start and do what is necessary to shrink off the front of a region as > well. I thought about this solution initially, but since the interval tree allow overlapping ranges, it can be tricky to determine the "best match" out of the overlapping ranges. But I agree that this approach (if possible) would be better than the current patch. Let me think about this some more, and I will come back later. > > What I find very strange is that 2039e6acaf94 ("x86/mm/pat: Change > free_memtype() to support shrinking case") enables shrinking of > VM_PFNMAP, but doesn't allow shrinking the end address. Why is one > allowed and the other not allowed? Not really sure what you mean. I think you are ultimately asking why that commit only matches end address, and not start address? That's because mremap() may shrink a VMA from [start, end] to [start, new_end] (with new_end < end). In that case, the range [new_end, end] would be removed from the interval tree, and that commit wants to match [new_end, end] to [start, end]. And I don't think mremap() can shrink [start, end] to [new_start, end]? Thanks for sharing your thoughts. Best regards, Nam