* Nam Cao <namcao@xxxxxxxxxxxxx> [240903 06:36]: > Sorry for the late reply, I was a bit busy, and needed some time to digest > your email. No problem. > > On Tue, Aug 27, 2024 at 12:01:28PM -0400, Liam R. Howlett wrote: > > * Nam Cao <namcao@xxxxxxxxxxxxx> [240827 03:59]: > > > On Mon, Aug 26, 2024 at 09:58:11AM -0400, Liam R. Howlett wrote: > > > > * Nam Cao <namcao@xxxxxxxxxxxxx> [240825 11:29]: > > > > > When a virtual memory area (VMA) gets splitted, memtype_rbroot's entries > > > > > are not updated. This causes confusion later on when the VMAs get > > > > > un-mapped, because the address ranges of the splitted VMAs do not match the > > > > > address range of the initial VMA. > > > > > > > > > > For example, if user does: > > > > > > > > > > fd = open("/some/pci/bar", O_RDWR); > > > > > addr = mmap(0, 8192, PROT_READ, MAP_SHARED, fd, 0); > > > > > mprotect(addr, 4096, PROT_READ | PROT_WRITE); > > > > > munmap(p, 8192); > > > > What is p? By the comments below, you mean addr here? > Yes, it should be addr. Sorry about that. > > > > > > > > > > > > > with the physical address starting from 0xfd000000, the range > > > > > (0xfd000000-0xfd002000) would be tracked with the mmap() call. > > > > > > > > > > After mprotect(), the initial range gets splitted into > > > > > (0xfd000000-0xfd001000) and (0xfd001000-0xfd002000). > > > > > > > > > > Then, at munmap(), the first range does not match any entry in > > > > > memtype_rbroot, and a message is seen in dmesg: > > > > > > > > > > x86/PAT: test:177 freeing invalid memtype [mem 0xfd000000-0xfd000fff] > > > > > > > > > > The second range still matches by accident, because matching only the end > > > > > address is acceptable (to handle shrinking VMA, added by 2039e6acaf94 > > > > > (x86/mm/pat: Change free_memtype() to support shrinking case)). > > > > > > > > Does this need a fixes tag? > > > > > > Yes, it should have > > > Fixes: 2e5d9c857d4e ("x86: PAT infrastructure patch") > > > thanks for the reminder. > > > > That commit is from 2008, is there a bug report on this issue? > > Not that I am aware of. I'm not entirely sure why, but I would guess due to > the combination of: > - This is not an issue for pages in RAM > - This only happens if VMAs are splitted > - The only user-visible effect is merely a pr_info(), and people may miss it. > > I only encountered this issue while "trying to be smart" with mprotect() on > a portion of mmap()-ed device memory, I guess probably not many people do > that. Or test it. I would have though some bots would have caught this. Although the log message is just pr_info()? That seems wrong - we have an error in the vma tree or the PAT tree and it's just an info printk? ... > > So your comment about the second range still matching by accident is > > misleading - it's not matched at all because you are searching for the > > exact match or the end address being the same (which it isn't in your > > interval tree). > > But the second range *does* match, because the end address match? > The second range is (0xfd001000-0xfd002000), which matches with > (0xfd000000-0xfd002000) in the interval tree. > > Perhaps I should be clearer in the description.. I see, yes. The error is with the first entry not being found. ... > > > > So the interval split should occur when the PAT changes and needs to be > > tracked differently. This does not happen when the vma is split - it > > happens when a vma is removed or when the PAT is changed. > > > > And, indeed, for the mremap() shrinking case, you already support > > finding a range by just the end and have an abstraction layer. The > > problem here is that you don't check by the start - but you could. You > > could make the change to memtype_erase() to search for the exact, end, > > or start and do what is necessary to shrink off the front of a region as > > well. > > I thought about this solution initially, but since the interval tree allow > overlapping ranges, it can be tricky to determine the "best match" out > of the overlapping ranges. But I agree that this approach (if possible) > would be better than the current patch. > > Let me think about this some more, and I will come back later. Reading this some more, I believe you can detect the correct address by matching the start address with the smallest end address (the smallest interval has to be the entry created by the vma mapping). > > > > > What I find very strange is that 2039e6acaf94 ("x86/mm/pat: Change > > free_memtype() to support shrinking case") enables shrinking of > > VM_PFNMAP, but doesn't allow shrinking the end address. Why is one > > allowed and the other not allowed? > > Not really sure what you mean. I think you are ultimately asking why that > commit only matches end address, and not start address? That's because > mremap() may shrink a VMA from [start, end] to [start, new_end] (with > new_end < end). In that case, the range [new_end, end] would be removed > from the interval tree, and that commit wants to match [new_end, end] to > [start, end]. > And I don't think mremap() can shrink [start, end] to [new_start, end]? Even an untrack_pfn() call will only remove the first entry that matches exactly or the end. Since the tree is sorted by start address, I guess the smallest (since it's not specified if it's ordered descending or ascending, and smaller makes more sense) interval will be deleted? That is, a vma cannot span different attributes but attributes can span vmas. Oh wow, this also means if you unmap the end vma first, you will not have an issue because the memtype_erase() (incorrectly named now) will resize your PAT entry to match the start vma range. I wonder what would happen in the "punch a hole" scenario where we move/MAP_FIXED/unmap the middle of a vma. My point is that it is unclear as to how the interval tree tracks the PAT to vma mappings (a more clean comment would be nice). It seems inconsistent and the situation you found should be handled in the translation layer, and not the generic code. Thanks, Liam