On Wed, Sep 04, 2024 at 02:40:34PM -0400, Liam R. Howlett wrote:
> * Nam Cao <namcao@xxxxxxxxxxxxx> [240904 03:59]:
> > On Tue, Sep 03, 2024 at 11:56:57AM -0400, Liam R. Howlett wrote:
> > > * Nam Cao <namcao@xxxxxxxxxxxxx> [240903 06:36]:
>
> ...
>
> > > > On Tue, Aug 27, 2024 at 12:01:28PM -0400, Liam R. Howlett wrote:
> > > > > * Nam Cao <namcao@xxxxxxxxxxxxx> [240827 03:59]:
> > > > > > On Mon, Aug 26, 2024 at 09:58:11AM -0400, Liam R. Howlett wrote:
> > > > > > > * Nam Cao <namcao@xxxxxxxxxxxxx> [240825 11:29]:
> > > > > So the interval split should occur when the PAT changes and needs to be
> > > > > tracked differently. This does not happen when the vma is split - it
> > > > > happens when a vma is removed or when the PAT is changed.
> > > > >
> > > > > And, indeed, for the mremap() shrinking case, you already support
> > > > > finding a range by just the end and have an abstraction layer. The
> > > > > problem here is that you don't check by the start - but you could. You
> > > > > could make the change to memtype_erase() to search for the exact, end,
> > > > > or start and do what is necessary to shrink off the front of a region as
> > > > > well.
> > > >
> > > > I thought about this solution initially, but since the interval tree allow
> > > > overlapping ranges, it can be tricky to determine the "best match" out
> > > > of the overlapping ranges. But I agree that this approach (if possible)
> > > > would be better than the current patch.
> > > >
> > > > Let me think about this some more, and I will come back later.
> > >
> > > Reading this some more, I believe you can detect the correct address by
> > > matching the start address with the smallest end address (the smallest
> > > interval has to be the entry created by the vma mapping).
> >
> > I don't think that would cover all cases. For example, if the tree has 2
> > intervals: [0x0000-0x2000] and [0x1000-0x3000].
> > Now, the mm subsystem tells
> > us that the interval [0x1000-0x2000] needs to be removed (e.g. user does
> > munmap()), your proposal would match this to the second interval. After the
> > removal, the tree has [0-0x2000] and [0x2000-0x3000]
> >
> > Then, mm subsystem says [0x1000-0x3000] should be removed, and that doesn't
> > match anything. Turns out, the first removal was meant for the first
> > interval, but we didn't have enough information at the time to determine
> > that.
> >
> > Bottom line is, it is not possible to correctly match [0x1000-0x2000] to
> > [0x0000-0x2000] and [0x1000-0x3000]: both matches can be valid.
>
> But those ranges won't exist. What appears to be happening in this code
> is that there are higher levels of non-overlapping ranges with
> memory (cache) types (or none are defined) , which are tracked on page
> granularity. So we can't have a page that has two memory type.
>
> The overlapping happens later, when the vmas are mapped. And we are
> ensuring that the mapping of the vmas match the higher, larger areas.
> The vmas are inserted with memtype_check_insert() which calls
> memtype_check_conflict() that ensures any overlapping areas have the
> same type as the one being added, so either there is no match or the
> interval(s) with this page is set to a specific type. I suspect there
> can only really be one range.
>
> So I don't think overlapping areas like above could exist. The vma
> cache type has to be the same throughout. It has to be the same type as
> all overlapping areas.

Dave agreed with you, so I am likely the confused one, but I still think the
overlapping areas as I described do exist.
For example, this userspace code:

	#include <stdio.h>
	#include <sys/mman.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <errno.h>

	#define PCI_BAR "/sys/devices/pci0000:00/0000:00:02.0/resource0"

	int main(void)
	{
		void *p1, *p2;
		int fd, ret;

		fd = open(PCI_BAR, O_RDWR);

		// track 0xfd000000-0xfd001fff
		p1 = mmap(0, 0x2000, PROT_READ, MAP_SHARED, fd, 0);

		// track 0xfd001000-0xfd002fff
		p2 = mmap(0, 0x2000, PROT_READ, MAP_SHARED, fd, 0x1000);

		// untrack 0xfd001000-0xfd001fff
		munmap(p2, 0x1000);

		// untrack 0xfd000000-0xfd001fff
		munmap(p1, 0x2000);

		// untrack 0xfd002000-0xfd002fff
		munmap(p2 + 0x1000, 0x1000);
	}

If I pause this program right after the two mmap(), before any munmap(), then:

$ cat /sys/kernel/debug/x86/pat_memtype_list
PAT memtype list:
PAT: [mem 0x00000000bffe0000-0x00000000bffe2000] write-back
PAT: [mem 0x00000000bffe1000-0x00000000bffe2000] write-back
PAT: [mem 0x00000000fd000000-0x00000000fd002000] uncached-minus    <-- what I described
PAT: [mem 0x00000000fd001000-0x00000000fd003000] uncached-minus    <-- what I described
PAT: [mem 0x00000000febc0000-0x00000000febe0000] uncached-minus
PAT: [mem 0x00000000fed00000-0x00000000fed01000] uncached-minus
PAT: [mem 0x00000000fed00000-0x00000000fed01000] uncached-minus

The two mmap() calls create the overlapping intervals I described.

Then I let the C program run to completion and checked dmesg:

x86/PAT: memtype_reserve added [mem 0xfd000000-0xfd001fff], track uncached-minus, req uncached-minus, ret uncached-minus
x86/PAT: Overlap at 0xfd000000-0xfd002000
x86/PAT: memtype_reserve added [mem 0xfd001000-0xfd002fff], track uncached-minus, req uncached-minus, ret uncached-minus
x86/PAT: memtype_free request [mem 0xfd001000-0xfd001fff]
x86/PAT: test:178 freeing invalid memtype [mem 0xfd000000-0xfd001fff]
x86/PAT: memtype_free request [mem 0xfd002000-0xfd002fff]

The problem I am raising is the first munmap() call: [0xfd001000-0xfd001fff]
would be untracked, but there is no way to tell for sure which interval it
belongs to. The current implementation matches it to the first range, but it
actually belongs to the second range. This incorrect matching results in the
"freeing invalid memtype" later on.

Hopefully I'm not being an idiot and wasting everyone's time.

>
> Also, your ranges are inclusive while the ranges passed in seem to be
> exclusive on the end address, so your example would look more like:
> [0x0000-0x2000) [0x2000-0x3000).
>
> You can see this documented in memtype_reserve() where sanitize_phys()
> is called.
>
> So we could have a VMA of [0x1000-0x2000), but this vma would have to be
> in the first range. [0x0000-0x0FFF) would also be in the first range.
>
> I think that searching for the smallest area containing the entry will
> yield the desired entry in the interval tree.
>
> Note that there is debugging support in the Documentation so you can go
> look at what is in there with debugfs.
>
> ...
>
> > One solution I can think of: stop allowing overlapping intervals. Instead,
> > the overlapping portions would be split into new intervals with some
> > reference counting.
> > memtype_erase() would need to be modified to:
> >   - assemble the potentially split intervals
> >   - split the intervals if needed
> > The point is, there wouldn't be any confusion with matching overlapping
> > intervals.
> >
> > I will give it a try when I have some time, unless someone sees a problem
> > with it or has a better idea.
>
> I don't think this will work at all. It is dependent of overlapping
> ranges to ensure the vmas match what is allowed in certain areas.

We can ensure that the cache type is the same, before splitting, so I think it
can work? But let's clear up the other disagreement first.

Best regards,
Nam
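
For reference, a minimal sketch of the matching ambiguity discussed in this
mail. This is not kernel code: struct range, is_candidate(),
count_candidates() and example_tracked are hypothetical names made up for
illustration, and a plain array stands in for the interval tree. Ranges are
end-exclusive, as in the pat_memtype_list output. The sketch only counts how
many tracked intervals a free request could be an exact match for, or a trim
at the front or end of.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one tracked memtype interval: [start, end). */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Return nonzero if freeing [start, end) could apply to r: an exact
 * match, or a trim at the front or at the end of r.
 */
static int is_candidate(const struct range *r, uint64_t start, uint64_t end)
{
	if (start < r->start || end > r->end)
		return 0;	/* request is not contained in r */
	return start == r->start || end == r->end;
}

/* Count the tracked intervals that a free request could belong to. */
static size_t count_candidates(const struct range *tracked, size_t n,
			       uint64_t start, uint64_t end)
{
	size_t i, hits = 0;

	assert(start < end);
	for (i = 0; i < n; i++)
		hits += is_candidate(&tracked[i], start, end) ? 1 : 0;
	return hits;
}

/* The two intervals created by the two mmap() calls in the reproducer. */
static const struct range example_tracked[] = {
	{ 0xfd000000, 0xfd002000 },
	{ 0xfd001000, 0xfd003000 },
};
```

With these two intervals, count_candidates(example_tracked, 2, 0xfd001000,
0xfd002000) returns 2: the request is both the tail of the first interval and
the head of the second, so matching on start or end alone cannot pick one.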