On 04/08/2023 00:15, Yin, Fengwei wrote: > > > On 8/3/2023 9:20 PM, Ryan Roberts wrote: >> On 03/08/2023 11:48, Yin Fengwei wrote: >>> >>> >>> On 8/3/23 17:58, Ryan Roberts wrote: >>>> On 28/07/2023 08:09, Yin Fengwei wrote: >>>>> It will be used to check whether the folio is mapped to specific >>>>> VMA and whether the mapping address of folio is in the range. >>>>> >>>>> Also a helper function folio_within_vma() to check whether folio >>>>> is in the range of vma based on folio_in_range(). >>>>> >>>>> Signed-off-by: Yin Fengwei <fengwei.yin@xxxxxxxxx> >>>>> --- >>>>> mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>> 1 file changed, 69 insertions(+) >>>>> >>>>> diff --git a/mm/internal.h b/mm/internal.h >>>>> index 5a03bc4782a2..63de32154a48 100644 >>>>> --- a/mm/internal.h >>>>> +++ b/mm/internal.h >>>>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma, >>>>> bool write, int *locked); >>>>> extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags, >>>>> unsigned long bytes); >>>>> + >>>> >>>> Hi Yin, >>>> >>>> I wanted to take a step back and consolidate my concerns about this function. I >>>> should say that my understanding of the pgoff and index stuff is shaky and I >>>> could therefore be wrong about some of this; if this is the case, then sorry for >>>> the noise! But something about this function doesn't smell right to me, so I >>>> figure its better to raise it... >>> No worries. And thank you for looking at the code ans share the comments. >>> That helps me a lot. Really appreciate it. >>> >>>> >>>>> +/* >>>>> + * Check whether the folio is in specific range >>>> >>>> What exactly is the function trying to do? I *think* the intention is to figure >>>> out if a folio is fully and contiguously mapped within a range of virtual >>>> addresses belonging to a specific virtual address space? And I assume you want >>>> the answer to be precise? I'm assuming 'yes' for the below. >>>> >>>> I think the only way to do this is to actually check each PTE. And that causes a >>>> problem, because a folio can straddle multiple PTE tables, which causes PTL >>>> locking issues, and means having just a *pte pointer is insufficient if we need >>>> to traverse multiple pte tables. So perhaps you need to allow for a false >>>> negative in the case that the folio is not contained within a single pte table. >>> Let's check the users of this function first: >>> mlock/munlock: Needs only care whether the folio is in the range of VM_LOCKED VMA >>> madvise: Needs to check whether the folio is in the range of VMA and >>> range [start, end). >> >> Sure but those 2 ranges (the vma and the user-supplied range) are known to >> intersect, right? So really there is only one range of interest; the >> intersection. I would argue that should be done in a higher level wrapper, and >> is not part of the core primitive to work out if a folio is mapped to a >> particular range of virtual addresses. >>> This range is from user space. It could contain >>> VMA range (in this case, we only need to check VMA range) or is contained >>> by VMA range. >>> >>> So we check: >>> 1. Whether the folio is in the range. >>> >>> To do this, we need to first check whether the folio->index is in the >>> VMA range. If not, we know the folio is not in VMA range. Just return false >>> without further check. >> >> Right, so if the check fails, the folio is definitely not mapped by the vma, but >> if it passes, it *might* be. Further checks are required. So why is this useful? >> Why not just do the check that gives you the precise answer and skip this? > What check can give precise answer? I don't think checking PTE is right at this > stage. For following case (assume the folio is mapped in same page table and just > care about the VMA2 range): > > |-----------vma1-----------|------------vma2---------| > |-------folio-----| > > How can we tell whether the folio is in the range of VMA2 by just checking PTE? > And in this case, why not bail out if we see folio is out range of VMA2? > > > So only need to check PTE if we are sure the folio is in the range of VMA2: > |-----------vma1-----------|------------vma2---------| > |-------folio-----| > >> >>> >>> Then, we will check whether the folio is in the range which is defined as >>> [max(start, vma->vm_start), min(end, vma->vm_end)). >> >> As par above comment, I would personally factor this out of the primitive. >> >>> >>> >>> This is the version in RFC as I was not aware of mremap case and forgot the >>> page cache case. So if no mremap with an anonymous folio, this check is enough. >>> But we need to add PTE check for mremap and page cache cases. >>> >>> 2. Check PTE for mremap in middle of large folio and page cache >>> If the folio is normal 4K and if the folio is in the range, it's not possible >>> the folio is partially mapped to two different VMA. So we are sure this folio >>> is in the range. >> >> But you test for small folio and exit early without checking the PTE. Given the >> index check only told you that the folio *might be* mapped, I don't see how you >> can return true at this point for a small folio, without looking at the PTE? > Yes. We can do this. You can check the discussion on the RFC version. I did move > the normal 4K folio check out of this function. Yu and Huge suggested to just ignore > the normal 4K as this API can' handle it. > > The thing changed from RFC is that we need to check PTE here. I tried to avoid to > check PTE even for normal 4K folio. I'll go read the RFC. I've made my point; if you and others are convinced this is correct, then fair enough. > >> >> folio->index is just the page offset within the file it maps (or the (original) >> VA/PAGE_SIZE for anon memory - I think?). And vma->vm_pgoff is the page offset >> within the file that the vma starts at. So you could have a folio that contains >> the contents of 1 file and a vma that maps another file, and the folio's index >> might fall within the VMA, but it doesn't mean the folio is mapped by the vma; >> its a different file. >> >> Or perhaps there is an assumption in the function's spec that the folio is known >> to have at least one page mapped in the vma? That would change things... but you >> should make that very clear in the spec. And in that case, you can move the >> small folio check to the start and return true immediately. > Yes. "At least one page mapped in VMA" is assumption here. I will make it clear in > the comment. > > > Regards > Yin, Fengwei > >> >>> >>> Currently, check PTE didn't handle the case you mentioned. But my plan is >>> checking whether the folio is mapped cross page table (by checking whether >>> the folio start vaddr and end vaddr has same value for low aligned to PMD_SIZE). >>> If it cross, I will treat it as out of VMA range. Otherwise, we can reuse >>> current check because we know the PTE pointer is always valid. >>> >>> Obviously, the PTE checking needs hold pte lock as you pointed out. >>> >>> >>> My understanding: >>> pgoff is important if we have folio and VMA and want to know the virtual address of >>> the folio mapped to. Like first, check whether pgoff of folio belongs to VMA, then get >>> vaddr by: >>> addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); >>> rmap_walk() also depends on pgoff. You can check the vma_address() and rmap_walk() >>> implementation. >> >> Yes but the rmap is only traversing vmas that are already known to map the same >> file that the folio contains pages for. (See rmap_walk_file(): it grabs the >> "mapping" from the folio, then traverses the list of vmas within the range of >> interest "vma_interval_tree_foreach"). Its a bit more complicated for anon >> memory but I think the princple is the same. >> >>> >>>> >>>>> + * >>>>> + * First, check whether the folio is in the range of vma. >>>>> + * Then, check whether the folio is mapped to the range of [start, end]. >>>>> + * In the end, check whether the folio is fully mapped to the range. >>>>> + * >>>>> + * @pte page table pointer will be checked whether the large folio >>>>> + * is fully mapped to. Currently, if mremap in the middle of >>>>> + * large folio, the large folio could be mapped to to different >>>>> + * VMA and address check can't identify this situation. >>>>> + */ >>>>> +static inline bool >>>>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma, >>>>> + unsigned long start, unsigned long end, pte_t *pte) >>>> >>>> The prototype looks odd to me; Fundamentally it seems to me that you need the >>>> folio, start and end virtual addresses (the range you want to check that it is >>>> in), the pte pointer and the virtual address that the pte represents. That >>>> virtual address allows you to figure out the offset between the pa and va. Then >>>> you can look at all the ptes to figure out if they reference the folio's pfns, >>>> and use the va to pa mapping info to figure out if its within the specified range. >>>> >>>> I don't really understand why the vma is useful. >>>> >>>>> +{ >>>>> + pte_t ptent; >>>>> + unsigned long i, nr = folio_nr_pages(folio); >>>>> + pgoff_t pgoff, addr; >>>>> + unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT; >>>>> + >>>>> + VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio); >>>>> + >>>>> + if (start < vma->vm_start) >>>>> + start = vma->vm_start; >>>>> + if (end > vma->vm_end) >>>>> + end = vma->vm_end; >>>>> + >>>>> + pgoff = folio_pgoff(folio); >>>>> + /* if folio start address is not in vma range */ >>>>> + if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen) >>>>> + return false; >>>> >>>> Why is this pgoff calculation helpful? Surely it's only useful if the folio >>>> belongs to the same file that the vma is mapping? Otherwise the folio's pgoff >>>> might be referring to a completely different file than the vma's vm_pgoff. So >>>> you will get spurious results. >>>> >>>>> + >>>>> + addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); >>>>> + if (addr < start || end - addr < folio_size(folio)) >>>>> + return false; >>>>> + >>>>> + /* not necessary to check pte for none large folio */ >>>>> + if (!folio_test_large(folio)) >>>>> + return true; >>>> >>>> I don't understand why you don't need to check the pte for a small folio? As >>>> above, if the folio doesn't belong to the file that the vma is mapping the above >>>> checks seem wrong and you can't conclude that the folio is mapped in the range >>>> without looking at the pte? >>>> >>>>> + >>>>> + if (!pte) >>>>> + return false; >>>>> + >>>>> + /* check whether parameter pte is associated with folio */ >>>>> + ptent = ptep_get(pte); >>>>> + if (pte_none(ptent) || !pte_present(ptent) || >>>>> + pte_pfn(ptent) - folio_pfn(folio) >= nr) >>>>> + return false; >>>>> + >>>>> + pte -= pte_pfn(ptent) - folio_pfn(folio); >>>> >>>> I think this could wander off the front or back of the pte table into arbitrary >>>> memory if the folio is strddling multiple pte tables. >>>> >>>>> + for (i = 0; i < nr; i++, pte++) { >>>>> + ptent = ptep_get(pte); >>>>> + >>>>> + if (pte_none(ptent) || !pte_present(ptent) || >>>>> + pte_pfn(ptent) - folio_pfn(folio) >= nr) >>>> >>>> Doesn't !pte_present() also cover pte_none()? So I think the pte_none() check is >>>> redundant? >>> I think you are right. pte_none() can be removed here. >>> >>> >>> Regards >>> Yin, Fengwei >>> >>>> >>>> Thanks, >>>> Ryan >>>> >>>> >>>>> + return false; >>>>> + } >>>>> + >>>>> + return true; >>>>> +} >>>>> + >>>>> +static inline bool >>>>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte) >>>>> +{ >>>>> + return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte); >>>>> +} >>>>> + >>>>> /* >>>>> * mlock_vma_folio() and munlock_vma_folio(): >>>>> * should be called with vma's mmap_lock held for read or write, >>>> >>