Hi Ryan, On 8/4/2023 4:46 PM, Ryan Roberts wrote: > On 04/08/2023 00:15, Yin, Fengwei wrote: >> >> >> On 8/3/2023 9:20 PM, Ryan Roberts wrote: >>> On 03/08/2023 11:48, Yin Fengwei wrote: >>>> >>>> >>>> On 8/3/23 17:58, Ryan Roberts wrote: >>>>> On 28/07/2023 08:09, Yin Fengwei wrote: >>>>>> It will be used to check whether the folio is mapped to specific >>>>>> VMA and whether the mapping address of folio is in the range. >>>>>> >>>>>> Also a helper function folio_within_vma() to check whether folio >>>>>> is in the range of vma based on folio_in_range(). >>>>>> >>>>>> Signed-off-by: Yin Fengwei <fengwei.yin@xxxxxxxxx> >>>>>> --- >>>>>> mm/internal.h | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>>> 1 file changed, 69 insertions(+) >>>>>> >>>>>> diff --git a/mm/internal.h b/mm/internal.h >>>>>> index 5a03bc4782a2..63de32154a48 100644 >>>>>> --- a/mm/internal.h >>>>>> +++ b/mm/internal.h >>>>>> @@ -585,6 +585,75 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma, >>>>>> bool write, int *locked); >>>>>> extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags, >>>>>> unsigned long bytes); >>>>>> + >>>>> >>>>> Hi Yin, >>>>> >>>>> I wanted to take a step back and consolidate my concerns about this function. I >>>>> should say that my understanding of the pgoff and index stuff is shaky and I >>>>> could therefore be wrong about some of this; if this is the case, then sorry for >>>>> the noise! But something about this function doesn't smell right to me, so I >>>>> figure its better to raise it... >>>> No worries. And thank you for looking at the code ans share the comments. >>>> That helps me a lot. Really appreciate it. >>>> >>>>> >>>>>> +/* >>>>>> + * Check whether the folio is in specific range >>>>> >>>>> What exactly is the function trying to do? I *think* the intention is to figure >>>>> out if a folio is fully and contiguously mapped within a range of virtual >>>>> addresses belonging to a specific virtual address space? And I assume you want >>>>> the answer to be precise? I'm assuming 'yes' for the below. >>>>> >>>>> I think the only way to do this is to actually check each PTE. And that causes a >>>>> problem, because a folio can straddle multiple PTE tables, which causes PTL >>>>> locking issues, and means having just a *pte pointer is insufficient if we need >>>>> to traverse multiple pte tables. So perhaps you need to allow for a false >>>>> negative in the case that the folio is not contained within a single pte table. >>>> Let's check the users of this function first: >>>> mlock/munlock: Needs only care whether the folio is in the range of VM_LOCKED VMA >>>> madvise: Needs to check whether the folio is in the range of VMA and >>>> range [start, end). >>> >>> Sure but those 2 ranges (the vma and the user-supplied range) are known to >>> intersect, right? So really there is only one range of interest; the >>> intersection. I would argue that should be done in a higher level wrapper, and >>> is not part of the core primitive to work out if a folio is mapped to a >>> particular range of virtual addresses. >>>> This range is from user space. It could contain >>>> VMA range (in this case, we only need to check VMA range) or is contained >>>> by VMA range. >>>> >>>> So we check: >>>> 1. Whether the folio is in the range. >>>> >>>> To do this, we need to first check whether the folio->index is in the >>>> VMA range. If not, we know the folio is not in VMA range. Just return false >>>> without further check. >>> >>> Right, so if the check fails, the folio is definitely not mapped by the vma, but >>> if it passes, it *might* be. Further checks are required. So why is this useful? >>> Why not just do the check that gives you the precise answer and skip this? >> What check can give precise answer? I don't think checking PTE is right at this >> stage. For following case (assume the folio is mapped in same page table and just >> care about the VMA2 range): >> >> |-----------vma1-----------|------------vma2---------| >> |-------folio-----| >> >> How can we tell whether the folio is in the range of VMA2 by just checking PTE? >> And in this case, why not bail out if we see folio is out range of VMA2? >> >> >> So only need to check PTE if we are sure the folio is in the range of VMA2: >> |-----------vma1-----------|------------vma2---------| >> |-------folio-----| >> >>> >>>> >>>> Then, we will check whether the folio is in the range which is defined as >>>> [max(start, vma->vm_start), min(end, vma->vm_end)). >>> >>> As par above comment, I would personally factor this out of the primitive. >>> >>>> >>>> >>>> This is the version in RFC as I was not aware of mremap case and forgot the >>>> page cache case. So if no mremap with an anonymous folio, this check is enough. >>>> But we need to add PTE check for mremap and page cache cases. >>>> >>>> 2. Check PTE for mremap in middle of large folio and page cache >>>> If the folio is normal 4K and if the folio is in the range, it's not possible >>>> the folio is partially mapped to two different VMA. So we are sure this folio >>>> is in the range. >>> >>> But you test for small folio and exit early without checking the PTE. Given the >>> index check only told you that the folio *might be* mapped, I don't see how you >>> can return true at this point for a small folio, without looking at the PTE? >> Yes. We can do this. You can check the discussion on the RFC version. I did move >> the normal 4K folio check out of this function. Yu and Huge suggested to just ignore >> the normal 4K as this API can' handle it. >> >> The thing changed from RFC is that we need to check PTE here. I tried to avoid to >> check PTE even for normal 4K folio. > > I'll go read the RFC. I've made my point; if you and others are convinced this > is correct, then fair enough. Thanks a lot for your comments. I will work on new version based on your and Andrew's comments. Sometimes, I blindly assume other people know the assumptions (like at least one page mapped in the VMA) I made. Which did make the review harder. I will try to improve on this in the future. Regards Yin, Fengwei > >> >>> >>> folio->index is just the page offset within the file it maps (or the (original) >>> VA/PAGE_SIZE for anon memory - I think?). And vma->vm_pgoff is the page offset >>> within the file that the vma starts at. So you could have a folio that contains >>> the contents of 1 file and a vma that maps another file, and the folio's index >>> might fall within the VMA, but it doesn't mean the folio is mapped by the vma; >>> its a different file. >>> >>> Or perhaps there is an assumption in the function's spec that the folio is known >>> to have at least one page mapped in the vma? That would change things... but you >>> should make that very clear in the spec. And in that case, you can move the >>> small folio check to the start and return true immediately. >> Yes. "At least one page mapped in VMA" is assumption here. I will make it clear in >> the comment. >> >> >> Regards >> Yin, Fengwei >> >>> >>>> >>>> Currently, check PTE didn't handle the case you mentioned. But my plan is >>>> checking whether the folio is mapped cross page table (by checking whether >>>> the folio start vaddr and end vaddr has same value for low aligned to PMD_SIZE). >>>> If it cross, I will treat it as out of VMA range. Otherwise, we can reuse >>>> current check because we know the PTE pointer is always valid. >>>> >>>> Obviously, the PTE checking needs hold pte lock as you pointed out. >>>> >>>> >>>> My understanding: >>>> pgoff is important if we have folio and VMA and want to know the virtual address of >>>> the folio mapped to. Like first, check whether pgoff of folio belongs to VMA, then get >>>> vaddr by: >>>> addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); >>>> rmap_walk() also depends on pgoff. You can check the vma_address() and rmap_walk() >>>> implementation. >>> >>> Yes but the rmap is only traversing vmas that are already known to map the same >>> file that the folio contains pages for. (See rmap_walk_file(): it grabs the >>> "mapping" from the folio, then traverses the list of vmas within the range of >>> interest "vma_interval_tree_foreach"). Its a bit more complicated for anon >>> memory but I think the princple is the same. >>> >>>> >>>>> >>>>>> + * >>>>>> + * First, check whether the folio is in the range of vma. >>>>>> + * Then, check whether the folio is mapped to the range of [start, end]. >>>>>> + * In the end, check whether the folio is fully mapped to the range. >>>>>> + * >>>>>> + * @pte page table pointer will be checked whether the large folio >>>>>> + * is fully mapped to. Currently, if mremap in the middle of >>>>>> + * large folio, the large folio could be mapped to to different >>>>>> + * VMA and address check can't identify this situation. >>>>>> + */ >>>>>> +static inline bool >>>>>> +folio_in_range(struct folio *folio, struct vm_area_struct *vma, >>>>>> + unsigned long start, unsigned long end, pte_t *pte) >>>>> >>>>> The prototype looks odd to me; Fundamentally it seems to me that you need the >>>>> folio, start and end virtual addresses (the range you want to check that it is >>>>> in), the pte pointer and the virtual address that the pte represents. That >>>>> virtual address allows you to figure out the offset between the pa and va. Then >>>>> you can look at all the ptes to figure out if they reference the folio's pfns, >>>>> and use the va to pa mapping info to figure out if its within the specified range. >>>>> >>>>> I don't really understand why the vma is useful. >>>>> >>>>>> +{ >>>>>> + pte_t ptent; >>>>>> + unsigned long i, nr = folio_nr_pages(folio); >>>>>> + pgoff_t pgoff, addr; >>>>>> + unsigned long vma_pglen = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT; >>>>>> + >>>>>> + VM_WARN_ON_FOLIO(folio_test_ksm(folio), folio); >>>>>> + >>>>>> + if (start < vma->vm_start) >>>>>> + start = vma->vm_start; >>>>>> + if (end > vma->vm_end) >>>>>> + end = vma->vm_end; >>>>>> + >>>>>> + pgoff = folio_pgoff(folio); >>>>>> + /* if folio start address is not in vma range */ >>>>>> + if (pgoff < vma->vm_pgoff || pgoff > vma->vm_pgoff + vma_pglen) >>>>>> + return false; >>>>> >>>>> Why is this pgoff calculation helpful? Surely it's only useful if the folio >>>>> belongs to the same file that the vma is mapping? Otherwise the folio's pgoff >>>>> might be referring to a completely different file than the vma's vm_pgoff. So >>>>> you will get spurious results. >>>>> >>>>>> + >>>>>> + addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); >>>>>> + if (addr < start || end - addr < folio_size(folio)) >>>>>> + return false; >>>>>> + >>>>>> + /* not necessary to check pte for none large folio */ >>>>>> + if (!folio_test_large(folio)) >>>>>> + return true; >>>>> >>>>> I don't understand why you don't need to check the pte for a small folio? As >>>>> above, if the folio doesn't belong to the file that the vma is mapping the above >>>>> checks seem wrong and you can't conclude that the folio is mapped in the range >>>>> without looking at the pte? >>>>> >>>>>> + >>>>>> + if (!pte) >>>>>> + return false; >>>>>> + >>>>>> + /* check whether parameter pte is associated with folio */ >>>>>> + ptent = ptep_get(pte); >>>>>> + if (pte_none(ptent) || !pte_present(ptent) || >>>>>> + pte_pfn(ptent) - folio_pfn(folio) >= nr) >>>>>> + return false; >>>>>> + >>>>>> + pte -= pte_pfn(ptent) - folio_pfn(folio); >>>>> >>>>> I think this could wander off the front or back of the pte table into arbitrary >>>>> memory if the folio is strddling multiple pte tables. >>>>> >>>>>> + for (i = 0; i < nr; i++, pte++) { >>>>>> + ptent = ptep_get(pte); >>>>>> + >>>>>> + if (pte_none(ptent) || !pte_present(ptent) || >>>>>> + pte_pfn(ptent) - folio_pfn(folio) >= nr) >>>>> >>>>> Doesn't !pte_present() also cover pte_none()? So I think the pte_none() check is >>>>> redundant? >>>> I think you are right. pte_none() can be removed here. >>>> >>>> >>>> Regards >>>> Yin, Fengwei >>>> >>>>> >>>>> Thanks, >>>>> Ryan >>>>> >>>>> >>>>>> + return false; >>>>>> + } >>>>>> + >>>>>> + return true; >>>>>> +} >>>>>> + >>>>>> +static inline bool >>>>>> +folio_within_vma(struct folio *folio, struct vm_area_struct *vma, pte_t *pte) >>>>>> +{ >>>>>> + return folio_in_range(folio, vma, vma->vm_start, vma->vm_end, pte); >>>>>> +} >>>>>> + >>>>>> /* >>>>>> * mlock_vma_folio() and munlock_vma_folio(): >>>>>> * should be called with vma's mmap_lock held for read or write, >>>>> >>> >