On 7/27/23 00:57, Yu Zhao wrote:
> On Wed, Jul 26, 2023 at 6:49 AM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>>
>>
>>
>> On 7/15/23 14:06, Yu Zhao wrote:
>>> On Wed, Jul 12, 2023 at 12:31 AM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
>>>>
>>>> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>>>>>
>>>>> Current kernel only lock base size folio during mlock syscall.
>>>>> Add large folio support with following rules:
>>>>>   - Only mlock large folio when it's in VM_LOCKED VMA range
>>>>>
>>>>>   - If there is cow folio, mlock the cow folio as cow folio
>>>>>     is also in VM_LOCKED VMA range.
>>>>>
>>>>>   - munlock will apply to the large folio which is in VMA range
>>>>>     or cross the VMA boundary.
>>>>>
>>>>> The last rule is used to handle the case that the large folio is
>>>>> mlocked, later the VMA is split in the middle of large folio
>>>>> and this large folio become cross VMA boundary.
>>>>>
>>>>> Signed-off-by: Yin Fengwei <fengwei.yin@xxxxxxxxx>
>>>>> ---
>>>>>  mm/mlock.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>>>  1 file changed, 99 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/mm/mlock.c b/mm/mlock.c
>>>>> index 0a0c996c5c214..f49e079066870 100644
>>>>> --- a/mm/mlock.c
>>>>> +++ b/mm/mlock.c
>>>>> @@ -305,6 +305,95 @@ void munlock_folio(struct folio *folio)
>>>>>          local_unlock(&mlock_fbatch.lock);
>>>>>  }
>>>>>
>>>>> +static inline bool should_mlock_folio(struct folio *folio,
>>>>> +                               struct vm_area_struct *vma)
>>>>> +{
>>>>> +       if (vma->vm_flags & VM_LOCKED)
>>>>> +               return (!folio_test_large(folio) ||
>>>>> +                       folio_within_vma(folio, vma));
>>>>> +
>>>>> +       /*
>>>>> +        * For unlock, allow munlock large folio which is partially
>>>>> +        * mapped to VMA. As it's possible that large folio is
>>>>> +        * mlocked and VMA is split later.
>>>>> +        *
>>>>> +        * During memory pressure, such kind of large folio can
>>>>> +        * be split. And the pages are not in VM_LOCKed VMA
>>>>> +        * can be reclaimed.
>>>>> +        */
>>>>> +
>>>>> +       return true;
>>>>
>>>> Looks good, or just
>>>>
>>>> should_mlock_folio() // or whatever name you see fit, can_mlock_folio()?
>>>> {
>>>>         return !(vma->vm_flags & VM_LOCKED) || folio_within_vma();
>>>> }
>>>>
>>>>> +}
>>>>> +
>>>>> +static inline unsigned int get_folio_mlock_step(struct folio *folio,
>>>>> +               pte_t pte, unsigned long addr, unsigned long end)
>>>>> +{
>>>>> +       unsigned int nr;
>>>>> +
>>>>> +       nr = folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte);
>>>>> +       return min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
>>>>> +}
>>>>> +
>>>>> +void mlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
>>>>> +               pte_t *pte, unsigned long addr, unsigned int nr)
>>>>> +{
>>>>> +       struct folio *cow_folio;
>>>>> +       unsigned int step = 1;
>>>>> +
>>>>> +       mlock_folio(folio);
>>>>> +       if (nr == 1)
>>>>> +               return;
>>>>> +
>>>>> +       for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
>>>>> +               pte_t ptent;
>>>>> +
>>>>> +               step = 1;
>>>>> +               ptent = ptep_get(pte);
>>>>> +
>>>>> +               if (!pte_present(ptent))
>>>>> +                       continue;
>>>>> +
>>>>> +               cow_folio = vm_normal_folio(vma, addr, ptent);
>>>>> +               if (!cow_folio || cow_folio == folio) {
>>>>> +                       continue;
>>>>> +               }
>>>>> +
>>>>> +               mlock_folio(cow_folio);
>>>>> +               step = get_folio_mlock_step(folio, ptent,
>>>>> +                               addr, addr + (nr << PAGE_SHIFT));
>>>>> +       }
>>>>> +}
>>>>> +
>>>>> +void munlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
>>>>> +               pte_t *pte, unsigned long addr, unsigned int nr)
>>>>> +{
>>>>> +       struct folio *cow_folio;
>>>>> +       unsigned int step = 1;
>>>>> +
>>>>> +       munlock_folio(folio);
>>>>> +       if (nr == 1)
>>>>> +               return;
>>>>> +
>>>>> +       for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
>>>>> +               pte_t ptent;
>>>>> +
>>>>> +               step = 1;
>>>>> +               ptent = ptep_get(pte);
>>>>> +
>>>>> +               if (!pte_present(ptent))
>>>>> +                       continue;
>>>>> +
>>>>> +               cow_folio = vm_normal_folio(vma, addr, ptent);
>>>>> +               if (!cow_folio || cow_folio == folio) {
>>>>> +                       continue;
>>>>> +               }
>>>>> +
>>>>> +               munlock_folio(cow_folio);
>>>>> +               step = get_folio_mlock_step(folio, ptent,
>>>>> +                               addr, addr + (nr << PAGE_SHIFT));
>>>>> +       }
>>>>> +}
>>>>
>>>> I'll finish the above later.
>>>
>>> There is a problem here that I didn't have the time to elaborate: we
>>> can't mlock() a folio that is within the range but not fully mapped
>>> because this folio can be on the deferred split queue. When the split
>>> happens, those unmapped folios (not mapped by this vma but are mapped
>>> into other vmas) will be stranded on the unevictable lru.
>> Checked remap case in past few days, I agree we shouldn't treat a folio
>> in the range but not fully mapped as in_range folio.
>>
>> As for remap case, it's possible that the folio is not in deferred split
>> queue. But part of folio is mapped to VM_LOCKED vma and other part of
>> folio is mapped to none VM_LOCKED vma. In this case, page can't be split
>> as it's not in deferred split queue. So page reclaim should be allowed to
>> pick this folio up, split it and reclaim the pages in none VM_LOCKED vma.
>> So we can't mlock such kind of folio.
>>
>> The same thing can happen with madvise_cold_or_pageout_pte_range().
>> I will update folio_in_vma() to check the PTE also.
>
> Thanks, and I think we should move forward with this series and fix
> the potential mlock race problem separately since it's not caused by
> this series.
>
> WDYT?
Yes. Agree. Will send v3 with remap case covered.


Regards
Yin, Fengwei