On Wed, Jul 26, 2023 at 6:49 AM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>
> On 7/15/23 14:06, Yu Zhao wrote:
> > On Wed, Jul 12, 2023 at 12:31 AM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> >>
> >> On Wed, Jul 12, 2023 at 12:02 AM Yin Fengwei <fengwei.yin@xxxxxxxxx> wrote:
> >>>
> >>> The current kernel only locks base-size folios during the mlock
> >>> syscall. Add large folio support with the following rules:
> >>> - Only mlock a large folio when it is within the VM_LOCKED VMA
> >>>   range.
> >>>
> >>> - If there is a COW folio, mlock the COW folio as well, since it
> >>>   is also within the VM_LOCKED VMA range.
> >>>
> >>> - munlock applies to a large folio that is within the VMA range
> >>>   or crosses the VMA boundary.
> >>>
> >>> The last rule handles the case where a large folio is mlocked and
> >>> the VMA is later split in the middle of the folio, so that the
> >>> folio ends up crossing a VMA boundary.
> >>>
> >>> Signed-off-by: Yin Fengwei <fengwei.yin@xxxxxxxxx>
> >>> ---
> >>>  mm/mlock.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++---
> >>>  1 file changed, 99 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/mm/mlock.c b/mm/mlock.c
> >>> index 0a0c996c5c214..f49e079066870 100644
> >>> --- a/mm/mlock.c
> >>> +++ b/mm/mlock.c
> >>> @@ -305,6 +305,95 @@ void munlock_folio(struct folio *folio)
> >>>          local_unlock(&mlock_fbatch.lock);
> >>>  }
> >>>
> >>> +static inline bool should_mlock_folio(struct folio *folio,
> >>> +                struct vm_area_struct *vma)
> >>> +{
> >>> +        if (vma->vm_flags & VM_LOCKED)
> >>> +                return (!folio_test_large(folio) ||
> >>> +                        folio_within_vma(folio, vma));
> >>> +
> >>> +        /*
> >>> +         * For unlock, allow munlocking a large folio that is only
> >>> +         * partially mapped into the VMA, as the folio may have
> >>> +         * been mlocked and the VMA split afterwards.
> >>> +         *
> >>> +         * Under memory pressure such a large folio can be split,
> >>> +         * and the pages that are not in a VM_LOCKED VMA can then
> >>> +         * be reclaimed.
> >>> +         */
> >>> +        return true;
> >>
> >> Looks good, or just
> >>
> >> should_mlock_folio() // or whatever name you see fit, can_mlock_folio()?
> >> {
> >>         return !(vma->vm_flags & VM_LOCKED) || folio_within_vma();
> >> }
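Spelled out as a complete function, that suggestion would look roughly
like the sketch below. This is a sketch only: it reuses the arguments of
the original function, and it relies on folio_within_vma() also
returning true for a small folio mapped in the VMA, which is what makes
dropping the folio_test_large() check safe.

static inline bool should_mlock_folio(struct folio *folio,
                struct vm_area_struct *vma)
{
        /*
         * mlock only a folio that sits entirely inside the VM_LOCKED
         * VMA; munlock unconditionally, so that a folio left
         * straddling a VMA split can still be moved off the
         * unevictable list.
         */
        return !(vma->vm_flags & VM_LOCKED) ||
               folio_within_vma(folio, vma);
}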
> >>> +}
> >>> +
> >>> +static inline unsigned int get_folio_mlock_step(struct folio *folio,
> >>> +                pte_t pte, unsigned long addr, unsigned long end)
> >>> +{
> >>> +        unsigned int nr;
> >>> +
> >>> +        nr = folio_pfn(folio) + folio_nr_pages(folio) - pte_pfn(pte);
> >>> +        return min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
> >>> +}
> >>> +
> >>> +void mlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
> >>> +                pte_t *pte, unsigned long addr, unsigned int nr)
> >>> +{
> >>> +        struct folio *cow_folio;
> >>> +        unsigned int step = 1;
> >>> +
> >>> +        mlock_folio(folio);
> >>> +        if (nr == 1)
> >>> +                return;
> >>> +
> >>> +        for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
> >>> +                pte_t ptent;
> >>> +
> >>> +                step = 1;
> >>> +                ptent = ptep_get(pte);
> >>> +
> >>> +                if (!pte_present(ptent))
> >>> +                        continue;
> >>> +
> >>> +                cow_folio = vm_normal_folio(vma, addr, ptent);
> >>> +                if (!cow_folio || cow_folio == folio)
> >>> +                        continue;
> >>> +
> >>> +                mlock_folio(cow_folio);
> >>> +                step = get_folio_mlock_step(folio, ptent,
> >>> +                                addr, addr + (nr << PAGE_SHIFT));
> >>> +        }
> >>> +}
> >>> +
> >>> +void munlock_folio_range(struct folio *folio, struct vm_area_struct *vma,
> >>> +                pte_t *pte, unsigned long addr, unsigned int nr)
> >>> +{
> >>> +        struct folio *cow_folio;
> >>> +        unsigned int step = 1;
> >>> +
> >>> +        munlock_folio(folio);
> >>> +        if (nr == 1)
> >>> +                return;
> >>> +
> >>> +        for (; nr > 0; pte += step, addr += (step << PAGE_SHIFT), nr -= step) {
> >>> +                pte_t ptent;
> >>> +
> >>> +                step = 1;
> >>> +                ptent = ptep_get(pte);
> >>> +
> >>> +                if (!pte_present(ptent))
> >>> +                        continue;
> >>> +
> >>> +                cow_folio = vm_normal_folio(vma, addr, ptent);
> >>> +                if (!cow_folio || cow_folio == folio)
> >>> +                        continue;
> >>> +
> >>> +                munlock_folio(cow_folio);
> >>> +                step = get_folio_mlock_step(folio, ptent,
> >>> +                                addr, addr + (nr << PAGE_SHIFT));
> >>> +        }
> >>> +}
> >>
> >> I'll finish the above later.
> >
> > There is a problem here that I didn't have time to elaborate on: we
> > can't mlock() a folio that is within the range but not fully mapped,
> > because such a folio can be on the deferred split queue. When the
> > split happens, the resulting folios that are not mapped by this VMA
> > (but are mapped into other VMAs) will be stranded on the unevictable
> > LRU.
>
> Having checked the mremap case over the past few days, I agree we
> shouldn't treat a folio that is in the range but not fully mapped as an
> in-range folio.
>
> In the mremap case, it's possible that the folio is not on the deferred
> split queue, yet part of the folio is mapped into a VM_LOCKED VMA and
> the other part into a non-VM_LOCKED VMA. Such a folio won't be split
> via the deferred split queue because it isn't on that queue, so page
> reclaim should be allowed to pick this folio up, split it, and reclaim
> the pages in the non-VM_LOCKED VMA. Hence we can't mlock such a folio.
>
> The same thing can happen with madvise_cold_or_pageout_pte_range().
> I will update folio_within_vma() to check the PTEs as well.

Thanks, and I think we should move forward with this series and fix the
potential mlock race problem separately, since it's not caused by this
series. WDYT?
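P.S. For reference, the kind of PTE check discussed above might look
something like the following. This is an untested sketch with a made-up
name, not the actual folio_within_vma() change: it assumes the caller
passes the virtual address and the PTE of the folio's first page, and it
naively assumes all of the folio's PTEs live in the same page table (a
real version would have to handle crossing a PMD boundary).

static bool folio_fully_mapped_in_vma(struct folio *folio,
                struct vm_area_struct *vma, unsigned long start,
                pte_t *pte)
{
        unsigned long pfn = folio_pfn(folio);
        long i, nr = folio_nr_pages(folio);

        /* The address-range check done today. */
        if (start < vma->vm_start ||
            start + (nr << PAGE_SHIFT) > vma->vm_end)
                return false;

        /* Additionally require every page to be mapped right here. */
        for (i = 0; i < nr; i++) {
                pte_t ptent = ptep_get(pte + i);

                /*
                 * A swap/migration entry, or a PTE pointing at a
                 * different page (e.g. a COWed one), means part of the
                 * folio is not mapped in this VMA, so reclaim must
                 * stay allowed to split and reclaim it.
                 */
                if (!pte_present(ptent) || pte_pfn(ptent) != pfn + i)
                        return false;
        }
        return true;
}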