On 2/17/21 12:25 PM, Peter Xu wrote:
> On Wed, Feb 10, 2021 at 04:03:22PM -0800, Mike Kravetz wrote:
>> There is no hugetlb specific routine for clearing soft dirty and
>> other references.  The 'default' routines would only clear the
>> VM_SOFTDIRTY flag in the vma.
>>
>> Add new routine specifically for hugetlb vmas.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>> ---
>>  fs/proc/task_mmu.c | 110 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 110 insertions(+)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 829b35016aaa..f06cf9b131a8 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -1116,6 +1116,115 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
>>  }
>>  #endif
>>
>> +#ifdef CONFIG_HUGETLB_PAGE
>> +static inline bool huge_pte_is_pinned(struct vm_area_struct *vma,
>> +                                      unsigned long addr, pte_t pte)
>> +{
>> +        struct page *page;
>> +
>> +        if (likely(!atomic_read(&vma->vm_mm->has_pinned)))
>> +                return false;
>> +        page = pte_page(pte);
>> +        if (!page)
>> +                return false;
>> +        return page_maybe_dma_pinned(page);
>> +}
>> +
>> +static int clear_refs_hugetlb_range(pte_t *ptep, unsigned long hmask,
>> +                                unsigned long addr, unsigned long end,
>> +                                struct mm_walk *walk)
>> +{
>> +        struct clear_refs_private *cp = walk->private;
>> +        struct vm_area_struct *vma = walk->vma;
>> +        struct hstate *h = hstate_vma(walk->vma);
>> +        unsigned long adj_start = addr, adj_end = end;
>> +        spinlock_t *ptl;
>> +        pte_t old_pte, pte;
>> +
>> +        /*
>> +         * clear_refs should only operate on complete vmas.  Therefore,
>> +         * values passed here should be huge page aligned and huge page
>> +         * size in length.  Quick validation before taking any action in
>> +         * case upstream code is changed.
>> +         */
>> +        if ((addr & hmask) != addr || end - addr != huge_page_size(h)) {
>> +                WARN_ONCE(1, "%s passed unaligned address\n", __func__);
>> +                return 1;
>> +        }
>
> I wouldn't worry too much on the interface change - The one who will change the
> interface should guarantee all existing hooks will still work, isn't it? :)

Yeah, I can drop this.

> It's slightly confusing to me on why "clear_refs should only operate on
> complete vmas" is related to the check, though.

Mostly me thinking that since it is operating on a complete (hugetlb) vma, we
know the vma is huge page aligned and a multiple of the huge page size in
length.  So, all addresses passed in should be huge page aligned as well.

>
>> +
>> +        ptl = huge_pte_lock(hstate_vma(vma), walk->mm, ptep);
>> +
>> +        /* Soft dirty and pmd sharing do not mix */
>
> Right, this seems to be a placeholder for unsharing code.

Sorry, that comment was left over from earlier code.  Unsharing is actually
done below; I forgot to remove the comment.

> Though maybe we can do that earlier in pre_vma() hook?  That should be per-vma
> rather than handling one specific huge page here, hence more efficient imho.

Yes, let me look into that.  The code below is certainly not the most
efficient.

> this reminded me that I should also better move hugetlb_unshare_all_pmds() of
> my other patch into hugetlb.c, so that this code can call it.  Currently it's a
> static function in userfaultfd.c.
>
>> +
>> +        pte = huge_ptep_get(ptep);
>> +        if (!pte_present(pte))
>> +                goto out;
>> +        if (unlikely(is_hugetlb_entry_hwpoisoned(pte)))
>> +                goto out;
>> +
>> +        if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
>
> Maybe move this check into clear_refs_test_walk()?  We can bail out earlier if:
>
>   (is_vm_hugetlb_page(vma) && (type != CLEAR_REFS_SOFT_DIRTY))

Yes, we can do that.  I was patterning this after the other 'clear_refs'
routines, but they can clear things besides soft dirty.  Since soft dirty is
the only thing handled for hugetlb, we can bail out earlier.
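
For concreteness, here is roughly what I have in mind for both the earlier
bail out and the per-vma unsharing via pre_vma() that you suggested above.
Completely untested sketch: it assumes hugetlb_unshare_all_pmds() does get
moved into hugetlb.c and exported as you mention, and clear_refs_pre_vma is
just an illustrative name.

In clear_refs_test_walk(), using its existing cp and vma locals:

        /*
         * Soft dirty is the only thing clear_refs handles for hugetlb
         * vmas, so skip the walk entirely for the other types.
         */
        if (is_vm_hugetlb_page(vma) && cp->type != CLEAR_REFS_SOFT_DIRTY)
                return 1;

And a pre_vma() hook to do the unsharing once per vma:

static int clear_refs_pre_vma(unsigned long start, unsigned long end,
                              struct mm_walk *walk)
{
        struct clear_refs_private *cp = walk->private;
        struct vm_area_struct *vma = walk->vma;

        /*
         * Soft dirty and pmd sharing do not mix.  Unshare for the whole
         * vma up front instead of one huge page at a time in
         * clear_refs_hugetlb_range().
         */
        if (is_vm_hugetlb_page(vma) && cp->type == CLEAR_REFS_SOFT_DIRTY)
                hugetlb_unshare_all_pmds(vma);

        return 0;
}

plus a '.pre_vma = clear_refs_pre_vma' entry in clear_refs_walk_ops.  With
that, the unsharing and range adjustment could come out of
clear_refs_hugetlb_range() entirely.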
>> +                if (huge_pte_is_pinned(vma, addr, pte))
>> +                        goto out;
>
> Out of topic of this patchset, but it's definitely a pity that we can't track
> soft dirty for pinned pages.  Currently the assumption of the pte code path is:
> "if this page can be DMA written then we won't know whether data changed after
> all, then tracking dirty is meaningless", however that's prone to change when
> new hardwares coming, say, IOMMU could start to trap DMA writes already.
>
> But again that's another story.. and we should just follow what we do with
> non-hugetlbfs for sure here, until some day if we'd like to revive soft dirty
> tracking with pinned pages.
>
>> +
>> +                /*
>> +                 * soft dirty and pmd sharing do not work together as
>> +                 * per-process is tracked in ptes, and pmd sharing allows
>> +                 * processes to share ptes.  We unshare any pmds here.
>> +                 */
>> +                adjust_range_if_pmd_sharing_possible(vma, &adj_start, &adj_end);
>
> Ideally when reach here, huge pmd sharing won't ever exist, right?  Then do we
> still need to adjust the range at all?

Right, we should be able to do it earlier.

Thanks again for taking a look at this.
-- 
Mike Kravetz

>
> Thanks,
>