On 08/25/22 09:43, Baolin Wang wrote:
>
>
> On 8/25/2022 7:34 AM, Mike Kravetz wrote:
> > On 08/24/22 17:41, Baolin Wang wrote:
> > >
> > >
> > > On 8/24/2022 3:31 PM, David Hildenbrand wrote:
> > > > > > > >
> > > > > > > > IMHO, these follow_huge_xxx() functions are arch-specified at first and
> > > > > > > > were moved into the common hugetlb.c by commit 9e5fc74c3025 ("mm:
> > > > > > > > hugetlb: Copy general hugetlb code from x86 to mm"), and now there are
> > > > > > > > still some arch-specified follow_huge_xxx() definition, for example:
> > > > > > > > ia64: follow_huge_addr
> > > > > > > > powerpc: follow_huge_pd
> > > > > > > > s390: follow_huge_pud
> > > > > > > >
> > > > > > > > What I mean is that follow_hugetlb_page() is a common and
> > > > > > > > not-arch-specified function, is it suitable to change it to be
> > > > > > > > arch-specified?
> > > > > > > > And thinking more, can we rename follow_hugetlb_page() as
> > > > > > > > hugetlb_page_faultin() and simplify it to only handle the page faults of
> > > > > > > > hugetlb like the faultin_page() for normal page? That means we can make
> > > > > > > > sure only follow_page_mask() can handle hugetlb.
> > > > > > > >
> > > > > >
> > > > > > Something like that might work, but you still have two page table walkers
> > > > > > for hugetlb. I like David's idea (if I understand it correctly) of
> > > > >
> > > > > What I mean is we may change the hugetlb handling like normal page:
> > > > > 1) use follow_page_mask() to look up a hugetlb firstly.
> > > > > 2) if can not get the hugetlb, then try to page fault by
> > > > > hugetlb_page_faultin().
> > > > > 3) if page fault succeeded, then retry to find hugetlb by
> > > > > follow_page_mask().
> > > >
> > > > That implies putting more hugetlbfs special code into generic GUP,
> > > > turning it even more complicated. But of course, it depends on how the
> > > > end result looks like.
> > > > My gut feeling was that hugetlb is better handled
> > > > in follow_hugetlb_page() separately (just like we do with a lot of other
> > > > page table walkers).
> > >
> > > OK, fair enough.
> > >
> > > > >
> > > > > Just a rough thought, and I need more investigation for my idea and
> > > > > David's idea.
> > > > >
> > > > > > using follow_hugetlb_page for both cases. As noted, it will need to be
> > > > > > taught how to not trigger faults in the follow_page_mask case.
> > > > >
> > > > > Anyway, I also agree we need some cleanup, and firstly I think we should
> > > > > cleanup these arch-specified follow_huge_xxx() on some architectures
> > > > > which are similar with the common ones. I will look into these.
> > > >
> > > > There was a recent discussion on that, e.g.:
> > > >
> > > > https://lkml.kernel.org/r/20220818135717.609eef8a@thinkpad
> > >
> > > Thanks.
> > >
> > > >
> > > > >
> > > > > However, considering cleanup may need more investigation and
> > > > > refactoring, now I prefer to make these bug-fix patches of this patchset
> > > > > into mainline firstly, which are suitable to backport to old version to
> > > > > fix potential race issues. Mike and David, how do you think? Could you
> > > > > help to review these patches? Thanks.
> > > >
> > > > Patch #1 certainly add more special code just to handle another hugetlb
> > > > corner case (CONT pages), and maybe just making it all use
> > > > follow_hugetlb_page() would be even cleaner and less error prone.
> > > >
> > > > I agree that locking is shaky, but I'm not sure if we really want to
> > > > backport this to stable trees:
> > > >
> > > > https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> > > >
> > > > "It must fix a real bug that bothers people (not a, “This could be a
> > > > problem...” type thing)."
> > > >
> > > >
> > > > Do we actually have any instance of this being a real (and not a
> > > > theoretical) problem? If not, I'd rather clean it all up right away.
> > >
> > > I think this is a real problem (not theoretical), and easy to write some
> > > code to show the issue. For example, suppose thread A is trying to look up a
> > > CONT-PTE size hugetlb page under the lock, however another thread B can
> > > migrate the CONT-PTE hugetlb page at the same time, which will cause thread
> > > A to get an incorrect page, if thread A want to do something for this
> > > incorrect page, error occurs.
> >
> > Is the primary concern the locking? If so, I am not sure we have an issue.
>
> Yes.
>
> > As mentioned in your commit message, current code will use
> > pte_offset_map_lock(). pte_offset_map_lock uses pte_lockptr, and pte_lockptr
> > will either be the mm wide lock or pmd_page lock. To me, it seems that
>
> The ALLOC_SPLIT_PTLOCKS can be always true on my machine, that means the
> pte_lockptr() will always use the PTE page lock, however huge_pte_lock()
> will use the mm wide lock.

Yes, the different calling context/path to the locking code will cause
a different lock to be used. I thought of that AFTER sending the above.

>
> > either would provide correct synchronization for CONT-PTE entries. Am I
> > missing something or misreading the code?
> >
> > I started looking at code cleanup suggested by David. Here is a quick
> > patch (not tested and likely containing errors) to see if this is a step
> > in the right direction.
> >
> > I like it because we get rid of/combine all those follow_huge_p*d
> > routines.
>
> Great, this looks straight forward to me (some nits as below).
> David, how do you think?
>

I will continue to refine this based on suggestions from you and David.

> > +struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
> > +				unsigned long address, unsigned int flags)
> > +{
> > +	struct hstate *h = hstate_vma(vma);
> > +	struct mm_struct *mm = vma->vm_mm;
> > +	unsigned long haddr = address & huge_page_mask(h);
> > +	struct page *page = NULL;
> > +	spinlock_t *ptl;
> > +	pte_t *pte, entry;
> > +
> > +	/*
> > +	 * FOLL_PIN is not supported for follow_page(). Ordinary GUP goes via
> > +	 * follow_hugetlb_page().
> > +	 */
> > +	if (WARN_ON_ONCE(flags & FOLL_PIN))
> > +		return NULL;
> > +
> > +	pte = huge_pte_offset(mm, haddr, huge_page_size(h));
> > +	if (!pte)
> > +		return NULL;
> > +
> > +retry:
> > +	ptl = huge_pte_lock(h, mm, pte);
> > +	entry = huge_ptep_get(pte);
> > +	if (pte_present(entry)) {
> > +		page = pte_page(entry);
>
> Should follow previous logic?
> page = pte_page(entry) + ((address & ~huge_page_mask(h)) >> PAGE_SHIFT);
>

Yes, this needs to be PAGE aligned, not HUGETLB_PAGE aligned.
-- 
Mike Kravetz

> > +		/*
> > +		 * try_grab_page() should always succeed here, because we hold
> > +		 * the ptl lock and have verified pte_present().
> > +		 */
> > +		if (WARN_ON_ONCE(!try_grab_page(page, flags))) {
> > +			page = NULL;
> > +			goto out;
> > +		}
> > +	} else {