On 24.08.22 17:06, Baolin Wang wrote: > > > On 8/24/2022 10:33 PM, David Hildenbrand wrote: >> On 24.08.22 16:30, Baolin Wang wrote: >>> >>> >>> On 8/24/2022 7:55 PM, David Hildenbrand wrote: >>>> On 24.08.22 11:41, Baolin Wang wrote: >>>>> >>>>> >>>>> On 8/24/2022 3:31 PM, David Hildenbrand wrote: >>>>>>>>>> >>>>>>>>>> IMHO, these follow_huge_xxx() functions are arch-specified at first and >>>>>>>>>> were moved into the common hugetlb.c by commit 9e5fc74c3025 ("mm: >>>>>>>>>> hugetlb: Copy general hugetlb code from x86 to mm"), and now there are >>>>>>>>>> still some arch-specified follow_huge_xxx() definition, for example: >>>>>>>>>> ia64: follow_huge_addr >>>>>>>>>> powerpc: follow_huge_pd >>>>>>>>>> s390: follow_huge_pud >>>>>>>>>> >>>>>>>>>> What I mean is that follow_hugetlb_page() is a common and >>>>>>>>>> not-arch-specified function, is it suitable to change it to be >>>>>>>>>> arch-specified? >>>>>>>>>> And thinking more, can we rename follow_hugetlb_page() as >>>>>>>>>> hugetlb_page_faultin() and simplify it to only handle the page faults of >>>>>>>>>> hugetlb like the faultin_page() for normal page? That means we can make >>>>>>>>>> sure only follow_page_mask() can handle hugetlb. >>>>>>>>>> >>>>>>>> >>>>>>>> Something like that might work, but you still have two page table walkers >>>>>>>> for hugetlb. I like David's idea (if I understand it correctly) of >>>>>>> >>>>>>> What I mean is we may change the hugetlb handling like normal page: >>>>>>> 1) use follow_page_mask() to look up a hugetlb firstly. >>>>>>> 2) if can not get the hugetlb, then try to page fault by >>>>>>> hugetlb_page_faultin(). >>>>>>> 3) if page fault successed, then retry to find hugetlb by >>>>>>> follow_page_mask(). >>>>>> >>>>>> That implies putting more hugetlbfs special code into generic GUP, >>>>>> turning it even more complicated. But of course, it depends on how the >>>>>> end result looks like. My gut feeling was that hugetlb is better handled >>>>>> in follow_hugetlb_page() separately (just like we do with a lot of other >>>>>> page table walkers). >>>>> >>>>> OK, fair enough. >>>>> >>>>>>> >>>>>>> Just a rough thought, and I need more investigation for my idea and >>>>>>> David's idea. >>>>>>> >>>>>>>> using follow_hugetlb_page for both cases. As noted, it will need to be >>>>>>>> taught how to not trigger faults in the follow_page_mask case. >>>>>>> >>>>>>> Anyway, I also agree we need some cleanup, and firstly I think we should >>>>>>> cleanup these arch-specified follow_huge_xxx() on some architectures >>>>>>> which are similar with the common ones. I will look into these. >>>>>> >>>>>> There was a recent discussion on that, e.g.: >>>>>> >>>>>> https://lkml.kernel.org/r/20220818135717.609eef8a@thinkpad >>>>> >>>>> Thanks. >>>>> >>>>>> >>>>>>> >>>>>>> However, considering cleanup may need more investigation and >>>>>>> refactoring, now I prefer to make these bug-fix patches of this patchset >>>>>>> into mainline firstly, which are suitable to backport to old version to >>>>>>> fix potential race issues. Mike and David, how do you think? Could you >>>>>>> help to review these patches? Thanks. >>>>>> >>>>>> Patch #1 certainly add more special code just to handle another hugetlb >>>>>> corner case (CONT pages), and maybe just making it all use >>>>>> follow_hugetlb_page() would be even cleaner and less error prone. >>>>>> >>>>>> I agree that locking is shaky, but I'm not sure if we really want to >>>>>> backport this to stable trees: >>>>>> >>>>>> https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html >>>>>> >>>>>> "It must fix a real bug that bothers people (not a, “This could be a >>>>>> problem...” type thing)." >>>>>> >>>>>> >>>>>> Do we actually have any instance of this being a real (and not a >>>>>> theoretical) problem? If not, I'd rather clean it all up right away. >>>>> >>>>> I think this is a real problem (not theoretical), and easy to write some >>>>> code to show the issue. For example, suppose thread A is trying to look >>>>> up a CONT-PTE size hugetlb page under the lock, however antoher thread B >>>>> can migrate the CONT-PTE hugetlb page at the same time, which will cause >>>>> thread A to get an incorrect page, if thread A want to do something for >>>>> this incorrect page, error occurs. >>>>> >>>>> Actually we also want to backport these fixes to the distro with old >>>>> kernel versions to make the hugetlb more stable. Otherwise we must hit >>>>> these issues sooner or later if the customers use CONT-PTE/PMD hugetlb. >>>>> >>>>> Anyway, if you and Mike still think these issues are not important >>>>> enough to be fixed in the old versions, I can do the cleanup firstly. >>>>> >>>> >>>> [asking myself which follow_page() users actually care about hugetlb, >>>> and why we need this handling in follow_page at all] >>>> >>>> Which follow_page() user do we care about here? Primarily mm/migrate.c >>>> only I assume? >>> >>> Right, mainly affects the move_pages() syscall I think. Yes, I can not >>> know all of the users of the move_pages() syscall now or in the future >>> in our data center, but like I said the move_pages() syscall + hugetlb >>> can be a real potential stability issue. >>> >> >> I wonder if we can get rid of follow_page() completely, there are not >> too many users. Or alternatively simply make it use general GUP >> infrastructure more clearly. We'd need something like FOLL_NOFAULT that >> also covers "absolutely no faults". > > I am not sure I get your point. So you want change to use > __get_user_pages() (or silimar wrappers) to look up a normal page or > hugetlb instead of follow_page()? and adding a new FOLL_NOFAULT flag to > __get_user_pages(). Essentially just getting rid of follow_page() completely or making it a wrapper of __get_user_pages(). > > If I understand correctly, we still need more work to move those > arch-specified follow_huge_xxx() into follow_hugetlb_page() firstly like > we disscussed before? Which seems not backportable too. I'm not sure we need all that magic in these arch specific helpers after all. I haven't looked into the details, but I really wonder why they handle something that follow_hugetlb_page() cannot easily handle. It all smells like legacy cruft. > > I am not againt your idea, and I also agree that we should do some > cleanup. But the point is if we need backport patches to fix this issue, > which affects move_pages() syscall, if the answer is yes, I think my > current fixes are suitable to backport. I really don't like adding more make-legacy-cruft-happy code unless there is *real* need for it. (you could always just fix old kernels you care about with your patches here -- do they have to be in mainline? don't think so) But of course, it's up to Mike to decide, just my 2 cents :) -- Thanks, David / dhildenb