Re: [PATCH v2 1/5] mm/hugetlb: fix races when looking up a CONT-PTE size hugetlb page

David Hildenbrand <david@xxxxxxxxxx> · Wed, 24 Aug 2022 17:13:08 +0200

On 24.08.22 17:06, Baolin Wang wrote:
> 
> 
> On 8/24/2022 10:33 PM, David Hildenbrand wrote:
>> On 24.08.22 16:30, Baolin Wang wrote:
>>>
>>>
>>> On 8/24/2022 7:55 PM, David Hildenbrand wrote:
>>>> On 24.08.22 11:41, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 8/24/2022 3:31 PM, David Hildenbrand wrote:
>>>>>>>>>>
>>>>>>>>>> IMHO, these follow_huge_xxx() functions are arch-specified at first and
>>>>>>>>>> were moved into the common hugetlb.c by commit 9e5fc74c3025 ("mm:
>>>>>>>>>> hugetlb: Copy general hugetlb code from x86 to mm"), and now there are
>>>>>>>>>> still some arch-specified follow_huge_xxx() definition, for example:
>>>>>>>>>> ia64: follow_huge_addr
>>>>>>>>>> powerpc: follow_huge_pd
>>>>>>>>>> s390: follow_huge_pud
>>>>>>>>>>
>>>>>>>>>> What I mean is that follow_hugetlb_page() is a common and
>>>>>>>>>> not-arch-specified function, is it suitable to change it to be
>>>>>>>>>> arch-specified?
>>>>>>>>>> And thinking more, can we rename follow_hugetlb_page() as
>>>>>>>>>> hugetlb_page_faultin() and simplify it to only handle the page faults of
>>>>>>>>>> hugetlb like the faultin_page() for normal page? That means we can make
>>>>>>>>>> sure only follow_page_mask() can handle hugetlb.
>>>>>>>>>>
>>>>>>>>
>>>>>>>> Something like that might work, but you still have two page table walkers
>>>>>>>> for hugetlb.  I like David's idea (if I understand it correctly) of
>>>>>>>
>>>>>>> What I mean is we may change the hugetlb handling like normal page:
>>>>>>> 1) use follow_page_mask() to look up a hugetlb firstly.
>>>>>>> 2) if can not get the hugetlb, then try to page fault by
>>>>>>> hugetlb_page_faultin().
>>>>>>> 3) if page fault successed, then retry to find hugetlb by
>>>>>>> follow_page_mask().
>>>>>>
>>>>>> That implies putting more hugetlbfs special code into generic GUP,
>>>>>> turning it even more complicated. But of course, it depends on how the
>>>>>> end result looks like. My gut feeling was that hugetlb is better handled
>>>>>> in follow_hugetlb_page() separately (just like we do with a lot of other
>>>>>> page table walkers).
>>>>>
>>>>> OK, fair enough.
>>>>>
>>>>>>>
>>>>>>> Just a rough thought, and I need more investigation for my idea and
>>>>>>> David's idea.
>>>>>>>
>>>>>>>> using follow_hugetlb_page for both cases.  As noted, it will need to be
>>>>>>>> taught how to not trigger faults in the follow_page_mask case.
>>>>>>>
>>>>>>> Anyway, I also agree we need some cleanup, and firstly I think we should
>>>>>>> cleanup these arch-specified follow_huge_xxx() on some architectures
>>>>>>> which are similar with the common ones. I will look into these.
>>>>>>
>>>>>> There was a recent discussion on that, e.g.:
>>>>>>
>>>>>> https://lkml.kernel.org/r/20220818135717.609eef8a@thinkpad
>>>>>
>>>>> Thanks.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> However, considering cleanup may need more investigation and
>>>>>>> refactoring, now I prefer to make these bug-fix patches of this patchset
>>>>>>> into mainline firstly, which are suitable to backport to old version to
>>>>>>> fix potential race issues. Mike and David, how do you think? Could you
>>>>>>> help to review these patches? Thanks.
>>>>>>
>>>>>> Patch #1 certainly add more special code just to handle another hugetlb
>>>>>> corner case (CONT pages), and maybe just making it all use
>>>>>> follow_hugetlb_page() would be even cleaner and less error prone.
>>>>>>
>>>>>> I agree that locking is shaky, but I'm not sure if we really want to
>>>>>> backport this to stable trees:
>>>>>>
>>>>>> https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
>>>>>>
>>>>>> "It must fix a real bug that bothers people (not a, “This could be a
>>>>>> problem...” type thing)."
>>>>>>
>>>>>>
>>>>>> Do we actually have any instance of this being a real (and not a
>>>>>> theoretical) problem? If not, I'd rather clean it all up right away.
>>>>>
>>>>> I think this is a real problem (not theoretical), and easy to write some
>>>>> code to show the issue. For example, suppose thread A is trying to look
>>>>> up a CONT-PTE size hugetlb page under the lock, however antoher thread B
>>>>> can migrate the CONT-PTE hugetlb page at the same time, which will cause
>>>>> thread A to get an incorrect page, if thread A want to do something for
>>>>> this incorrect page, error occurs.
>>>>>
>>>>> Actually we also want to backport these fixes to the distro with old
>>>>> kernel versions to make the hugetlb more stable. Otherwise we must hit
>>>>> these issues sooner or later if the customers use CONT-PTE/PMD hugetlb.
>>>>>
>>>>> Anyway, if you and Mike still think these issues are not important
>>>>> enough to be fixed in the old versions, I can do the cleanup firstly.
>>>>>
>>>>
>>>> [asking myself which follow_page() users actually care about hugetlb,
>>>> and why we need this handling in follow_page at all]
>>>>
>>>> Which follow_page() user do we care about here? Primarily mm/migrate.c
>>>> only I assume?
>>>
>>> Right, mainly affects the move_pages() syscall I think. Yes, I can not
>>> know all of the users of the move_pages() syscall now or in the future
>>> in our data center, but like I said the move_pages() syscall + hugetlb
>>> can be a real potential stability issue.
>>>
>>
>> I wonder if we can get rid of follow_page() completely, there are not
>> too many users. Or alternatively simply make it use general GUP
>> infrastructure more clearly. We'd need something like FOLL_NOFAULT that
>> also covers "absolutely no faults".
> 
> I am not sure I get your point. So you want change to use 
> __get_user_pages() (or silimar wrappers) to look up a normal page or 
> hugetlb instead of follow_page()? and adding a new FOLL_NOFAULT flag to 
> __get_user_pages().

Essentially just getting rid of follow_page() completely or making it a
wrapper of __get_user_pages().

> 
> If I understand correctly, we still need more work to move those 
> arch-specified follow_huge_xxx() into follow_hugetlb_page() firstly like 
> we disscussed before? Which seems not backportable too.

I'm not sure we need all that magic in these arch specific helpers after
all. I haven't looked into the details, but I really wonder why they
handle something that follow_hugetlb_page() cannot easily handle. It all
smells like legacy cruft.

> 
> I am not againt your idea, and I also agree that we should do some 
> cleanup. But the point is if we need backport patches to fix this issue, 
> which affects move_pages() syscall, if the answer is yes, I think my 
> current fixes are suitable to backport.

I really don't like adding more make-legacy-cruft-happy code unless
there is *real* need for it. (you could always just fix old kernels you
care about with your patches here -- do they have to be in mainline?
don't think so)

But of course, it's up to Mike to decide, just my 2 cents :)

-- 
Thanks,

David / dhildenb