Re: [PATCH 1/4] mm: pagewalk: assert write mmap lock only for walking the user page tables

> On Dec 2, 2023, at 17:25, Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:
> 
> 
> 
> On 2023/12/2 16:08, Muchun Song wrote:
>>>> On Dec 1, 2023, at 19:09, Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:
>>> 
>>> 
>>> 
>>> On 2023/11/27 16:46, Muchun Song wrote:
>>>> Commit 8782fb61cc848 ("mm: pagewalk: Fix race between unmap and page walker")
>>>> introduced an assertion in walk_page_range_novma() to make sure all users
>>>> of the page table walker are safe. However, the race only exists when walking
>>>> the user page tables, and it makes no sense to hold a particular user mmap
>>>> write lock against changes to the kernel page tables. So only assert that at
>>>> least the mmap read lock is held when walking the kernel page tables. Users
>>>> matching this case can then downgrade to a mmap read lock to relieve
>>>> contention on the mmap lock of init_mm; hugetlb will benefit from this
>>>> (holding only the mmap read lock) in the next patch.
>>>> Signed-off-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
>>>> ---
>>>>  mm/pagewalk.c | 29 ++++++++++++++++++++++++++++-
>>>>  1 file changed, 28 insertions(+), 1 deletion(-)
>>>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>>>> index b7d7e4fcfad7a..f46c80b18ce4f 100644
>>>> --- a/mm/pagewalk.c
>>>> +++ b/mm/pagewalk.c
>>>> @@ -539,6 +539,11 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>>>>   * not backed by VMAs. Because 'unusual' entries may be walked this function
>>>>   * will also not lock the PTEs for the pte_entry() callback. This is useful for
>>>>   * walking the kernel pages tables or page tables for firmware.
>>>> + *
>>>> + * Note: Be careful when walking the kernel page tables. The caller may need
>>>> + * to take other effective approaches (the mmap lock may be insufficient) to
>>>> + * prevent the intermediate kernel page tables belonging to the specified
>>>> + * address range from being freed (e.g. memory hot-remove).
>>>>   */
>>>>  int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>>>>    unsigned long end, const struct mm_walk_ops *ops,
>>>> @@ -556,7 +561,29 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>>>>   if (start >= end || !walk.mm)
>>>>   return -EINVAL;
>>>>  - mmap_assert_write_locked(walk.mm);
>>>> + /*
>>>> + * 1) For walking the user virtual address space:
>>>> + *
>>>> + * The mmap lock protects the page walker from changes to the page
>>>> + * tables during the walk.  However a read lock is insufficient to
>>>> + * protect those areas which don't have a VMA as munmap() detaches
>>>> + * the VMAs before downgrading to a read lock and actually tearing
>>>> + * down PTEs/page tables. In which case, the mmap write lock should
>>>> + * be held.
>>>> + *
>>>> + * 2) For walking the kernel virtual address space:
>>>> + *
>>>> + * The kernel intermediate page tables are usually not freed, so
>>>> + * the mmap read lock is sufficient. But there are some exceptions,
>>>> + * e.g. memory hot-remove, in which case the mmap lock is insufficient
>>>> + * to prevent the intermediate kernel page tables belonging to the
>>>> + * specified address range from being freed. The caller should take
>>>> + * other actions to prevent this race.
>>>> + */
>>>> + if (mm == &init_mm)
>>>> +    mmap_assert_locked(walk.mm);
>>>> + else
>>>> +    mmap_assert_write_locked(walk.mm);
>>> 
>>> Maybe just use process_mm_walk_lock() and set correct page_walk_lock in struct mm_walk_ops?
>> No. You also need to make sure the users do not pass the wrong
>> walk_lock, so you also need to add something like the following:
> 
> But all the other walk_page_XX functions have been converted (see commit
> 49b0638502da ("mm: enable page walking API to lock vmas during the walk")).
> There's nothing special about this one; the callers must pass the right
> page_walk_lock in mm_walk_ops.

If you think this one is not special, why was it not converted by that commit at that time?

> 
>> if (mm == &init_mm)
>>    VM_BUG_ON(walk_lock != PGWALK_RDLOCK);
>> else
>>    VM_BUG_ON(walk_lock == PGWALK_RDLOCK);
>> I do not think the code will be simple.
> 
> or adding the above lock check into process_mm_walk_lock too.

No, that’s wrong. walk_page_range_novma() is special compared with the other variants: the check applies only to walk_page_range_novma(), not to its variants.
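
To make the rule concrete, here is a minimal userspace sketch (not kernel code) of the check the patch adds to walk_page_range_novma(). The names `lock_state`, `mm_model`, `novma_lock_ok()` and `walk_novma_model()` are hypothetical stand-ins invented for this illustration; the `is_init_mm` flag models the `mm == &init_mm` comparison, and the lock state models what mmap_assert_locked() and mmap_assert_write_locked() would verify.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins, NOT the kernel definitions. */
enum lock_state { UNLOCKED, READ_LOCKED, WRITE_LOCKED };

struct mm_model {
	enum lock_state mmap_lock;	/* how the caller holds the mmap lock */
	bool is_init_mm;		/* models "mm == &init_mm" */
};

/* Returns true when the held lock satisfies the assertion in the patch. */
bool novma_lock_ok(const struct mm_model *mm)
{
	if (mm->is_init_mm)
		/* Kernel page tables: any mmap lock (read or write) is accepted. */
		return mm->mmap_lock != UNLOCKED;

	/* User page tables: only the mmap write lock is accepted. */
	return mm->mmap_lock == WRITE_LOCKED;
}

/* Models the entry of the walker: assert the lock, then "walk". */
int walk_novma_model(const struct mm_model *mm)
{
	assert(novma_lock_ok(mm));
	return 0;
}
```

Under this model, a read-locked init_mm passes and a read-locked user mm does not, which is the asymmetry that applies to walk_page_range_novma() only.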



