Re: [PATCH] hugetlb: simplify hugetlb handling in follow_page_mask

On 05/09/2022 at 11:46, David Hildenbrand wrote:
> On 05.09.22 11:33, Christophe Leroy wrote:
>>
>>
>> On 05/09/2022 at 10:37, David Hildenbrand wrote:
>>> On 03.09.22 09:07, Christophe Leroy wrote:
>>>> +Resending with valid powerpc list address
>>>>
>>>> On 02/09/2022 at 20:52, David Hildenbrand wrote:
>>>>>>>> Adding Christophe on Cc:
>>>>>>>>
>>>>>>>> Christophe, do you know if is_hugepd is true for all hugetlb entries,
>>>>>>>> not just hugepd?
>>>>
>>>> is_hugepd() is true if and only if the directory entry points to a huge
>>>> page directory and not to the normal lower level directory.
>>>>
>>>> As far as I understand if the directory entry is not pointing to any
>>>> lower directory but is a huge page entry, pXd_leaf() is true.
>>>>
>>>>
>>>>>>>>
>>>>>>>> On systems without hugepd entries, I guess ptdump skips all
>>>>>>>> hugetlb entries.
>>>>>>>> Sigh!
>>>>
>>>> As far as I can see, ptdump_pXd_entry() handles the pXd_leaf() case.
>>>>
>>>>>>>
>>>>>>> IIUC, the idea of ptdump_walk_pgd() is to dump page tables even outside
>>>>>>> VMAs (for debugging purposes?).
>>>>>>>
>>>>>>> I cannot convince myself that that's a good idea when only holding the
>>>>>>> mmap lock in read mode, because we can just see page tables getting
>>>>>>> freed concurrently, e.g., during concurrent munmap() ... while holding
>>>>>>> the mmap lock in read mode we may only walk inside VMA boundaries.
>>>>>>>
>>>>>>> That then raises the question of whether we're only calling this on
>>>>>>> special MMs (e.g., init_mm), whereby we cannot really see concurrent
>>>>>>> munmap() and where we shouldn't have hugetlb mappings or hugepd entries.
>>>>
>>>> At least on powerpc, PTDUMP handles only init_mm.
>>>>
>>>> Huge pages are used at least on powerpc 8xx for linear memory mapping, see
>>>>
>>>> commit 34536d780683 ("powerpc/8xx: Add a function to early map kernel
>>>> via huge pages")
>>>> commit cf209951fa7f ("powerpc/8xx: Map linear memory with huge pages")
>>>>
>>>> hugepds may also be used in the future to use huge pages for vmap and
>>>> vmalloc, see commit a6a8f7c4aa7e ("powerpc/8xx: add support for huge
>>>> pages on VMAP and VMALLOC")
>>>>
>>>> As far as I know, ppc64 also uses huge pages for VMAP and VMALLOC, see
>>>>
>>>> commit d909f9109c30 ("powerpc/64s/radix: Enable HAVE_ARCH_HUGE_VMAP")
>>>> commit 8abddd968a30 ("powerpc/64s/radix: Enable huge vmalloc mappings")
>>>
>>> There is a difference between an ordinary huge mapping (e.g., as used
>>> for THP) and a hugetlb mapping.
>>>
>>> Our current understanding is that hugepd only applies to hugetlb.
>>> Wouldn't vmap/vmalloc use ordinary huge pmd entries instead of hugepd?
>>>
>>
>> 'hugepd' stands for huge page directory. It is independent of whether a
>> huge page is used for hugetlb or for anything else, it represents the
>> way pages are described in the page tables.
> 
> This patch here makes the assumption that hugepd only applies to 
> hugetlb, because it removes any such handling from the !hugetlb path in 
> GUP. Is that incorrect or are there valid cases where that could happen? 
> (init_mm is special in that regard, I don't think it interacts with GUP 
> at all).

I think you are correct: for user pages, hugepd only applies to hugetlb.

> 
>>
>> I don't know what you mean by _ordinary_ huge pmd entry.
>>
> 
> Essentially, what we use for THP. Let me try to understand how hugepd 
> interacts with the rest of the system.
> 
> Do systems that support hugepd currently implement THP? Reading about 
> 32-bit systems below, I assume not?

Right, as far as I understand only leaf huge pages are handled by THP.

> 
>> Let's take the example of powerpc 8xx which is the one I know best. This
>> is a powerpc32, so it has two levels : PGD and PTE. PGD has 1024 entries
>> and each entry covers a 4Mbytes area. Normal PTE has 1024 entries and
>> each entry is a 4k page. When you use 8Mbytes pages, you don't use PTEs
>> as it would be a waste of memory. You use a huge page directory that has
>> a single entry, and you have two PGD entries pointing to the huge page
>> directory.
> 
> Thanks, I assume there are no 8MB THP, correct?

Correct.
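
To make the 8M case quoted above concrete, here is a small userspace sketch of the arithmetic only. It is not kernel code: the SKETCH_* constants and sketch_* helpers are hypothetical names, and the only facts assumed are the ones stated above (a PGD entry covers 4 Mbytes, a huge page is 8 Mbytes).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants mirroring the 8xx layout described in this thread. */
#define SKETCH_PGDIR_SHIFT 22u /* one PGD entry covers 4 Mbytes */
#define SKETCH_HPAGE_SHIFT 23u /* one huge page covers 8 Mbytes */

/* Index of the PGD entry that covers a 32-bit virtual address. */
static inline uint32_t sketch_pgd_index(uint32_t addr)
{
	return addr >> SKETCH_PGDIR_SHIFT;
}

/* Index of the first PGD entry of the 8M page that contains addr. */
static inline uint32_t sketch_first_pgd_of_hpage(uint32_t addr)
{
	uint32_t base = addr & ~((UINT32_C(1) << SKETCH_HPAGE_SHIFT) - 1);

	return sketch_pgd_index(base);
}

/* Number of consecutive PGD entries that point at the same hugepd. */
static inline uint32_t sketch_pgds_per_hpage(void)
{
	return UINT32_C(1) << (SKETCH_HPAGE_SHIFT - SKETCH_PGDIR_SHIFT);
}
```

With these definitions, sketch_pgds_per_hpage() is 2, and the addresses 0xC0000000 and 0xC0400000 fall in PGD entries 768 and 769, which both belong to the same 8M page.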

> 
> The 8MB example with 4MB PGD entries makes it sound a bit like the 
> cont-PTE/cont-PMD handling on aarch64: they don't use a hugepd but would 
> simply let two consecutive PGD entries point at the relevant (sub) 
> parts of the hugetlb page. No hugepd involved.

Yes it is my feeling as well.

Although in the case of the powerpc 8xx we really need a PGD entry + a 
page entry in order to use the hardware assisted page table walk and 
also to populate L1 and L2 TLB entries without too much processing in the 
TLB-miss interrupt handler.

> 
>>
>> Some time ago, hugepd was also used for 512kbytes pages and 16kbytes
>> pages:
>> - there were huge page directories with 8x 512kbytes pages,
>> - there were huge page directories with 256x 16kbytes pages,
>>
>> And the PGD/PMD entry points to a huge page directory (HUGEPD) instead
>> of pointing to a page table directory (PTE).
> 
> Thanks for the example.
> 
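
The entry counts quoted above (8x 512k, 256x 16k, and 1024x 4k for a normal PTE table) all follow from the 4 Mbytes covered by one PGD entry. A minimal sketch of that arithmetic, in plain userspace C with hypothetical names (SKETCH_PGDIR_SHIFT and sketch_dir_entries() are not kernel symbols):

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PGDIR_SHIFT 22u /* assumption from the thread: 4 Mbytes per PGD entry */

/*
 * Number of entries in the directory that one PGD entry points to,
 * for pages of size 2^page_shift bytes. Pages at least as large as
 * the PGD coverage need a single entry (the 8M case in this thread).
 */
static inline uint32_t sketch_dir_entries(unsigned int page_shift)
{
	assert(page_shift >= 12 && page_shift < 32);

	if (page_shift >= SKETCH_PGDIR_SHIFT)
		return 1;

	return UINT32_C(1) << (SKETCH_PGDIR_SHIFT - page_shift);
}
```

For instance, sketch_dir_entries(19) gives 8 for 512k pages and sketch_dir_entries(14) gives 256 for 16k pages.
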
>>
>> Since commit b250c8c08c79 ("powerpc/8xx: Manage 512k huge pages as
>> standard pages."), the 8xx no longer uses hugepd for 512k huge
>> pages, but other platforms like powerpc book3e extensively use huge page
>> directories.
>>
>> I hope this clarifies the subject, otherwise I'm happy to provide
>> further details.
> 
> Thanks, it would be valuable to know if the assumption in this patch is 
> correct: hugepd will only be found in hugetlb areas in ordinary MMs (not 
> init_mm).
> 

Yes, I think the assumption is correct for user pages, hence for GUP.

By the way, the discussion started with PTDUMP. For PTDUMP we need huge 
page directories to be taken into account, as for anything that 
involves kernel pages like VMAP or VMALLOC.
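
As a toy illustration of that last point (not the real ptdump code, and every name below is hypothetical), a walker that only understands leaf entries misses whatever is mapped through a huge page directory:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of what a directory entry can be; not a kernel type. */
enum sketch_kind {
	SKETCH_NONE,   /* empty entry */
	SKETCH_TABLE,  /* points to a normal lower level directory */
	SKETCH_LEAF,   /* huge page entry, the pXd_leaf() case */
	SKETCH_HUGEPD, /* points to a huge page directory, the is_hugepd() case */
};

struct sketch_entry {
	enum sketch_kind kind;
};

/*
 * Count the entries a dumper would report as huge mappings.
 * With handle_hugepd == 0 only leaf entries are seen.
 */
static inline size_t sketch_count_huge(const struct sketch_entry *dir,
				       size_t n, int handle_hugepd)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (dir[i].kind == SKETCH_LEAF)
			count++;
		else if (dir[i].kind == SKETCH_HUGEPD && handle_hugepd)
			count++;
	}
	return count;
}
```

On a directory holding one leaf entry and two hugepd entries, the count is 1 without hugepd handling and 3 with it.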

Christophe



