Re: [PATCH V2 7/7] mm: Use pgdp_get() for accessing PGD entries

Ryan Roberts <ryan.roberts@xxxxxxx> · Thu, 19 Sep 2024 19:49:09 +0200

On 19/09/2024 18:06, Russell King (Oracle) wrote:
> On Thu, Sep 19, 2024 at 05:48:58PM +0200, Ryan Roberts wrote:
>>> 32-bit arm uses, in some circumstances, an array because each level 1
>>> page table entry is actually two descriptors. It needs to be this way
>>> because each level 2 table pointed to by each level 1 entry has 256
>>> entries, meaning it only occupies 1024 bytes in a 4096 byte page.
>>>
>>> In order to cut down on the wastage, treat the level 1 page table as
>>> groups of two entries, which point to two consecutive 1024 byte tables
>>> in the level 2 page.
>>>
>>> The level 2 entry isn't suitable for the kernel's use cases (there are
>>> no bits to represent accessed/dirty and other important stuff that the
>>> Linux MM wants) so we maintain the hardware page tables and a separate
>>> set that Linux uses in the same page. Again, the software tables are
>>> consecutive, so from Linux's perspective, the level 2 page tables
>>> have 512 entries in them and occupy one full page.
>>>
>>> This is documented in arch/arm/include/asm/pgtable-2level.h
>>>
>>> However, what this means is that from the software perspective, the
>>> level 1 page table descriptors are an array of two entries, both of
>>> which need to be setup when creating a level 2 page table, but only
>>> the first one should ever be dereferenced when walking the tables,
>>> otherwise the code that walks the second level of page table entries
>>> will walk off the end of the software table into the actual hardware
>>> descriptors.
>>>
>>> I've no idea what the idea is behind introducing pgd_get() and what
>>> it's semantics are, so I can't comment further.
>>
>> The helper is intended to read the value of the entry pointed to by the passed
>> in pointer. And it shoiuld be read in a "single copy atomic" manner, meaning no
>> tearing. Further, the PTL is expected to be held when calling the getter. If the
>> HW can write to the entry such that its racing with the lock holder (i.e. HW
>> update of access/dirty) then READ_ONCE() should be suitable for most
>> architectures. If there is no possibility of racing (because HW doesn't write to
>> the entry), then a simple dereference would be sufficient, I think (which is
>> what the core code was already doing in most cases).
> 
> The core code should be making no access to the PGD entries on 32-bit
> ARM since the PGD level does not exist. Writes are done at PMD level
> in arch code. Reads are done by core code at PMD level.
> 
> It feels to me like pgd_get() just doesn't fit the model to which 32-bit
> ARM was designed to use decades ago, so I want full details about what
> pgd_get() is going to be used for and how it is going to be used,
> because I feel completely in the dark over this new development. I fear
> that someone hasn't understood the Linux page table model if they're
> wanting to access stuff at levels that effectively "aren't implemented"
> in the architecture specific kernel model of the page tables.

This change isn't as big and scary as I think you fear. The core-mm today
dereferences pgd pointers (and p4d, pud, pmd pointers) directly in its code. See
follow_pfnmap_start(), gup_fast_pgd_leaf(), and many other sites. These changes
aim to abstract those dereferences into an inline function that the architecture
can override and implement if it so wishes.

The core-mm implements default versions of these helper functions which do
READ_ONCE(), but does not currently use them consistently.