On 19/09/2024 18:06, Russell King (Oracle) wrote: > On Thu, Sep 19, 2024 at 05:48:58PM +0200, Ryan Roberts wrote: >>> 32-bit arm uses, in some circumstances, an array because each level 1 >>> page table entry is actually two descriptors. It needs to be this way >>> because each level 2 table pointed to by each level 1 entry has 256 >>> entries, meaning it only occupies 1024 bytes in a 4096 byte page. >>> >>> In order to cut down on the wastage, treat the level 1 page table as >>> groups of two entries, which point to two consecutive 1024 byte tables >>> in the level 2 page. >>> >>> The level 2 entry isn't suitable for the kernel's use cases (there are >>> no bits to represent accessed/dirty and other important stuff that the >>> Linux MM wants) so we maintain the hardware page tables and a separate >>> set that Linux uses in the same page. Again, the software tables are >>> consecutive, so from Linux's perspective, the level 2 page tables >>> have 512 entries in them and occupy one full page. >>> >>> This is documented in arch/arm/include/asm/pgtable-2level.h >>> >>> However, what this means is that from the software perspective, the >>> level 1 page table descriptors are an array of two entries, both of >>> which need to be setup when creating a level 2 page table, but only >>> the first one should ever be dereferenced when walking the tables, >>> otherwise the code that walks the second level of page table entries >>> will walk off the end of the software table into the actual hardware >>> descriptors. >>> >>> I've no idea what the idea is behind introducing pgd_get() and what >>> it's semantics are, so I can't comment further. >> >> The helper is intended to read the value of the entry pointed to by the passed >> in pointer. And it shoiuld be read in a "single copy atomic" manner, meaning no >> tearing. Further, the PTL is expected to be held when calling the getter. If the >> HW can write to the entry such that its racing with the lock holder (i.e. HW >> update of access/dirty) then READ_ONCE() should be suitable for most >> architectures. If there is no possibility of racing (because HW doesn't write to >> the entry), then a simple dereference would be sufficient, I think (which is >> what the core code was already doing in most cases). > > The core code should be making no access to the PGD entries on 32-bit > ARM since the PGD level does not exist. Writes are done at PMD level > in arch code. Reads are done by core code at PMD level. > > It feels to me like pgd_get() just doesn't fit the model to which 32-bit > ARM was designed to use decades ago, so I want full details about what > pgd_get() is going to be used for and how it is going to be used, > because I feel completely in the dark over this new development. I fear > that someone hasn't understood the Linux page table model if they're > wanting to access stuff at levels that effectively "aren't implemented" > in the architecture specific kernel model of the page tables. This change isn't as big and scary as I think you fear. The core-mm today dereferences pgd pointers (and p4d, pud, pmd pointers) directly in its code. See follow_pfnmap_start(), gup_fast_pgd_leaf(), and many other sites. These changes aim to abstract those dereferences into an inline function that the architecture can override and implement if it so wishes. The core-mm implements default versions of these helper functions which do READ_ONCE(), but does not currently use them consistently.