On 2024/8/16 3:20, Peter Xu wrote:
On Wed, Aug 14, 2024 at 09:37:15AM -0300, Jason Gunthorpe wrote:
Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.
There is definitely interest here in extending ARM to support the 1G
size too; what is missing?
Currently, PUD pfnmap relies on the THP_PUD config option:
config ARCH_SUPPORTS_PUD_PFNMAP
def_bool y
depends on ARCH_SUPPORTS_HUGE_PFNMAP && HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
Arm64 unfortunately doesn't support 1G dax, so this isn't applicable there yet.
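For comparison, the PMD-level option in the same series only depends on
generic THP, which is why arm64 already gets the 2M case; this is a sketch
from my reading of the series, so the exact text may differ:
config ARCH_SUPPORTS_PMD_PFNMAP
def_bool y
depends on ARCH_SUPPORTS_HUGE_PFNMAP && TRANSPARENT_HUGEPAGE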
Ideally, pfnmap is so much simpler than real THPs that it shouldn't need to
depend on THP at all, but we'll need things like the below to land first:
https://lore.kernel.org/r/20240717220219.3743374-1-peterx@xxxxxxxxxx
I sent that a while ago but didn't collect enough input, so I decided to
unblock this series from it: x86_64 shouldn't be affected, and arm64 will at
least start with 2M.
The other trick is how to make gup-fast work for such huge mappings when
there is no direct way to know whether an entry maps normal pages or MMIO.
This series keeps the pte_special solution: it reuses the same idea by
setting a special bit on pfnmap PMDs/PUDs, so that gup-fast can identify
them and fail properly.
Makes sense.
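To make the check concrete for anyone skimming the thread, here is a minimal
sketch of the idea in the gup-fast leaf path. It is simplified and not the
exact hunk from the series; pmd_special() stands for the new special-bit
check, and the function signature is trimmed down for readability:

static int gup_fast_pmd_leaf(pmd_t orig, unsigned int flags,
			     struct page **pages, int *nr)
{
	/*
	 * gup-fast runs without the mmap lock, so it must decide from
	 * the entry alone whether it is safe to grab references.
	 */
	if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
		return 0;

	/* Special bit set: a pfnmap with no struct page behind it. */
	if (pmd_special(orig))
		return 0;	/* bail out; caller falls back to the slow path */

	/* ... normal huge-page pinning continues as before ... */
	return 1;
}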
More architectures / More page sizes
------------------------------------
Currently only x86_64 (2M+1G) and arm64 (2M) are supported.
For example, if arm64 starts to support THP_PUD one day, 1G huge pfnmaps
will be enabled automatically.
Below is a draft patch to enable THP_PUD on arm64. So far it has only passed
DEBUG_VM_PGTABLE, but with it we may be able to test PUD pfnmaps on arm64.
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2f8ff354ca6..ff0d27c72020 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -184,6 +184,7 @@ config ARM64
select HAVE_ARCH_THREAD_STRUCT_WHITELIST
select HAVE_ARCH_TRACEHOOK
select HAVE_ARCH_TRANSPARENT_HUGEPAGE
+ select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if PGTABLE_LEVELS > 2
select HAVE_ARCH_VMAP_STACK
select HAVE_ARM_SMCCC
select HAVE_ASM_MODVERSIONS
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 7a4f5604be3f..e013fe458476 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -763,6 +763,25 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
#define pud_valid(pud) pte_valid(pud_pte(pud))
#define pud_user(pud) pte_user(pud_pte(pud))
#define pud_user_exec(pud) pte_user_exec(pud_pte(pud))
+#define pud_dirty(pud) pte_dirty(pud_pte(pud))
+#define pud_devmap(pud) pte_devmap(pud_pte(pud))
+#define pud_wrprotect(pud) pte_pud(pte_wrprotect(pud_pte(pud)))
+#define pud_mkold(pud) pte_pud(pte_mkold(pud_pte(pud)))
+#define pud_mkwrite(pud) pte_pud(pte_mkwrite_novma(pud_pte(pud)))
+#define pud_mkclean(pud) pte_pud(pte_mkclean(pud_pte(pud)))
+#define pud_mkdirty(pud) pte_pud(pte_mkdirty(pud_pte(pud)))
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static inline int pud_trans_huge(pud_t pud)
+{
+ return pud_val(pud) && pud_present(pud) && !(pud_val(pud) & PUD_TABLE_BIT);
+}
+
+static inline pud_t pud_mkdevmap(pud_t pud)
+{
+ return pte_pud(set_pte_bit(pud_pte(pud), __pgprot(PTE_DEVMAP)));
+}
+#endif
static inline bool pgtable_l4_enabled(void);
@@ -1137,10 +1156,20 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
pmd_pte(entry), dirty);
}
+static inline int pudp_set_access_flags(struct vm_area_struct *vma,
+ unsigned long address, pud_t *pudp,
+ pud_t entry, int dirty)
+{
+ return __ptep_set_access_flags(vma, address, (pte_t *)pudp,
+ pud_pte(entry), dirty);
+}
+
+#ifndef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
static inline int pud_devmap(pud_t pud)
{
return 0;
}
+#endif
static inline int pgd_devmap(pgd_t pgd)
{
@@ -1213,6 +1242,13 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
{
return __ptep_test_and_clear_young(vma, address, (pte_t *)pmdp);
}
+
+static inline int pudp_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long address,
+ pud_t *pudp)
+{
+ return __ptep_test_and_clear_young(vma, address, (pte_t *)pudp);
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
static inline pte_t __ptep_get_and_clear(struct mm_struct *mm,
@@ -1433,6 +1469,7 @@ static inline void update_mmu_cache_range(struct vm_fault *vmf,
#define update_mmu_cache(vma, addr, ptep) \
update_mmu_cache_range(NULL, vma, addr, ptep, 1)
#define update_mmu_cache_pmd(vma, address, pmd) do { } while (0)
+#define update_mmu_cache_pud(vma, address, pud) do { } while (0)
#ifdef CONFIG_ARM64_PA_BITS_52
#define phys_to_ttbr(addr) (((addr) | ((addr) >> 46)) & TTBR_BADDR_MASK_52)
--
2.27.0
Oh that sounds like a bigger step..
Just to mention, no real 1G THP is needed here for pfnmaps. The real gap is
only the pud helpers, which so far exist only under CONFIG_THP_PUD in
huge_memory.c.
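Put differently, the helper that huge pfnmap needs for the PUD case is only
built when the architecture selects THP_PUD support; roughly (paraphrasing
include/linux/huge_mm.h, not a verbatim quote):

#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
/* The PUD insertion helper huge pfnmap relies on for the 1G case. */
vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
#endif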
VFIO is so far the only consumer of huge pfnmaps after this series is
applied. Besides the generic remap_pfn_range() optimization above, a device
driver can also try to optimize its mmap() for better VA alignment at
PMD/PUD sizes. This may, IIUC, normally require userspace changes, as the
driver doesn't normally decide the VA at which a BAR is mapped. But I don't
know all the drivers well enough to have the full picture.
How does alignment work? In most cases I'm aware of, userspace does not use
MAP_FIXED, so the expectation would be for the kernel to automatically
select a high alignment. I suppose your cases are working because qemu uses
MAP_FIXED and naturally aligns the BAR addresses?
- x86_64 + AMD GPU
  - Needs Alex's modified QEMU to guarantee proper VA alignment to make sure
    all pages are mapped with PUDs
Oh :(
So I suppose this answers above. :) Yes, alignment needed.
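For what it's worth, when the kernel's default placement isn't aligned
enough, the usual userspace trick is to over-reserve and then place the BAR
at an aligned address inside the reservation. This is an illustration only,
not what QEMU actually does; the 1G alignment and the fd/offset arguments
are assumptions for the sketch:

#include <stdint.h>
#include <sys/types.h>
#include <sys/mman.h>

#define SZ_1G	(1UL << 30)

/* Map "size" bytes of a BAR-like fd at a 1G-aligned VA. */
static void *map_bar_pud_aligned(int fd, off_t offset, size_t size)
{
	size_t reserve = size + SZ_1G;
	uintptr_t aligned;
	void *base, *bar;

	/* Reserve more than needed so an aligned address must exist inside. */
	base = mmap(NULL, reserve, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return NULL;

	aligned = ((uintptr_t)base + SZ_1G - 1) & ~(SZ_1G - 1);

	/* Replace the aligned part of the reservation with the BAR mapping. */
	bar = mmap((void *)aligned, size, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_FIXED, fd, offset);
	if (bar == MAP_FAILED) {
		munmap(base, reserve);
		return NULL;
	}

	/*
	 * A real user would also munmap() the unused head and tail of the
	 * reservation; omitted here to keep the sketch short.
	 */
	return bar;
}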