On Fri, Aug 16, 2024 at 12:21:18PM -0700, Sean Christopherson wrote: > On Thu, Aug 15, 2024, Jason Gunthorpe wrote: > > This is used by x86 CPUs and can be used in both x86 IOMMUs. When the x86 > > IOMMU is running SVA it is using this page table format. > > > > This implementation follows the AMD v2 io-pgtable version. > > > > There is nothing remarkable here, the format has a variable top and > > limited support for different page sizes and no contiguous pages support. > > > > In principle this can support the 32 bit configuration with fewer table > > levels. > > What's "the 32 bit configuration"? Oh, the three level version > > +/* SPDX-License-Identifier: GPL-2.0-only */ > > +/* > > + * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES > > + * > > + * x86 PAE page table > > + * > > + * This is described in > > + * Section "4.4 PAE Paging" of the Intel Software Developer's Manual Volume 3 > > I highly doubt what's implemented here is actually PAE paging, as the SDM (that > is referenced above) and most x86 folks describe PAE paging. PAE paging is > specifically used when the CPU is in 32-bit mode (NOT including compatibility mode!). > > PAE paging translates 32-bit linear addresses to 52-bit physical addresses. > Presumably what's implemented here is what Intel calls 4-level and 5-level paging. > Those are _really_ similar to PAE paging, I *think* this needs to support the three level "PAE" format as well? I'm not really sure yet, but it looked to me like really old non-scalable Intel vt-d iommu might need it?? Or maybe it uses the vtdss format with 3 levels.. It is not something I've checked deeply into yet. So the intention was to also capture the 32 bit PAE format with three levels and the 4/5 level as well. The idea will be that like arm and others you select how many levels you want when you init the table. If the three level format needs some bit adjustments also it would be done with a feature bit. I haven't yet got to comparing against the bit patterns the VT-D driver uses for this file, but I expect any differences are minor. > Unfortuntately, I have no idea what name to use for this flavor. x86pae is > actually kinda good, but I think it'll be confusing to people that are familiar > with the more canonical version of PAE paging. I struggled too. The name wasn't good to me either.. I think if this is confusing lets just call it x86_64? Sort of a focus on the new. > > + * Section "2.2.6 I/O Page Tables for Guest Translations" of the "AMD I/O > > + * Virtualization Technology (IOMMU) Specification" > > + * > > + * It is used by x86 CPUs and The AMD and VT-D IOMMU HW. > > + * > > + * The named levels in the spec map to the pts->level as: > > + * Table/PTE - 0 > > + * Directory/PDE - 1 > > + * Directory Ptr/PDPTE - 2 > > + * PML4/PML4E - 3 > > + * PML5/PML5E - 4 > > Any particularly reason not to use x86's (and KVM's) effective 1-based system? > (level '0' is essentially the 4KiB leaf entries in a page table) Not any super strong one. The math is slightly more natural with 0 based, for instance the most general version is arm 32 bit: return PT_GRANUAL_LG2SZ + (PT_TABLEMEM_LG2SZ - ilog2(sizeof(u32))) * pts->level; This is the only case where PT_GRANUAL_LG2SZ=12 and PT_TABLEMEM_LG2SZ=10, so the above needs no adjustment to level. It also ensures that 0 is not an invalid value that needs to be considered, and that little detail helps a few micro optimization. Every document seems to have its own take of this, the intel/amd ones all like to start at 1 and go up, the ARM ones are reversed and start at 4 and goes down to 0. > Starting at '1' is kinda odd, but it aligns with thing like PML4/5, > allows using the pg_level enums from x86, and diverging from both > x86 MM and KVM is likely going to confuse people. And ARM people will say not using their totally different numbers confuses them. I feel there is no winning here. So I went with something mathematically clean and assumed we'd have this discussion :) At the end of the day the intersting stuff is done using the generic code and API, so that can't make assumptions about the structure from any of the documents. In that regard having it be different from everything else (because it has to be a superset of everything else) is not necessarily a bad thing. In truth the number of places where you have to look at level is really pretty small so I felt this was OK. > > +/* Shared descriptor bits */ > > +enum { > > + X86PAE_FMT_P = BIT(0), > > + X86PAE_FMT_RW = BIT(1), > > + X86PAE_FMT_U = BIT(2), > > + X86PAE_FMT_A = BIT(5), > > + X86PAE_FMT_D = BIT(6), > > + X86PAE_FMT_OA = GENMASK_ULL(51, 12), > > + X86PAE_FMT_XD = BIT_ULL(63), > > Any reason not to use the #defines in arch/x86/include/asm/pgtable_types.h? This is arch independent code, I don't think I can include that header from here?? I've seen Linus be negative about wild ../../ includes.. Keeping everything here arch independent is one of the big value adds here, IMHO. > > +static inline bool x86pae_pt_install_table(struct pt_state *pts, > > + pt_oaddr_t table_pa, > > + const struct pt_write_attrs *attrs) > > +{ > > + u64 *tablep = pt_cur_table(pts, u64); > > + u64 entry; > > + > > + /* > > + * FIXME according to the SDM D is ignored by HW on table pointers? > > Correct, only leaf entries have dirty bits. To add some colour, this logic is exactly matching the bit patterns that existing amd v2 iommu code is creating. It looks to me like the AMD IOMMU manual also says it ignores this bit in table levels, so it is possibly a little mistake in the existing code. I'll make a little patch fixing it and ask them that way.. > > + * io_pgtable_v2 sets it > > + */ > > + entry = X86PAE_FMT_P | X86PAE_FMT_RW | X86PAE_FMT_U | X86PAE_FMT_A | > > What happens with the USER bit for I/O page tables? Ignored, I assume? Not ignored, it does stuff. AMD's IOMMU manual says: U/S: User/Supervisor. IOMMU uses same meaning as AMD64 processor page tables. 0=access is restricted to supervisor level. 1=both user and supervisor access is allowed. Software Note: For a peripheral not using U/S, software should program the bit to signal user mode. If MMIO Offset 0030h[USSup] = 0, this field is ignored IIRC it also comes out through the ATS replies. I expect VTD is similar. Thanks for checking, your remarks in the room at LPC were inspiring this was the right way to go! Thanks, Jason