On Mon, Sep 9, 2024, at 23:22, Charlie Jenkins wrote:
> On Fri, Sep 06, 2024 at 10:52:34AM +0100, Lorenzo Stoakes wrote:
>> On Fri, Sep 06, 2024 at 09:14:08AM GMT, Arnd Bergmann wrote:
>> The intent is to optionally be able to run a process that keeps higher bits
>> free for tagging and to be sure no memory mapping in the process will
>> clobber these (correct me if I'm wrong Charlie! :)
>>
>> So you really wouldn't want this if you are using tagged pointers, you'd
>> want to be sure literally nothing touches the higher bits.

My understanding was that the purpose of the existing design
is to allow applications to ask for a high address without having
to resort to the complexity of MAP_FIXED.

In particular, I'm sure there is precedent for applications that
want both tagged pointers (for most mappings) and untagged pointers
(for large mappings). With a per-mm_struct or per-task_struct
setting you can't do that.

> Various architectures handle the hint address differently, but it
> appears that the only case across any architecture where an address
> above 47 bits will be returned is if the application had a hint address
> with a value greater than 47 bits and was using the MAP_FIXED flag.
> MAP_FIXED bypasses all other checks so I was assuming that it would be
> logical for MAP_FIXED to bypass this as well. If MAP_FIXED is not set,
> then the intent is for no hint address to cause a value greater than 47
> bits to be returned.

I don't think the MAP_FIXED case is that interesting here because
it has to work in both fixed and non-fixed mappings.

>> This would be more consistent vs. other arches.
>
> Yes riscv is an outlier here. The reason I am pushing for something like
> a flag to restrict the address space rather than setting it to be the
> default is it seems like if applications are relying on upper bits to be
> free, then they should be explicitly asking the kernel to keep them free
> rather than assuming them to be free.

Let's see what the other architectures do and then come up with
a way that fixes the pointer tagging case first on those that are
broken. We can see if there needs to be an extra flag after that.
Here is what I found:

- x86_64 uses DEFAULT_MAP_WINDOW of BIT(47), uses a 57 bit
  address space when an addr hint is passed.
- arm64 uses DEFAULT_MAP_WINDOW of BIT(47) or BIT(48), returns
  higher 52-bit addresses when either a hint is passed or
  CONFIG_EXPERT and CONFIG_ARM64_FORCE_52BIT is set (this
  is a debugging option)
- ppc64 uses a DEFAULT_MAP_WINDOW of BIT(47) or BIT(48),
  returns 52 bit address when an addr hint is passed
- riscv uses a DEFAULT_MAP_WINDOW of BIT(47) but only uses
  it for allocating the stack below, ignoring it for normal
  mappings
- s390 has no DEFAULT_MAP_WINDOW but tried to allocate in
  the current number of pgtable levels and only upgrades to
  the next level (31, 42, 53, 64 bits) if a hint is passed or
  the current level is exhausted.
- loongarch64 has no DEFAULT_MAP_WINDOW, and a default VA
  space of 47 bits (16K pages, 3 levels), but can support
  a 55 bit space (64K pages, 3 levels).
- sparc has no DEFAULT_MAP_WINDOW and up to 52 bit VA space.
  It may allocate both positive and negative addresses in
  there. (?)
- mips64, parisc64 and alpha have no DEFAULT_MAP_WINDOW and
  at most 48, 41 or 39 address bits, respectively.

I would suggest these changes:

- make riscv enforce DEFAULT_MAP_WINDOW like x86_64, arm64
  and ppc64, leave it at 47
- add DEFAULT_MAP_WINDOW on loongarch64 (47/48 bits
  based on page size), sparc (48 bits) and s390 (unsure if
  42, 53, 47 or 48 bits)
- leave the rest unchanged.

      Arnd
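
For anyone who wants to check the hint behaviour listed above on a
given machine, a minimal userspace sketch follows. It assumes a 64-bit
Linux system. WINDOW_BITS, above_window(), map_with_hint() and report()
are hypothetical names made up for this illustration, not kernel or
libc interfaces, and the 47-bit threshold is simply the
DEFAULT_MAP_WINDOW value discussed above.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

/* Assumption: 47 bits, matching the DEFAULT_MAP_WINDOW discussed above. */
#define WINDOW_BITS 47

/* Nonzero if the address has any bit set at or above bit WINDOW_BITS. */
static int above_window(const void *addr)
{
	return ((uint64_t)(uintptr_t)addr >> WINDOW_BITS) != 0;
}

/* Anonymous private mapping with an address hint and no MAP_FIXED. */
static void *map_with_hint(void *hint, size_t len)
{
	return mmap(hint, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

/* Map one 4096-byte region, print where it landed, then unmap it. */
static int report(const char *label, void *hint)
{
	void *p = map_with_hint(hint, 4096);

	if (p == MAP_FAILED) {
		perror(label);
		return -1;
	}
	printf("%s: %p (%s)\n", label, p,
	       above_window(p) ? "above 47 bits" : "within 47 bits");
	munmap(p, 4096);
	return 0;
}
```

Calling report("no hint", NULL) and then
report("high hint", (void *)(1UL << 48)) from main() shows whether the
kernel only leaves the 47-bit window when asked to. The result depends
on the architecture, the kernel configuration and the number of page
table levels in use, so no particular output is implied here.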