On Thu, Aug 15, 2024 at 02:18:15PM +0100, Joey Gouly wrote: > Hi Catalin, > > On Wed, Aug 14, 2024 at 04:03:47PM +0100, Catalin Marinas wrote: > > Hi Joey, > > > > On Tue, Aug 06, 2024 at 03:31:03PM +0100, Joey Gouly wrote: > > > diff --git arch/arm64/kernel/signal.c arch/arm64/kernel/signal.c > > > index 561986947530..ca7d4e0be275 100644 > > > --- arch/arm64/kernel/signal.c > > > +++ arch/arm64/kernel/signal.c > > > @@ -1024,7 +1025,10 @@ static int setup_sigframe_layout(struct rt_sigframe_user_layout *user, > > > return err; > > > } > > > > > > - if (system_supports_poe()) { > > > + if (system_supports_poe() && > > > + (add_all || > > > + mm_pkey_allocation_map(current->mm) != 0x1 || > > > + read_sysreg_s(SYS_POR_EL0) != POR_EL0_INIT)) { > > > err = sigframe_alloc(user, &user->poe_offset, > > > sizeof(struct poe_context)); > > > if (err) > > > > > > > > > That is, we only save the POR_EL0 value if any pkeys have been allocated (other > > > than pkey 0) *or* if POR_EL0 is a non-default value. > > > > I had a chat with Dave as well on this and, in principle, we don't want > > to add stuff to the signal frame unnecessarily, especially for old > > binaries that have no clue of pkeys. OTOH, it looks like too complicated > > for just 16 bytes. Also POR_EL0 all RWX is a valid combination, I don't > > think we should exclude it. Unfortunately, this is always going to be the obviously simpler and more robust option for dealing with any new register state. In effect, the policy will be to push back on unconditional additions to the signal frame, except for 100% of proposed additions... I'm coming round to the view that trying to provide absolute guarantees about the signal frame size is unsustainable. x86 didn't, and got away with it for some time... Maybe we should just get rid of the relevant comments in headers, and water down guarantees in the SVE/SME documentation to recommendations with no promise attached? I can propose a patch for that. > > > > If no pkey has been allocated, I guess we could skip this and it also > > matches the x86 description of the PKRU being guaranteed to be preserved > > only for the allocated keys. Do we reserve pkey 0 for arm64? I thought > > that's only an x86 thing to emulate execute-only mappings. It's not clear whether pkey 0 is reserved in the sense of being permanently allocated, or in the sense of being unavailable for allocation. Since userspace gets pages with pkey 0 by default and can fiddle with the permissions on POR_EL0 and set this pkey onto pages using pkey_mprotect(), I'd say pkey 0 counts as always allocated; and the value of the POR_EL0 bits for pkey 0 needs to be maintained. > > To make it less complicated, I could drop the POR_EL0 check and just do: > > - if (system_supports_poe()) { > + if (system_supports_poe() && > + (add_all || > + mm_pkey_allocation_map(current->mm) != 0x1) { > > This wouldn't preserve the value of POR_EL0 if no pkeys had been allocated, but > that is fine, as you said / the man pages say. > > We don't preserve pkey 0, but it is the default for mappings and defaults to > RWX. So changing it probably will lead to unexpected things. > > > > > Another corner case would be the signal handler doing a pkey_alloc() and > > willing to populate POR_EL0 on sigreturn. It will have to find room in > > the signal handler, though I don't think that's a problem. > > pkey_alloc() doesn't appear in the signal safety man page, but that might just > be an omission due to permission keys being newer, than actually saying > pkey_alloc() can't be used. In practice this is likely to be a thin syscall wrapper; those are async-signal-safe in practice on Linux (but documentation tends to take a while to catch up). (Exceptions exists where "safe" calls are used in ways that interfere with the internal operation of libc... but those cases are mostly at least semi-obvious and rarely documented.) Using pkey_alloc() in a signal handler doesn't seem a great idea for more straightforward reasons, though: pkeys are a scarce, per-process resource, and allocating them asynchronously in the presence of other threads etc., doesn't seem like a recipe for success. I haven't looked, but I'd be surprised if there's any code doing this! Generally, it's too late to allocate any non-trivial kind of resource one you're in a signal handler... you need to plan ahead. > > If POR_EL0 isn't in the sig context, I think the signal handler could just > write the POR_EL0 system register directly? The kernel wouldn't restore POR_EL0 > in that case, so the value set in the signal handler would just be preserved. > > The reason that trying to preserve the value of POR_EL0 without any pkeys > allocated (like in the patch in my previous e-mail had) doesn't really make > sense, is that when you do pkey_alloc() you have to pass an initial value for > the pkey, so that will overwite what you may have manually written into > POR_EL0. Also you can't pass an unallocated pkey value to pkey_mprotect(). My argument here was that from the signal handler's point of view, the POR_EL0 value of the interrupted context lives in the sigframe if it's there (and will then be restored from there), and directly in POR_EL0 otherwise. Parsing the sigframe determine where the image of POR_EL0 is. I see two potential problems. 1) (probably not a big deal in practice) If the signal handler wants to withdraw a permission from pkey 0 for the interrupted context, and the interrupted context had no permission on any other pkey (so POR_EL0 is not dumped and the handler must update POR_EL0 directly before returning). In this scenario, the interrupted code would explode on return unless it can cope with globally execute-only or execute-read-only permissions. (no-execute is obviously dead on arrival). If a signal handler really really wanted to do this, it could return through an asm trampoline that is able to cope with the reduced permissions. This seems like a highly contrived scenario though, and I can't see how it could be useful... 2) (possibly a bigger deal) pkeys(7) does say explicitly (well, sort of) that the PKRU bits are restored on sigreturn. Since there are generic APIs to manipulate pkeys, it might cause problems if sigreturn restores the pkey permissions on some arches but not on others. Some non-x86-specific software might already be relying on the restoration of the permissions. > That's a lot of words to say, or ask, do you agree with the approach of only > saving POR_EL0 in the signal frame if num_allocated_pkeys() > 1? > > Thanks, > Joey ...So..., given all the above, it is perhaps best to go back to dumping POR_EL0 unconditionally after all, unless we have a mechanism to determine whether pkeys are in use at all. If we initially trapped POR_EL0, and set a flag the first time it is accessed or one of the pkeys syscalls is used, then we could dump POR_EL0 conditionally based on that: once the flag is set, we always dump it for that process. If the first POR_EL0 access or pkeys API interaction is in a signal handler, and that handler modifies POR_EL0, then it wouldn't get restored (or at least, not automatically). Not sure if this would ever matter in practice. It's potentially fiddly to make it work 100% consistently though (does a sigreturn count as a write? What if the first access to POR_EL0 is through ptrace, etc.?) Cheers ---Dave