On 07/07/2016 07:40 AM, Mel Gorman wrote: > Ok, so the last patch wired up the system call before the kernel was > tracking which numbers were in use. It doesn't really matter as such but > the patches should be swapped around and only expose the systemcall when > it's actually safe. I can do that. >> These system calls are also very important given the kernel's use >> of pkeys to implement execute-only support. These help ensure >> that userspace can never assume that it has control of a key >> unless it first asks the kernel. >> >> The 'init_access_rights' argument to pkey_alloc() specifies the >> rights that will be established for the returned pkey. For >> instance: >> >> pkey = pkey_alloc(flags, PKEY_DENY_WRITE); >> >> will allocate 'pkey', but also sets the bits in PKRU[1] such that >> writing to 'pkey' is already denied. This keeps userspace from >> needing to have knowledge about manipulating PKRU with the >> RDPKRU/WRPKRU instructions. Userspace is still free to use these >> instructions as it wishes, but this facility ensures it is no >> longer required. >> >> The kernel does _not_ enforce that this interface must be used for >> changes to PKRU, even for keys it does not control. >> >> The kernel does not prevent pkey_free() from successfully freeing >> in-use pkeys (those still assigned to a memory range by >> pkey_mprotect()). It would be expensive to implement the checks >> for this, so we instead say, "Just don't do it" since sane >> software will never do it anyway. > > Unfortunately, it could manifest as either corruption due to an area > expected to be protected being accessible or an unexpected SEGV. > > I accept the expensive arguement but it opens a new class of problems > that userspace debuggers will need to evaluate. Yeah. I guess it would be good to have a debugging mechanism here at least. >> +static inline >> +bool mm_pkey_is_allocated(struct mm_struct *mm, unsigned long pkey) >> +{ >> + if (!validate_pkey(pkey)) >> + return true; >> + >> + return mm_pkey_allocation_map(mm) & (1 << pkey); >> +} >> + > > We flip-flop between whether pkey is signed or unsigned. Yeah, I can add some consistency here, for sure. >> +SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned long, init_val) >> +{ >> + int pkey; >> + int ret; >> + >> + /* No flags supported yet. */ >> + if (flags) >> + return -EINVAL; >> + /* check for unsupported init values */ >> + if (init_val & ~PKEY_ACCESS_MASK) >> + return -EINVAL; >> + >> + down_write(¤t->mm->mmap_sem); >> + pkey = mm_pkey_alloc(current->mm); >> + >> + ret = -ENOSPC; >> + if (pkey == -1) >> + goto out; >> + >> + ret = arch_set_user_pkey_access(current, pkey, init_val); >> + if (ret) { >> + mm_pkey_free(current->mm, pkey); >> + goto out; >> + } >> + ret = pkey; >> +out: >> + up_write(¤t->mm->mmap_sem); >> + return ret; >> +} > > It's not wrong as such but mmap_sem taken for write seems *extremely* > heavy to protect the allocation mask. If userspace is racing a key > allocation with mprotect, it's already game over in terms of random > behaviour. > > I've no idea what the frequency of pkey alloc/free is expected to be. If > it's really low then maybe it doesn't matter but if it's high this is > going to be a bottleneck later. I think pkey_alloc() is fundamentally less frequent than mprotect(). If you're doing a pkey_alloc() it's because you want to set it on at least one memory area, which means at least one mprotect(). So, at _worst_, it's 1:1. If you've got more than one thing you're protecting, you'll have many mprotect()s for each pkey_alloc(). The real reason I did this, though, was to avoid having _some_ other lock. It'll cost more storage space, have more locking rules and I need exclusion against pkey_mprotect() which already holds mmap_sem for write. IOW, I think this was the simplest thing to do. -- To unsubscribe from this list: send the line "unsubscribe linux-api" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html