On 5/18/23 13:20, Jeff Xu wrote:>> Here's my concern about this whole thing: it's headed down a rabbit hole >> which is *highly* specialized both in the apps that will use it and the >> attacks it will mitigate. It probably *requires* turning off a bunch of >> syscalls (like io_uring) that folks kinda like in general. >> > ChromeOS currently disabled io_uring, but it is not required to do so. > io_uring supports the IORING_OP_MADVICE operation, which calls the > do_madvise() function. This means that io_uring will have the same > pkey checks as the madvice() system call. From that perspective, we > will fully support io_uring for this feature. io_uring fundamentally doesn't have the same checks. The kernel side work can be done from an asynchronous kernel thread. That kernel thread doesn't have a meaningful PKRU value. The register has a value, but it's not really related to the userspace threads that are sending it requests. >> We're balancing that highly specialized mitigation with a feature that >> add new ABI, touches core memory management code and signal handling. >> > The ABI change uses the existing flag field in pkey_alloc() which is > reserved. The implementation is backward compatible with all existing > pkey usages in both kernel and user space. Or do you have other > concerns about ABI in mind ? I'm not worried about the past, I'm worried any time we add a new ABI since we need to support it forever. > Yes, you are right about the risk of touching core mm code. To > minimize the risk, I try to control the scope of the change (it is > about 3 lines in mprotect, more in munmap but really just 3 effective > lines from syscall entry). I added new self-tests in mm to make sure > it doesn't regress in api behavior. I run those tests before and after > my kernel code change to make sure the behavior remains the same, I > tested it on 5.15 and 6.1 and 6.4-rc1. Actually, the testing > discovered a behavior change for mprotect() between 6.1 and 6.4 (not > from this patch, there are refactoring works going on in mm) see this > thread [1] > I hope those steps will help to mitigate the risk. > > Agreed on signaling handling is a tough part: what do you think about > the approach (modifying PKRU from saved stack after XSAVE), is there a > blocker ? Yes, signal entry and sigreturn are not necessarily symmetric so you can't really have a stack. >> On the x86 side, PKRU is a painfully special snowflake. It's exposed in >> the "XSAVE" ABIs, but not actually managed *with* XSAVE in the kernel. >> This would be making it an even more special snowflake because it would > > I admit I'm quite ignorant on XSAVE to understand the above > statement, and how that is related. Could you explain it to me please > ? And what is in your mind that might improve the situation ? In a nutshell: XSAVE components are classified as either user or supervisor. User components can be modified from userspace and supervisor ones only from the kernel. In general, user components don't affect the kernel; the kernel doesn't care what is in ZMM11 (an XSAVE-managed register). That lets us do fun stuff like be lazy about when ZMM11 is saved/restored. Being lazy is good because it give us things like faster context switches and KVM VMEXIT handling. PKRU is a user component, but it affects the kernel when the kernel does copy_to/from_user() and friends. That means that the kernel can't do any "fun stuff" with PKRU. As soon as userspace provides a new value, the kernel needs to start respecting it. That makes PKRU a very special snowflake. So, even though PKRU can be managed by XSAVE, it isn't. It isn't kept in the kernel XSAVE buffer. But it *IS* in the signal stack XSAVE buffer. You *can* save/restore it with the other XSAVE components with ptrace(). The user<->kernel ABI pretends that PKRU is XSAVE managed even though it is not. All of this is special-cased. There's a ton of code to handle this mess. It's _complicated_. I haven't even started talking about how this interacts with KVM and guests. How could we improve it? A time machine would help to either change the architecture or have Linux ignore the fact that XSAVE knows anything about PKRU. So, the bar is pretty high for things that want to further muck with PKRU. Add signal and sigaltstack in particular into the fray, and we've got a recipe for disaster. sigaltstack and XSAVE don't really get along very well. https://lwn.net/Articles/862541/ >> need new altstack ABI and handling. >> > I thought adding protected memory support to signaling handling is an > independent project with its own weight. As Jann Horn points out in > [2]: "we could prevent the attacker from corrupting the signal > context if we can protect the signal stack with a pkey." However, > the kernel will send SIGSEGV when the stack is protected by PKEY, so > there is a benefit to make this work. (Maybe Jann can share some more > thoughts on the benefits) > > And I believe we could do this in a way with minimum ABI change, as below: > - allocate PKEY with a new flag (PKEY_ALTSTACK) > - at sigaltstack() call, detect the memory is PKEY_ALTSTACK protected, > (similar as what mprotect does in this patch) and save it along with > stack address/size. > - at signaling handling, use the saved info to fill in PKRU. > The ABI change is similar to PKEY_ENFORCE_API, and there is no > backward compatibility issue. > > Will these mentioned help our case ? What do you think ? To be honest, no. What you've laid out here is the tip of the complexity iceberg. There are a lot of pieces of the kernel that are not yet factored in. Let's also remember: protection keys is *NOT* a security feature. It's arguable that pkeys is a square peg trying to go into a round security hole.