On Fri, Mar 10, 2023 at 1:43 PM Edgecombe, Rick P <rick.p.edgecombe@xxxxxxxxx> wrote: > > On Fri, 2023-03-10 at 13:00 -0800, Deepak Gupta wrote: > > On Fri, Mar 10, 2023 at 12:14:01AM +0000, Edgecombe, Rick P wrote: > > > On Thu, 2023-03-09 at 13:08 -0800, Deepak Gupta wrote: > > > > On Thu, Mar 09, 2023 at 07:39:41PM +0000, Edgecombe, Rick P > > > > wrote: > > > > > On Thu, 2023-03-09 at 10:55 -0800, Deepak Gupta wrote: > > > > > > On Thu, Mar 02, 2023 at 05:22:07PM +0000, Szabolcs Nagy > > > > > > wrote: > > > > > > > The 02/27/2023 14:29, Rick Edgecombe wrote: > > > > > > > > Previously, a new PROT_SHADOW_STACK was attempted, > > > > > > > > > > > > > > ... > > > > > > > > So rather than repurpose two existing syscalls (mmap, > > > > > > > > madvise) > > > > > > > > that don't > > > > > > > > quite fit, just implement a new map_shadow_stack syscall > > > > > > > > to > > > > > > > > allow > > > > > > > > userspace to map and setup new shadow stacks in one step. > > > > > > > > While > > > > > > > > ucontext > > > > > > > > is the primary motivator, userspace may have other > > > > > > > > unforeseen > > > > > > > > reasons to > > > > > > > > setup it's own shadow stacks using the WRSS instruction. > > > > > > > > Towards > > > > > > > > this > > > > > > > > provide a flag so that stacks can be optionally setup > > > > > > > > securely > > > > > > > > for the > > > > > > > > common case of ucontext without enabling WRSS. Or > > > > > > > > potentially > > > > > > > > have the > > > > > > > > kernel set up the shadow stack in some new way. > > > > > > > > > > > > > > ... > > > > > > > > The following example demonstrates how to create a new > > > > > > > > shadow > > > > > > > > stack with > > > > > > > > map_shadow_stack: > > > > > > > > void *shstk = map_shadow_stack(addr, stack_size, > > > > > > > > SHADOW_STACK_SET_TOKEN); > > > > > > > > > > > > > > i think > > > > > > > > > > > > > > mmap(addr, size, PROT_READ, MAP_ANON|MAP_SHADOW_STACK, -1, > > > > > > > 0); > > > > > > > > > > > > > > could do the same with less disruption to users (new > > > > > > > syscalls > > > > > > > are harder to deal with than new flags). it would do the > > > > > > > guard page and initial token setup too (there is no flag > > > > > > > for > > > > > > > it but could be squeezed in). > > > > > > > > > > > > Discussion on this topic in v6 > > > > > > > > > > > > > > > > > > > > > > > https://lore.kernel.org/all/20230223000340.GB945966@xxxxxxxxxxxxxxxxxxxxx/ > > > > > > > > > > > > Again I know earlier CET patches had protection flag and > > > > > > somehow > > > > > > due > > > > > > to pushback > > > > > > on mailing list, > > > > > > it was adopted to go for special syscall because no one else > > > > > > had shadow stack. > > > > > > > > > > > > Seeing a response from Szabolcs, I am assuming arm4 would > > > > > > also > > > > > > want > > > > > > to follow > > > > > > using mmap to manufacture shadow stack. For reference RFC > > > > > > patches > > > > > > for > > > > > > risc-v shadow stack, > > > > > > use a new protection flag = PROT_SHADOWSTACK. > > > > > > > > > > > > > > > > > > > > > > > https://lore.kernel.org/lkml/20230213045351.3945824-1-debug@xxxxxxxxxxxx/ > > > > > > > > > > > > I know earlier discussion had been that we let this go and do > > > > > > a > > > > > > re- > > > > > > factor later as other > > > > > > arch support trickle in. But as I thought more on this and I > > > > > > think it > > > > > > may just be > > > > > > messy from user mode point of view as well to have cognition > > > > > > of > > > > > > two > > > > > > different ways of > > > > > > creating shadow stack. One would be special syscall (in > > > > > > current > > > > > > libc) > > > > > > and another `mmap` > > > > > > (whenever future re-factor happens) > > > > > > > > > > > > If it's not too late, it would be more wise to take `mmap` > > > > > > approach rather than special `syscall` approach. > > > > > > > > > > There is sort of two things intermixed here when we talk about > > > > > a > > > > > PROT_SHADOW_STACK. > > > > > > > > > > One is: what is the interface for specifying how the shadow > > > > > stack > > > > > should be provisioned with data? Right now there are two ways > > > > > supported, all zero or with an X86 shadow stack restore token > > > > > at > > > > > the > > > > > end. Then there was already some conversation about a third > > > > > type. > > > > > In > > > > > which case the question would be is using mmap MAP_ flags the > > > > > right > > > > > place for this? How many types of initialization will be needed > > > > > in > > > > > the > > > > > end and what is the overlap between the architectures? > > > > > > > > First of all, arches can choose to have token at the bottom or > > > > not. > > > > > > > > Token serve following purposes > > > > - It allows one to put desired value in shadow stack pointer in > > > > safe/secure manner. > > > > Note: x86 doesn't provide any opcode encoding to value in SSP > > > > register. So having > > > > a token is kind of a necessity because x86 doesn't easily > > > > allow > > > > writing shadow stack. > > > > > > > > - A token at the bottom acts marker / barrier and can be useful > > > > in > > > > debugging > > > > > > > > - If (and a big *if*) we ever reach a point in future where > > > > return > > > > address is only pushed > > > > on shadow stack (x86 should have motivation to do this > > > > because > > > > less uops on call/ret), > > > > a token at the bottom (bottom means lower address) is > > > > ensuring > > > > sure shot way of getting > > > > a fault when exhausted. > > > > > > > > Current RISCV zisslpcfi proposal doesn't define CPU based tokens > > > > because it's RISC. > > > > It allows mechanisms using which software can define formatting > > > > of > > > > token for itself. > > > > Not sure of what ARM is doing. > > > > > > Ok, so riscv doesn't need to have the kernel write the token, but > > > x86 > > > does. > > > > > > > > > > > Now coming to the point of all zero v/s shadow stack token. > > > > Why not always have token at the bottom? > > > > > > With WRSS you can setup the shadow stack however you want. So the > > > user > > > would then have to take care to erase the token if they didn't want > > > it. > > > Not the end of the world, but kind of clunky if there is no reason > > > for > > > it. > > > > Yes but kernel always assumes the user is going to use the token. It' > > upto the user > > to decide whether they want to use the restore token or not. If > > they've WRSS capability > > security posture is anyways diluted. An attacker who would be clever > > enough to > > re-use `RSTORSSP` present in address space to restore using kernel > > prepared token, should > > anyways can be clever enough to use WRSS as well. > > > > It kind of makes shadow stack creation simpler for kernel to always > > place the token. > > This point is irrespective of whether to use system call or mmap. > > Think about like CRIU restoring the shadow stack, or other special > cases like that. Userspace can always overwrite the token, but this > involves some amount of extra work (extra writes, earlier faulting in > the page, etc). It is clunky and very negligibly worse. Earlier faulting in the page because the kernel is writing the token at base? > > > > > > > > > > > > > > In case of x86, Why need for two ways and why not always have a > > > > token > > > > at the bottom. > > > > The way x86 is going, user mode is responsible for establishing > > > > shadow stack and thus > > > > whenever shadow stack is created then if x86 kernel > > > > implementation > > > > always place a token > > > > at the base/bottom. > > > > > > There was also some discussion recently of adding a token AND an > > > end of > > > stack marker, as a potential solution for backtracing in ucontext > > > stacks. In this case it could cause an ABI break to just start > > > adding > > > the end of stack marker where the token was, and so would require a > > > new > > > map_shadow_stack flag. > > > > Was this discussed why restore token itself can't be used as marker > > for > > end of stack (if we assume there is always going to be one at the > > bottom). > > It's a unique value. An address pointing to itself. > > I thought the same thing at first, but it gets clobbered during the > pivot and push. aah I remember. It was changed from savessp/rstorssp pair to rstorssp/saveprevssp to follow the `make before break` model. > > > > > > > > > > > > > > Now user mode can do following:-- > > > > - If it has access to WRSS, it can sure go ahead and create a > > > > token > > > > of its choosing and > > > > overwrite kernel created token. and then do RSTORSSP on it's > > > > own > > > > created token. > > > > > > > > - If it doesn't have access to WRSS (and dont need to create > > > > its > > > > own token), it can do > > > > RSTORSSP on this. As soon as it does, no other thread in > > > > process > > > > can restore to it. > > > > On `fork`, you get the same un-restorable token. > > > > > > > > So why not always have a token at the bottom. > > > > This is my plan for riscv implementation as well (to have a token > > > > at > > > > the bottom) > > > > > > > > > > > > > > The other thing is: should shadow stack memory creation be > > > > > tightly > > > > > controlled? For example in x86 we limit this to anonymous > > > > > memory, > > > > > etc. > > > > > Some reasons for this are x86 specific, but some are not. So if > > > > > we > > > > > disallow most of the options why allow the interface to take > > > > > them? > > > > > And > > > > > then you are in the position of carefully maintaining a list of > > > > > not- > > > > > allowed options instead letting a list of allowed options sit > > > > > there. > > > > > > > > I am new to linux kernel and thus may be not able to follow the > > > > argument of > > > > limiting to anonymous memory. > > > > > > > > Why is limiting it to anonymous memory a problem. IIRC, ARM's > > > > PROT_MTE is applicable > > > > only to anonymous memory. I can probably find few more examples. > > > > > > Oh I see, they have a special arch VMA flag VM_MTE_ALLOWED that > > > only > > > gets set if all the rules are followed. Then PROT_MTE can only be > > > set > > > on that to set VM_MTE. That is kind of nice because certain other > > > special situations can choose to support it. > > > > That's because MTE is different. It allows to assign tags to existing > > virtual memory. So one need to know whether a memory can have tags > > assigned. > > > > > > > > It does take another arch vma flag though. For x86 I guess I would > > > need > > > to figure out how to squeeze VM_SHADOW_STACK into other flags to > > > have a > > > free flag to use the same method. It also only supports mprotect() > > > and > > > shadow stack would only want to support mmap(). And you still have > > > the > > > initialization stuff to plumb through. Yea, I think the PROT_MTE is > > > a > > > good thing to consider, but it's not super obvious to me how > > > similar > > > the logic would be for shadow stack. > > > > I dont think you need another VMA flag. Memory tagging allows adding > > tags > > to existing virtual memory. > > ...need another VMA flag to use the existing mmap arch breakouts in the > same way as VM_MTE. Of course changing mmap makes other solutions > possible. > > > That's why having `mprotect` makes sense for MTE. > > In shadow stack case, there is no requirement of changing a shadow > > stack > > to regular memory or vice-versa. > > uffd needs mprotect internals. You might take a look at it in regards > to your VM_WRITE/mprotect blocking approach for riscv. I was imagining, > even if mmap was the syscall, mprotect() would not be blocked in the > x86 case at least. The mprotect() blocking is a separate thing than the > syscall, right? Yes, mprotect blocking is a different thing. VM_XXX flags are not exposed to mprotect (or any memory mapping API). PROT_XXX flags are. On riscv, in my current plan if mprotect or mmap specifies PROT_WRITE (no PROT_READ), It'll be mapped to `VM_READ | VM_WRITE` on vma flags (this is to make sure we don't break compat with existing user code which has been using only PROT_WRITE) If PROT_SHADOWSTACK (new protection flag) is specified, it'll be mapped to `VM_WRITE` on the vma_flag. Yes I am aware of uffd. I intend to handle it the same way I am handling the fork of riscv shadow stack. core-mm write protect checks if VM_WRiTE is specified it'll convert PTE encodings to read-only. uffd for regular memory should work as it is. In case if someone was monitoring shadow stack memory, following could occur 1) A write happens on shadow stack memory, a store page fault would occur. 2) A read happens, this would be allowed. 3) A shadow stack load / store happens, a store access fault would occur. Case 1 and 3 are reported to the kernel and it can make sure the uffd monitor is reported about it. Was there a specific concern here with respect to uffd and x86 shadow stack? > > > > > All that's needed to change is `mmap`. `mprotect` should fail. > > Syscall > > approach gives that benefit by default because there is no protection > > flag > > for shadow stack. > > > > I was giving example that any feature which gives new meaning to > > virtual memory > > has been able to work with existing memory mapping APIs without the > > need of new > > system call (including whether you're dealing with anonymous memory). > > > > > > > > The question I'm asking though is, not "can mmap code and rules be > > > changed to enforce the required limitations?". I think it is yes. > > > But > > > the question is "why is that plumbing better than a new syscall?". > > > I > > > guess to get a better idea, the mmap solution would need to get > > > POCed. > > > I had half done this at one point, but abandoned the approach. > > > > > > For your question about why limit it, the special x86 case is the > > > Dirty=1,Write=0 PTE bit combination for shadow stacks. So for > > > shadow > > > stack you could have some confusion about whether a PTE is actually > > > dirty for writeback, etc. I wouldn't say it's known to be > > > impossible to > > > do MAP_SHARED, but it has not been fully analyzed enough to know > > > what > > > the changes would be. There were some solvable concrete issues that > > > tipped the scale as well. It was also not expected to be a common > > > usage, if at all. > > > > I am not sure how confusion of D=1,W=0 is not completely taken away > > by > > syscall approach. It'll always be there. One can only do things to > > minimize > > the chances. > > > > In case of syscall approach, syscall makes sure that > > > > `flags = MAP_ANONYMOUS | MAP_PRIVATE | MAP_ABOVE4G` > > > > This can be easily checked in arch specific landing function for > > mmap. > > Right, this is why I listed two types of things in the mix here. The > memory features supported, and what the syscall is. You asked why limit > the memory features, so that is the explanation. > > > > > > > Additionally, If you always have the token at base, you don't need > > that ABI > > between user and kernel. > > > > > > > > > > The non-x86, general reasons for it, are for a smaller benefit. It > > > blocks a lot of ways shadow stack memory could be written to. Like > > > say > > > you have a memory mapped writable file, and you also map it shadow > > > stack. So it has better security properties depending on what your > > > threat model is. > > > > I wouldn't say any architecture should allow such primitives. It kind > > of defeats > > the purpose for shadow stack. Yes if some sort of secure memory is > > needed, there may > > be new ISA extensions for that. > > Yea, seems reasonable to prevent this regardless of the extra x86 > reasons, if that is what you are saying. It depends on people's threat > models (as always in security). > > > > > > > > > > > > > > Eventually syscall will also go ahead and use memory management > > > > code > > > > to > > > > perform mapping. So I didn't understand the reasoning here. The > > > > way > > > > syscall > > > > can limit it to anonymous memory, why mmap can't do the same if > > > > it > > > > sees > > > > PROT_SHADOWSTACK. > > > > > > > > > > > > > > The only benefit I've heard is that it saves creating a new > > > > > syscall, > > > > > but it also saves several MAP_ flags. That, and that the RFC > > > > > for > > > > > riscv > > > > > did a PROT_SHADOW_STACK to start. So, yes, two people asked the > > > > > same > > > > > question, but I'm still not seeing any benefits. Can you give > > > > > the > > > > > pros > > > > > and cons please? > > > > > > > > Again the way syscall will limit it to anonymous memory, Why mmap > > > > can't do same? > > > > There is precedence for it (like PROT_MTE is applicable only to > > > > anonymous memory) > > > > > > > > So if it can be done, then why introduce a new syscall? > > > > > > > > > > > > > > BTW, in glibc map_shadow_stack is called from arch code. So I > > > > > think > > > > > userspace wise, for this to affect other architectures there > > > > > would > > > > > need > > > > > to be some code that could do things generically, with somehow > > > > > the > > > > > shadow stack pivot abstracted but the shadow stack allocation > > > > > not. > > > > > > > > Agreed, yes it can be done in a way where it won't put tax on > > > > other > > > > architectures. > > > > > > > > But what about fragmentation within x86. Will x86 always choose > > > > to > > > > use system call > > > > method map shadow stack. If future re-factor results in x86 also > > > > use > > > > `mmap` method. > > > > Isn't it a mess for x86 glibc to figure out what to do; whether > > > > to > > > > use system call > > > > or `mmap`? > > > > > > > > > > Ok, so this is the downside I guess. What happens if we want to > > > support > > > the other types of memory in the future and end up using mmap for > > > this? > > > Then we have 15-20 lines of extra syscall wrapping code to maintain > > > to > > > support legacy. > > > > > > For the mmap solution, we have the downside of using extra MAP_ > > > flags, > > > and *some* amount of currently unknown vm_flag and address range > > > logic, > > > plus mmap arch breakouts to add to core MM. Like I said earlier, > > > you > > > would need to POC it out to see how bad that looks and get some > > > core MM > > > feedback on the new type of MAP flag usage. But, syscalls being > > > pretty > > > straightforward, it would probably be *some* amount of added > > > complexity > > > _now_ to support something that might happen in the future. I'm not > > > seeing either one as a landslide win. > > > > > > It's kind of an eternal software design philosophical question, > > > isn't > > > it? How much work should you do to prepare for things that might be > > > needed in the future? From what I've seen the balance in the kernel > > > seems to be to try not to paint yourself in to an ABI corner, but > > > otherwise let the kernel evolve naturally in response to real > > > usages. > > > If anyone wants to correct this, please do. But otherwise I think > > > the > > > new syscall is aligned with that. > > > > > > TBH, you are making me wonder if I'm missing something. It seems > > > you > > > strongly don't prefer this approach, but I'm not hearing any huge > > > potential negative impacts. And you also say it won't tax the riscv > > > implementation. Is this just something just smells bad here? Or it > > > would shrink the riscv series? > > > > No you're not missing anything. It's just wierdness of adding a > > system call > > which enforces certain MAP_XX flags and pretty much mapping API. > > And difference between architectures on how they will create shadow > > stack. + > > if x86 chooses to use `mmap` in future, then there is ugliness in > > user mode to > > decide which method to choose. > > Ok, I think I will leave it given it's entirely in arch/x86. It just > got some special error codes in the other thread today too. > > > > > And yes you got it right, to some extent there is my own selfishness > > playing out > > as well here to reduce riscv patches. > > > > Feel free to join the map_shadow_stack party. :) I am warming up to it :-)