On Sat, Aug 10, 2024 at 12:06:12AM +0100, Mark Brown wrote: > On Fri, Aug 09, 2024 at 07:19:26PM +0100, Catalin Marinas wrote: > > On Thu, Aug 08, 2024 at 09:15:25AM +0100, Mark Brown wrote: > > > + /* This should really be an atomic cmpxchg. It is not. */ > > > + if (access_remote_vm(mm, addr, &val, sizeof(val), > > > + FOLL_FORCE) != sizeof(val)) > > > + goto out; > > > If we restrict the shadow stack creation only to the CLONE_VM case, we'd > > not need the remote vm access, it's in the current mm context already. > > More on this below. > > The discussion in previous iterations was that it seemed better to allow > even surprising use cases since it simplifies the analysis of what we > have covered. If the user has specified a shadow stack we just do what > they asked for and let them worry about if it's useful. Thanks for the summary of the past discussions, the patch makes more sense now. I guess it's easier to follow a clone*() syscall where one can set a new stack pointer even in the !CLONE_VM case. Just let it set the shadow stack as well with the new ABI. However, the x86 would be slightly inconsistent here between clone() and clone3(). I guess it depends how you look at it. The classic clone() syscall, if one doesn't pass CLONE_VM but does set new stack, there's no new shadow stack allocated which I'd expect since it's a new stack. Well, I doubt anyone cares about this scenario. Are there real cases of !CLONE_VM but with a new stack? > > > + if (val != expected) > > > + goto out; > > > I'm confused that we need to consume the token here. I could not find > > the default shadow stack allocation doing this, only setting it via > > create_rstor_token() (or I did not search enough). In the default case, > > is the user consuming it? To me the only difference should been the > > default allocation vs the one passed by the user via clone3(), with the > > latter maybe requiring the user to set the token initially. > > As discussed for a couple of previous versions if we don't have the > token and userspace can specify any old shadow stack page as the shadow > stack this allows clone3() to be used to overwrite the shadow stack of > another thread, you can point to a shadow stack page which is currently > in use and then run some code that causes shadow stack writes. This > could potentially then in turn be used as part of a bigger exploit > chain, probably it's hard to get anything beyond just causing the other > thread to fault but won't be impossible. > > With a kernel allocated shadow stack this is not an issue since we are > placing the shadow stack in new memory, userspace can't control where we > place it so it can't overwrite an existing shadow stack. IIUC, the kernel-allocated shadow stack will have the token always set while the user-allocated one will be cleared. I was looking to understand the inconsistency between these two cases in terms of the final layout of the new shadow stack: one with the token, the other without. I can see the need for checking but maybe start with requiring it to be 0 and setting the token before returning, for consistency with clone(). In the kernel-allocated shadow stack, is the token used for anything? I can see it's used for signal delivery and return but I couldn't figure out what it is used for in a thread's shadow stack. Also, can one not use the clone3() to point to the clone()-allocated shadow stack? Maybe that's unlikely as an app tends to stick to one syscall flavour or the other. > > > + /* > > > + * For CLONE_VFORK the child will share the parents > > > + * shadow stack. Make sure to clear the internal > > > + * tracking of the thread shadow stack so the freeing > > > + * logic run for child knows to leave it alone. > > > + */ > > > + if (clone_flags & CLONE_VFORK) { > > > + shstk->base = 0; > > > + shstk->size = 0; > > > + return 0; > > > + } > > > I think we should leave the CLONE_VFORK check on its own independent of > > the clone3() arguments. If one passes both CLONE_VFORK and specific > > shadow stack address/size, they should be ignored (or maybe return an > > error if you want to make it stricter). > > This is existing logic from the current x86 code that's been reindented > due to the addition of explicitly specified shadow stacks, it's not new > behaviour. It is needed to stop the child thinking it has the parent's > shadow stack in the CLONE_VFORK case. I figured that. But similar to the current !CLONE_VM behaviour where no new shadow stack is allocated even if a new stack is passed to clone(), I was thinking of something similar here for consistency: don't set up a shadow stack in the CLONE_VFORK case or at least allow it only if a new stack is being set up (if we extend this to clone(), it would be a small ABI change). > > > - /* > > > - * For !CLONE_VM the child will use a copy of the parents shadow > > > - * stack. > > > - */ > > > - if (!(clone_flags & CLONE_VM)) > > > - return 0; > > > + /* > > > + * For !CLONE_VM the child will use a copy of the > > > + * parents shadow stack. > > > + */ > > > + if (!(clone_flags & CLONE_VM)) > > > + return 0; > > > Is the !CLONE_VM case specific only to the default shadow stack > > allocation? Sorry if this has been discussed already (or I completely > > forgot) but I thought we'd only implement this for the thread creation > > case. The typical fork() for a new process should inherit the parent's > > layout, so applicable to the clone3() with the shadow stack arguments as > > well (which should be ignored or maybe return an error with !CLONE_VM). > > This is again all existing behaviour for the case where the user has not > specified a shadow stack reindented, as mentioned above if the user has > specified one explicitly then we just do what we were asked. The > existing behaviour is to only create a new shadow stack for the child in > the CLONE_VM case and leave the child using the same shadow stack as the > parent in the copied mm for !CLONE_VM. I guess I was rather questioning the current choices than the new clone3() ABI. But even for the new clone3() ABI, does it make sense to set up a shadow stack if the current stack isn't changed? We'll end up with a lot of possible combinations that will never get tested but potentially become obscure ABI. Limiting the options to the sane choices only helps with validation and unsurprising changes later on. > > > @@ -2790,6 +2808,8 @@ pid_t kernel_clone(struct kernel_clone_args *args) > > > */ > > > trace_sched_process_fork(current, p); > > > > > > + shstk_post_fork(p, args); > > > Do we need this post fork call? Can we not handle the setup via the > > copy_thread() path in shstk_alloc_thread_stack()? > > It looks like we do actually have the new mm in the process before we > call copy_thread() so we could move things into there though we'd loose > a small bit of factoring out of the error handling (at one point I had > more code factored out but right now it's quite small, looking again we > could also factor out the get_task_mm()/mput()). ISTR having the new > process' mm was the biggest reason for this initially but looking again > I'm not sure why that was. It does still feel like even the small > amount that's factored out currently is useful though, a bit less > duplication in the architecture code which feels welcome here. I think you can probably keep this. My comment was based on the assumption that we only support the CLONE_VM case where we wouldn't need the access_remote_vm(), just some direct write similar to write_user_shstk_64(). I still think we should have limited this ABI to the CLONE_VM and !CLONE_VFORK cases but I don't have a strong view if the consensus was to allow it for classic fork() and vfork() like uses (I just think they won't be used). -- Catalin