Re: [PATCH RFT v8 4/9] fork: Add shadow stack support to clone3()

Catalin Marinas <catalin.marinas@xxxxxxx> · Tue, 13 Aug 2024 17:25:47 +0100

On Sat, Aug 10, 2024 at 12:06:12AM +0100, Mark Brown wrote:
> On Fri, Aug 09, 2024 at 07:19:26PM +0100, Catalin Marinas wrote:
> > On Thu, Aug 08, 2024 at 09:15:25AM +0100, Mark Brown wrote:
> > > +	/* This should really be an atomic cmpxchg.  It is not. */
> > > +	if (access_remote_vm(mm, addr, &val, sizeof(val),
> > > +			     FOLL_FORCE) != sizeof(val))
> > > +		goto out;
> 
> > If we restrict the shadow stack creation only to the CLONE_VM case, we'd
> > not need the remote vm access, it's in the current mm context already.
> > More on this below.
> 
> The discussion in previous iterations was that it seemed better to allow
> even surprising use cases since it simplifies the analysis of what we
> have covered.  If the user has specified a shadow stack we just do what
> they asked for and let them worry about if it's useful.

Thanks for the summary of the past discussions, the patch makes more
sense now. I guess it's easier to follow a clone*() syscall where one
can set a new stack pointer even in the !CLONE_VM case. Just let it set
the shadow stack as well with the new ABI.

However, the x86 would be slightly inconsistent here between clone() and
clone3(). I guess it depends how you look at it. The classic clone()
syscall, if one doesn't pass CLONE_VM but does set new stack, there's no
new shadow stack allocated which I'd expect since it's a new stack.
Well, I doubt anyone cares about this scenario. Are there real cases of
!CLONE_VM but with a new stack?

> > > +	if (val != expected)
> > > +		goto out;
> 
> > I'm confused that we need to consume the token here. I could not find
> > the default shadow stack allocation doing this, only setting it via
> > create_rstor_token() (or I did not search enough). In the default case,
> > is the user consuming it? To me the only difference should been the
> > default allocation vs the one passed by the user via clone3(), with the
> > latter maybe requiring the user to set the token initially.
> 
> As discussed for a couple of previous versions if we don't have the
> token and userspace can specify any old shadow stack page as the shadow
> stack this allows clone3() to be used to overwrite the shadow stack of
> another thread, you can point to a shadow stack page which is currently
> in use and then run some code that causes shadow stack writes.  This
> could potentially then in turn be used as part of a bigger exploit
> chain, probably it's hard to get anything beyond just causing the other
> thread to fault but won't be impossible.
> 
> With a kernel allocated shadow stack this is not an issue since we are
> placing the shadow stack in new memory, userspace can't control where we
> place it so it can't overwrite an existing shadow stack.

IIUC, the kernel-allocated shadow stack will have the token always set
while the user-allocated one will be cleared. I was looking to
understand the inconsistency between these two cases in terms of the
final layout of the new shadow stack: one with the token, the other
without. I can see the need for checking but maybe start with requiring
it to be 0 and setting the token before returning, for consistency with
clone().

In the kernel-allocated shadow stack, is the token used for anything? I
can see it's used for signal delivery and return but I couldn't figure
out what it is used for in a thread's shadow stack.

Also, can one not use the clone3() to point to the clone()-allocated
shadow stack? Maybe that's unlikely as an app tends to stick to one
syscall flavour or the other.

> > > +		/*
> > > +		 * For CLONE_VFORK the child will share the parents
> > > +		 * shadow stack.  Make sure to clear the internal
> > > +		 * tracking of the thread shadow stack so the freeing
> > > +		 * logic run for child knows to leave it alone.
> > > +		 */
> > > +		if (clone_flags & CLONE_VFORK) {
> > > +			shstk->base = 0;
> > > +			shstk->size = 0;
> > > +			return 0;
> > > +		}
> 
> > I think we should leave the CLONE_VFORK check on its own independent of
> > the clone3() arguments. If one passes both CLONE_VFORK and specific
> > shadow stack address/size, they should be ignored (or maybe return an
> > error if you want to make it stricter).
> 
> This is existing logic from the current x86 code that's been reindented
> due to the addition of explicitly specified shadow stacks, it's not new
> behaviour.  It is needed to stop the child thinking it has the parent's
> shadow stack in the CLONE_VFORK case.

I figured that. But similar to the current !CLONE_VM behaviour where no
new shadow stack is allocated even if a new stack is passed to clone(),
I was thinking of something similar here for consistency: don't set up a
shadow stack in the CLONE_VFORK case or at least allow it only if a new
stack is being set up (if we extend this to clone(), it would be a small
ABI change).

> > > -	/*
> > > -	 * For !CLONE_VM the child will use a copy of the parents shadow
> > > -	 * stack.
> > > -	 */
> > > -	if (!(clone_flags & CLONE_VM))
> > > -		return 0;
> > > +		/*
> > > +		 * For !CLONE_VM the child will use a copy of the
> > > +		 * parents shadow stack.
> > > +		 */
> > > +		if (!(clone_flags & CLONE_VM))
> > > +			return 0;
> 
> > Is the !CLONE_VM case specific only to the default shadow stack
> > allocation? Sorry if this has been discussed already (or I completely
> > forgot) but I thought we'd only implement this for the thread creation
> > case. The typical fork() for a new process should inherit the parent's
> > layout, so applicable to the clone3() with the shadow stack arguments as
> > well (which should be ignored or maybe return an error with !CLONE_VM).
> 
> This is again all existing behaviour for the case where the user has not
> specified a shadow stack reindented, as mentioned above if the user has
> specified one explicitly then we just do what we were asked.  The
> existing behaviour is to only create a new shadow stack for the child in
> the CLONE_VM case and leave the child using the same shadow stack as the
> parent in the copied mm for !CLONE_VM.

I guess I was rather questioning the current choices than the new
clone3() ABI. But even for the new clone3() ABI, does it make sense to
set up a shadow stack if the current stack isn't changed? We'll end up
with a lot of possible combinations that will never get tested but
potentially become obscure ABI. Limiting the options to the sane choices
only helps with validation and unsurprising changes later on.

> > > @@ -2790,6 +2808,8 @@ pid_t kernel_clone(struct kernel_clone_args *args)
> > >  	 */
> > >  	trace_sched_process_fork(current, p);
> > >  
> > > +	shstk_post_fork(p, args);
> 
> > Do we need this post fork call? Can we not handle the setup via the
> > copy_thread() path in shstk_alloc_thread_stack()?
> 
> It looks like we do actually have the new mm in the process before we
> call copy_thread() so we could move things into there though we'd loose
> a small bit of factoring out of the error handling (at one point I had
> more code factored out but right now it's quite small, looking again we
> could also factor out the get_task_mm()/mput()).  ISTR having the new
> process' mm was the biggest reason for this initially but looking again
> I'm not sure why that was.  It does still feel like even the small
> amount that's factored out currently is useful though, a bit less
> duplication in the architecture code which feels welcome here.

I think you can probably keep this. My comment was based on the
assumption that we only support the CLONE_VM case where we wouldn't need
the access_remote_vm(), just some direct write similar to
write_user_shstk_64().

I still think we should have limited this ABI to the CLONE_VM and
!CLONE_VFORK cases but I don't have a strong view if the consensus was
to allow it for classic fork() and vfork() like uses (I just think they
won't be used).

-- 
Catalin