Re: [PATCH v10 20/40] arm64/gcs: Ensure that new threads have a GCS

Catalin Marinas <catalin.marinas@xxxxxxx> · Tue, 20 Aug 2024 18:28:34 +0100

On Mon, Aug 19, 2024 at 04:57:08PM +0100, Mark Brown wrote:
> On Mon, Aug 19, 2024 at 01:04:18PM +0100, Catalin Marinas wrote:
> > On Thu, Aug 01, 2024 at 01:06:47PM +0100, Mark Brown wrote:
> > > +static int copy_thread_gcs(struct task_struct *p,
> > > +			   const struct kernel_clone_args *args)
> > > +{
> > > +	unsigned long gcs;
> > > +
> > > +	gcs = gcs_alloc_thread_stack(p, args);
> > > +	if (IS_ERR_VALUE(gcs))
> > > +		return PTR_ERR((void *)gcs);
> 
> > Is 0 an ok value here? I can see further down that
> > gcs_alloc_thread_stack() may return 0.
> 
> Yes, it's fine for a thread not to have a GCS.

OK, so we only get a 0 here if gcs_{base,size} have not been
initialised. Looks fine.
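
For reference, a userspace sketch of the return convention being discussed:
0 means "no GCS, carry on", and only values in the top MAX_ERRNO range count
as errors. The fake_* helpers and the address are made up for illustration,
and the macro is a simplified copy of the kernel's IS_ERR_VALUE(), not the
real header.

```c
#include <assert.h>
#include <errno.h>

#define MAX_ERRNO	4095
/* Simplified userspace copy of the kernel macro: top 4095 values are errors */
#define IS_ERR_VALUE(x)	((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

/* Hypothetical stand-in for gcs_alloc_thread_stack() */
static unsigned long fake_alloc_thread_stack(int have_gcs, int fail)
{
	if (!have_gcs)
		return 0;			/* no GCS set up: not an error */
	if (fail)
		return (unsigned long)-ENOMEM;	/* errno encoded in the value */
	return 0x10000000UL;			/* made-up GCS address */
}

/* Hypothetical stand-in for copy_thread_gcs() */
static int fake_copy_thread_gcs(int have_gcs, int fail)
{
	unsigned long gcs = fake_alloc_thread_stack(have_gcs, fail);

	if (IS_ERR_VALUE(gcs))
		return (int)(long)gcs;		/* propagate the negative errno */

	return 0;				/* both 0 and a real address are fine */
}
```

With this, fake_copy_thread_gcs(0, 0) returns 0 rather than an error, which
is the case the question above was about.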

> > > +	p->thread.gcs_el0_mode = current->thread.gcs_el0_mode;
> > > +	p->thread.gcs_el0_locked = current->thread.gcs_el0_locked;
> 
> > > +	/* Ensure the current state of the GCS is seen by CoW */
> > > +	gcsb_dsync();
> 
> > I don't get this barrier. What does it have to do with CoW, which memory
> > effects is it trying to order?
> 
> Yeah, I can't remember what that's supposed to be protecting.

The GCS memory writes in the parent must indeed be visible in the child,
which could start on a different CPU. So, in principle, we need some form
of ordering similar to the context switch. However, in the case of a
classic fork(), the child won't be started until the PTEs have been made
read-only and a TLBI issued. This would ensure the completion of any GCS
memory accesses in the parent (at least that's my reading of the Arm
ARM).

If we have normal thread creation without CoW, is the parent writing
anything to the stack that the new thread needs to observe? The
map_shadow_stack() call will cause a GCSSTTR and this wouldn't be
ordered with subsequent memory writes. But we already have a GCSB DSYNC
in map_shadow_stack() after put_user_gcs().

My conclusion is that we don't need this barrier.

> > > +	/* Allocate RLIMIT_STACK/2 with limits of PAGE_SIZE..2G */
> > > +	size = PAGE_ALIGN(min_t(unsigned long long,
> > > +				rlimit(RLIMIT_STACK) / 2, SZ_2G));
> > > +	return max(PAGE_SIZE, size);
> > > +}
> 
> > So we still have RLIMIT_STACK/2. I thought we got rid of that and just
> > went with RLIMIT_STACK (or I misremember).
> 
> I honestly can't remember either way, it's quite possible it's changed
> multiple times.  I don't have super strong feelings on the particular
> value here.

The half size looks a lot more arbitrary to me than picking the same
size as the stack. So I'd go with RLIMIT_STACK.
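
As a rough userspace sketch of what the sizing would look like with the /2
dropped (same PAGE_SIZE..2G clamping as the hunk above; the function name
and the explicit page_size parameter are made up so it can be run outside
the kernel):

```c
#include <assert.h>

#define SZ_2G	0x80000000ULL

/*
 * Hypothetical userspace model of the default GCS size: take RLIMIT_STACK
 * as is, cap it at 2G, round up to a page and never go below one page.
 * page_size must be a power of two.
 */
static unsigned long long gcs_size_sketch(unsigned long long rlim_stack,
					  unsigned long long page_size)
{
	unsigned long long size;

	size = rlim_stack < SZ_2G ? rlim_stack : SZ_2G;		/* min_t() */
	size = (size + page_size - 1) & ~(page_size - 1);	/* PAGE_ALIGN() */

	return size > page_size ? size : page_size;		/* max() */
}
```

An unlimited RLIMIT_STACK (all ones) still ends up at 2G, and a limit of 0
ends up at a single page.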

> > > +static bool gcs_consume_token(struct mm_struct *mm, unsigned long user_addr)
> > > +{
> 
> > As per the clone3() thread, I think we should try to use
> > get_user_page_vma_remote() and do a cmpxchg() directly.
> 
> I've left this as is for now, mainly because it keeps the code in line
> with x86 and I can't directly test the x86 code. 

I thought for the clone3() x86 code we'll need the remote vma, so we
have to use the get_user_page_vma_remote() API anyway.

> IIRC we can't just do
> a standard userspace cmpxchg since that will access as though we were at
> EL0 but EL0 doesn't have standard write permission for the page.

Correct, but GUP goes through the kernel mapping, not the user one. So
get_user_page_vma_remote() returns a page and you just do a classic
cmpxchg() at page_address() (plus the offset within the page).
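
A userspace model of the idea, not kernel code: a static array stands in
for the page_address() of the page GUP would return, and a C11
compare-exchange stands in for cmpxchg(). The token encoding, the zeroing
on success and all the fake_*/..._sketch names are assumptions for
illustration only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FAKE_PAGE_SIZE	4096UL

/* Stand-in for the kernel mapping of the page returned by GUP */
static _Atomic uint64_t fake_page[FAKE_PAGE_SIZE / sizeof(uint64_t)];

/* Placeholder token encoding, not the architected cap token format */
static uint64_t fake_token(unsigned long user_addr)
{
	return (uint64_t)user_addr | 1;
}

/* user_addr is assumed to be 8-byte aligned */
static bool consume_token_sketch(unsigned long user_addr)
{
	unsigned long offset = user_addr & (FAKE_PAGE_SIZE - 1);
	_Atomic uint64_t *slot = &fake_page[offset / sizeof(uint64_t)];
	uint64_t expected = fake_token(user_addr);

	/*
	 * Succeeds only if the slot still holds the expected token, in
	 * which case it is replaced with 0 so it cannot be consumed twice.
	 */
	return atomic_compare_exchange_strong(slot, &expected, 0);
}
```

The point is only that the access goes through an ordinary writable
mapping, so no EL0 write permission on the GCS page is involved.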

-- 
Catalin