On Tue, Jan 05, 2021 at 08:20:51AM -0800, Andy Lutomirski wrote:
> > On Jan 5, 2021, at 5:26 AM, Will Deacon <will@xxxxxxxxxx> wrote:
> >
> > Sorry for the slow reply, I was socially distanced from my keyboard.
> >
> >> On Mon, Dec 28, 2020 at 04:36:11PM -0800, Andy Lutomirski wrote:
> >> On Mon, Dec 28, 2020 at 4:11 PM Nicholas Piggin <npiggin@xxxxxxxxx> wrote:
> >>>> +static inline void membarrier_sync_core_before_usermode(void)
> >>>> +{
> >>>> +	/*
> >>>> +	 * XXX: I know basically nothing about powerpc cache management.
> >>>> +	 * Is this correct?
> >>>> +	 */
> >>>> +	isync();
> >>>
> >>> This is not about memory ordering or cache management, it's about
> >>> pipeline management. Powerpc's return to user mode serializes the
> >>> CPU (aka the hardware thread, _not_ the core; another wrongness of
> >>> the name, but AFAIKS the HW thread is what is required for
> >>> membarrier). So this is wrong, powerpc needs nothing here.
> >>
> >> Fair enough. I'm happy to defer to you on the powerpc details. In
> >> any case, this just illustrates that we need feedback from a person
> >> who knows more about ARM64 than I do.
> >
> > I think we're in a very similar boat to PowerPC, fwiw. Roughly speaking:
> >
> >  1. SYNC_CORE does _not_ perform any cache management; that is the
> >     responsibility of userspace, either by executing the relevant
> >     maintenance instructions (arm64) or a system call (arm32). Crucially,
> >     the hardware will ensure that this cache maintenance is broadcast
> >     to all other CPUs.
>
> Is this guaranteed regardless of any aliases? That is, if I flush from
> one CPU at one VA and then execute the same physical address from another
> CPU at a different VA, does this still work?

The data side will be fine, but the instruction side can have virtual
aliases. We handle this in flush_ptrace_access() by blowing away the whole
I-cache if we're not physically-indexed, but userspace would be in trouble
if it wanted to handle this situation alone.

> >  2. Even with all the cache maintenance in the world, a CPU could have
> >     speculatively fetched stale instructions into its "pipeline" ahead
> >     of time, and these are _not_ flushed by the broadcast maintenance
> >     instructions in (1). SYNC_CORE provides a means for userspace to
> >     discard these stale instructions.
> >
> >  3. The context synchronization event on exception entry/exit is
> >     sufficient here. The Arm ARM isn't very good at describing what it
> >     does, because it's in denial about the existence of a pipeline, but
> >     it does have snippets such as:
> >
> >     (s/PE/CPU/)
> >     | For all types of memory:
> >     | The PE might have fetched the instructions from memory at any time
> >     | since the last Context synchronization event on that PE.
> >
> > Interestingly, the architecture recently added a control bit to remove
> > this synchronisation from exception return, so if we set that then we'd
> > have a problem with SYNC_CORE and adding an ISB would be necessary (and
> > we could probably then make kernel->kernel returns cheaper, but I
> > suspect we're relying on this implicit synchronisation in other places
> > too).
>
> Is ISB just a context synchronization event or does it do more?

That's a good question. Barrier instructions on ARM do tend to get
overloaded with extra behaviours over time, so it could certainly end up
doing the context synchronization event + extra stuff in future.
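To make (1)-(3) concrete, the userspace side of the story today already
ends in exactly this sort of ISB. A rough sketch in kernel-style C (the
function name is made up, it covers a single cache line rather than
looping over a range, and it's more or less what __builtin___clear_cache
expands to, so treat it as illustrative rather than something to copy):

	/*
	 * Publish one freshly-written instruction at 'addr': a sketch of
	 * the usual arm64 maintenance sequence, not production code.
	 */
	static inline void sync_code_at(unsigned long addr)
	{
		asm volatile("dc	cvau, %0\n"	/* clean D-cache line to PoU */
			     "dsb	ish\n"		/* complete the clean; broadcast */
			     "ic	ivau, %0\n"	/* invalidate I-cache line; broadcast */
			     "dsb	ish\n"		/* complete the invalidate */
			     "isb"			/* context synchronization event */
			     : : "r" (addr) : "memory");
	}

The final ISB only resynchronises the CPU that executes it, though; every
other CPU still needs its own context synchronization event before it can
safely execute the new code, and plugging that gap is exactly what
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE) is for, per (2)
and (3) above.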
Right now, the only extra behaviour that springs to mind is the
spectre-v1 heavy mitigation barrier of 'DSB; ISB' which, for example,
probably doesn't work for 'DSB; ERET' because the ERET can be treated
like a conditional (!) branch.

> On x86, it’s very hard to tell that MFENCE does any more than LOCK, but
> it’s much slower. And we have LFENCE, which, as documented, doesn’t
> appear to have any semantics at all. (Or at least it didn’t before
> Spectre.)

I tend to think of ISB as a front-end barrier relating to instruction
fetch whereas DMB, acquire/release and DSB are all back-end barriers
relating to memory accesses. You _can_ use ISB in conjunction with
control dependencies to order a pair of loads (like you can with ISYNC
on Power), but it's a really expensive way to do it; there's a sketch of
the idiom after my sign-off, for the curious.

> > Are you seeing a problem in practice, or did this come up while trying to
> > decipher the semantics of SYNC_CORE?
>
> It came up while trying to understand the code and work through various
> bugs in it. The code was written using something approximating x86
> terminology, but it was definitely wrong on x86 (at least if you believe
> the SDM, and I haven’t convinced any architects to say otherwise).

Ok, thanks.

Will
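P.S. The ISB + control-dependency idiom, sketched below (made-up function
name, untested). Both arms of the CBZ land on the ISB, so it's the
register dependency carried by the branch, not its direction, that does
the work:

	/* Order two loads with ctrl + ISB; a DMB ISHLD is the sane way. */
	static inline unsigned long load_b_after_a(unsigned long *a,
						   unsigned long *b)
	{
		unsigned long va, vb;

		asm volatile("ldr	%0, [%2]\n"	/* first load */
			     "cbz	%0, 1f\n"	/* branch depends on %0 */
			     "1:	isb\n"		/* fetch stalls until the branch resolves */
			     "ldr	%1, [%3]"	/* second load, now ordered */
			     : "=&r" (va), "=r" (vb)
			     : "r" (a), "r" (b)
			     : "memory");
		return vb;
	}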