Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests

Boqun Feng <boqun.feng@xxxxxxxxx> · Mon, 15 Aug 2016 08:56:17 +0800

On Sun, Aug 14, 2016 at 03:02:20PM +0000, Mathieu Desnoyers wrote:
> ----- On Aug 12, 2016, at 9:28 PM, Boqun Feng boqun.feng@xxxxxxxxx wrote:
> 
> > On Fri, Aug 12, 2016 at 06:11:45PM +0000, Mathieu Desnoyers wrote:
> >> ----- On Aug 12, 2016, at 12:35 PM, Boqun Feng boqun.feng@xxxxxxxxx wrote:
> >> 
> >> > On Fri, Aug 12, 2016 at 01:30:15PM +0800, Boqun Feng wrote:
> >> > [snip]
> >> >> > > Besides, do we allow userspace programs do read-only access to the
> >> >> > > memory objects modified by do_rseq(). If so, we have a problem when
> >> >> > > there are two writes in a do_rseq()(either in the rseq critical section
> >> >> > > or in the asm block), because in current implemetation, these two writes
> >> >> > > are unordered, which makes the readers outside a do_rseq() could observe
> >> >> > > the ordering of writes differently.
> >> >> > > 
> >> >> > > For rseq_finish2(), a simple solution would be making the "final" write
> >> >> > > a RELEASE.
> >> >> > 
> >> >> > Indeed, we would need a release semantic for the final store here if this
> >> >> > is the common use. Or we could duplicate the "flavors" of rseq_finish2 and
> >> >> > add a rseq_finish2_release. We should find a way to eliminate code duplication
> >> >> 
> >> >> I'm in favor of a separate rseq_finish2_release().
> >> >> 
> >> >> > there. I suspect we'll end up doing macros.
> >> >> > 
> >> >> 
> >> >> Me too. Lemme have a try ;-)
> >> >> 
> >> > 
> >> > How about this? Although a little messy, I separated the asm block into
> >> > several parts and implemented each part in a arch-diagnose way.
> >> 
> >> I find it rather hard to follow the per-arch assembly with this approach.
> >> It might prove to be troublesome if we want to do arch-specific optimizations
> >> in the future.
> >> 
> > 
> > It might be, but I was just trying to kill as much duplicate code as
> > possible, because the more duplicate we have, the more maintain effort
> > we need.
> > 
> > For example, PPC32 and PPC64 may have the same asm code to check the
> > event counter, but different code to do the final store.  Having the
> > same RSEQ_CHECK_COUNTER() for PPC32 and PPC64 actually makes it easy if
> > we come up a way to optimize the counter check code on PPC.
> > 
> > And if some arch wants to have some very specifical optimizations,
> > it could always write the whole asm block again rather than use the
> > helpers macros.
> 
> Creating macros for each assembly "operation" done in the restartable
> sequence ends up requiring that people learn a new custom mini-language,
> and implement those macros for each architecture.
> 
> I'd rather prefer to let each architecture maintainer express the
> restartable sequence directly in assembly, which is already known to
> them, than require them to learn a new small macro-based language.
> 
> Eliminating duplicated code is a goal I agree with, but there are
> ways to achieve this which don't end up creating a macro-based custom
> mini-language (such as what I proposed below).
> 

Fair point ;-)

One more thing, do we want to use arch-specific header files to put
arch-specific assembly code? For example, rseq-x86.h, rseq-powerpc.h,
etc. This may save readers a lot of time if he or she is only interested
in a particular arch, and also make maintaining a little easier(no need
to worry about breaking other archs accidentally)

[...]
> 
> In terms of fast-path, you would be trading:
> 
> (1)
> 	"ldr r0, %[current_event_counter]\n\t" \
> 	"mov r1, #0\n\t"
> 	"cmp %[start_event_counter], r0\n\t" \
> 	"bne %l[failure]\n\t" \
> 	"str %[to_write_final], [%[target_final]]\n\t" \
> 	"2:\n\t" \
> 	"str r1, [%[rseq_cs]]\n\t" \
> for
> 
> (2)
> 	"ldr r0, %[current_event_counter]\n\t" \
> 	"cmp %[start_event_counter], r0\n\t" \
> 	"bne %l[failure]\n\t" \
> 	"str %[to_write_final], [%[target_final]]\n\t" \
> 	"2:\n\t" \
> 	"mov r0, #0\n\t"
> 	"str r0, [%[rseq_cs]]\n\t" \
> 
> Your proposal (2) saves a register (does not clobber r1), but this
> is at the expense of a slower fast-path. In (1), loading the constant
> 0 is done while the processor is stalled on the current_event_counter
> load, which is needed by a following comparison. Therefore, we can
> increase instruction-level parallelism by placing the immediate value
> 0 load right after the ldr instruction. This, however, requires that
> we use a different register than r0, because r0 is already used by the
> ldr/cmp instructions.
> 
> Since this is a fast-path, achieving higher instruction throughput
> is more important than saving a register.
> 
> I came up with this as an optimization while doing benchmarking
> on a ARM32 Cubietruck as a reference architecture.
> 

Nice ;-) Better to put a comment there?

I should try to investigate something similar for powerpc.

> > 
> >> +		"b 4f\n\t" \
> >> +		".balign 32\n\t" \
> >> +		"3:\n\t" \
> >> +		".word 1b, 0x0, 2b, 0x0, l[failure], 0x0, 0x0, 0x0\n\t" \
> >> +		"4:\n\t" \
> >> +		: /* no outputs */ \
> >> +		: [start_event_counter]"r"((_start_value).event_counter), \
> >> +		  [current_event_counter]"m"((_start_value).rseqp->u.e.event_counter), \
> >> +		  [to_write_final]"r"(_to_write_final), \
> >> +		  [target_final]"r"(_target_final), \
> >> +		  [rseq_cs]"r"(&(_start_value).rseqp->rseq_cs) \
> >> +		  extra_input \
> >> +		  RSEQ_INJECT_INPUT \
> >> +		: "r0", "r1", "memory", "cc" \
> >> +		  RSEQ_INJECT_CLOBBER \
> >> +		: _failure \
> >>  	);
> > 
> > [...]
> > 
> >> +
> >> +#define RSEQ_FINISH2_RELEASE_SPECULATIVE_STORE_ASM() \
> >> +		RSEQ_FINISH2_SPECULATIVE_STORE_ASM() \
> >> +		"dmb\n\t"
> >> +
> > 
> > Having a RELEASE barrier here may be OK for all current archs we
> > support, but there are archs which rather than have a lightweight
> > RELEASE barrier but use a special instruction for RELEASE operations,
> > for example, AArch64. Do we need to take that into consideration and
> > define a RSEQ_FINISH_ASM_RELEASE() rather than a
> > RSEQ_FINISH2_SPECULATIVE_STORE_INPUT_ASM()?
> 
> Good point. We should introduce the barrier before the final
> store to fit this scenario. This would also work if we want to
> do many speculative stores followed by a final store: it really
> makes sense to put the barrier just before the final store rather
> than after each speculative store.
> 
> I just pushed a commit in my dev branch implementing this. Testing
> is welcome.
> 

Sure, let me play around ;-)

Regards,
Boqun

> Thanks!
> 
> Mathieu
> 
> > 
> > [...]
> > 
> > Regards
> > Boqun
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
Attachment:
signature.asc

Description: PGP signature