On Sun, Aug 14, 2016 at 03:02:20PM +0000, Mathieu Desnoyers wrote: > ----- On Aug 12, 2016, at 9:28 PM, Boqun Feng boqun.feng@xxxxxxxxx wrote: > > > On Fri, Aug 12, 2016 at 06:11:45PM +0000, Mathieu Desnoyers wrote: > >> ----- On Aug 12, 2016, at 12:35 PM, Boqun Feng boqun.feng@xxxxxxxxx wrote: > >> > >> > On Fri, Aug 12, 2016 at 01:30:15PM +0800, Boqun Feng wrote: > >> > [snip] > >> >> > > Besides, do we allow userspace programs do read-only access to the > >> >> > > memory objects modified by do_rseq(). If so, we have a problem when > >> >> > > there are two writes in a do_rseq()(either in the rseq critical section > >> >> > > or in the asm block), because in current implemetation, these two writes > >> >> > > are unordered, which makes the readers outside a do_rseq() could observe > >> >> > > the ordering of writes differently. > >> >> > > > >> >> > > For rseq_finish2(), a simple solution would be making the "final" write > >> >> > > a RELEASE. > >> >> > > >> >> > Indeed, we would need a release semantic for the final store here if this > >> >> > is the common use. Or we could duplicate the "flavors" of rseq_finish2 and > >> >> > add a rseq_finish2_release. We should find a way to eliminate code duplication > >> >> > >> >> I'm in favor of a separate rseq_finish2_release(). > >> >> > >> >> > there. I suspect we'll end up doing macros. > >> >> > > >> >> > >> >> Me too. Lemme have a try ;-) > >> >> > >> > > >> > How about this? Although a little messy, I separated the asm block into > >> > several parts and implemented each part in a arch-diagnose way. > >> > >> I find it rather hard to follow the per-arch assembly with this approach. > >> It might prove to be troublesome if we want to do arch-specific optimizations > >> in the future. > >> > > > > It might be, but I was just trying to kill as much duplicate code as > > possible, because the more duplicate we have, the more maintain effort > > we need. > > > > For example, PPC32 and PPC64 may have the same asm code to check the > > event counter, but different code to do the final store. Having the > > same RSEQ_CHECK_COUNTER() for PPC32 and PPC64 actually makes it easy if > > we come up a way to optimize the counter check code on PPC. > > > > And if some arch wants to have some very specifical optimizations, > > it could always write the whole asm block again rather than use the > > helpers macros. > > Creating macros for each assembly "operation" done in the restartable > sequence ends up requiring that people learn a new custom mini-language, > and implement those macros for each architecture. > > I'd rather prefer to let each architecture maintainer express the > restartable sequence directly in assembly, which is already known to > them, than require them to learn a new small macro-based language. > > Eliminating duplicated code is a goal I agree with, but there are > ways to achieve this which don't end up creating a macro-based custom > mini-language (such as what I proposed below). > Fair point ;-) One more thing, do we want to use arch-specific header files to put arch-specific assembly code? For example, rseq-x86.h, rseq-powerpc.h, etc. This may save readers a lot of time if he or she is only interested in a particular arch, and also make maintaining a little easier(no need to worry about breaking other archs accidentally) [...] > > In terms of fast-path, you would be trading: > > (1) > "ldr r0, %[current_event_counter]\n\t" \ > "mov r1, #0\n\t" > "cmp %[start_event_counter], r0\n\t" \ > "bne %l[failure]\n\t" \ > "str %[to_write_final], [%[target_final]]\n\t" \ > "2:\n\t" \ > "str r1, [%[rseq_cs]]\n\t" \ > for > > (2) > "ldr r0, %[current_event_counter]\n\t" \ > "cmp %[start_event_counter], r0\n\t" \ > "bne %l[failure]\n\t" \ > "str %[to_write_final], [%[target_final]]\n\t" \ > "2:\n\t" \ > "mov r0, #0\n\t" > "str r0, [%[rseq_cs]]\n\t" \ > > Your proposal (2) saves a register (does not clobber r1), but this > is at the expense of a slower fast-path. In (1), loading the constant > 0 is done while the processor is stalled on the current_event_counter > load, which is needed by a following comparison. Therefore, we can > increase instruction-level parallelism by placing the immediate value > 0 load right after the ldr instruction. This, however, requires that > we use a different register than r0, because r0 is already used by the > ldr/cmp instructions. > > Since this is a fast-path, achieving higher instruction throughput > is more important than saving a register. > > I came up with this as an optimization while doing benchmarking > on a ARM32 Cubietruck as a reference architecture. > Nice ;-) Better to put a comment there? I should try to investigate something similar for powerpc. > > > >> + "b 4f\n\t" \ > >> + ".balign 32\n\t" \ > >> + "3:\n\t" \ > >> + ".word 1b, 0x0, 2b, 0x0, l[failure], 0x0, 0x0, 0x0\n\t" \ > >> + "4:\n\t" \ > >> + : /* no outputs */ \ > >> + : [start_event_counter]"r"((_start_value).event_counter), \ > >> + [current_event_counter]"m"((_start_value).rseqp->u.e.event_counter), \ > >> + [to_write_final]"r"(_to_write_final), \ > >> + [target_final]"r"(_target_final), \ > >> + [rseq_cs]"r"(&(_start_value).rseqp->rseq_cs) \ > >> + extra_input \ > >> + RSEQ_INJECT_INPUT \ > >> + : "r0", "r1", "memory", "cc" \ > >> + RSEQ_INJECT_CLOBBER \ > >> + : _failure \ > >> ); > > > > [...] > > > >> + > >> +#define RSEQ_FINISH2_RELEASE_SPECULATIVE_STORE_ASM() \ > >> + RSEQ_FINISH2_SPECULATIVE_STORE_ASM() \ > >> + "dmb\n\t" > >> + > > > > Having a RELEASE barrier here may be OK for all current archs we > > support, but there are archs which rather than have a lightweight > > RELEASE barrier but use a special instruction for RELEASE operations, > > for example, AArch64. Do we need to take that into consideration and > > define a RSEQ_FINISH_ASM_RELEASE() rather than a > > RSEQ_FINISH2_SPECULATIVE_STORE_INPUT_ASM()? > > Good point. We should introduce the barrier before the final > store to fit this scenario. This would also work if we want to > do many speculative stores followed by a final store: it really > makes sense to put the barrier just before the final store rather > than after each speculative store. > > I just pushed a commit in my dev branch implementing this. Testing > is welcome. > Sure, let me play around ;-) Regards, Boqun > Thanks! > > Mathieu > > > > > [...] > > > > Regards > > Boqun > > -- > Mathieu Desnoyers > EfficiOS Inc. > http://www.efficios.com
Attachment:
signature.asc
Description: PGP signature