----- On Aug 14, 2016, at 8:56 PM, Boqun Feng boqun.feng@xxxxxxxxx wrote: > On Sun, Aug 14, 2016 at 03:02:20PM +0000, Mathieu Desnoyers wrote: >> ----- On Aug 12, 2016, at 9:28 PM, Boqun Feng boqun.feng@xxxxxxxxx wrote: >> >> > On Fri, Aug 12, 2016 at 06:11:45PM +0000, Mathieu Desnoyers wrote: >> >> ----- On Aug 12, 2016, at 12:35 PM, Boqun Feng boqun.feng@xxxxxxxxx wrote: >> >> >> >> > On Fri, Aug 12, 2016 at 01:30:15PM +0800, Boqun Feng wrote: >> >> > [snip] >> >> >> > > Besides, do we allow userspace programs do read-only access to the >> >> >> > > memory objects modified by do_rseq(). If so, we have a problem when >> >> >> > > there are two writes in a do_rseq()(either in the rseq critical section >> >> >> > > or in the asm block), because in current implemetation, these two writes >> >> >> > > are unordered, which makes the readers outside a do_rseq() could observe >> >> >> > > the ordering of writes differently. >> >> >> > > >> >> >> > > For rseq_finish2(), a simple solution would be making the "final" write >> >> >> > > a RELEASE. >> >> >> > >> >> >> > Indeed, we would need a release semantic for the final store here if this >> >> >> > is the common use. Or we could duplicate the "flavors" of rseq_finish2 and >> >> >> > add a rseq_finish2_release. We should find a way to eliminate code duplication >> >> >> >> >> >> I'm in favor of a separate rseq_finish2_release(). >> >> >> >> >> >> > there. I suspect we'll end up doing macros. >> >> >> > >> >> >> >> >> >> Me too. Lemme have a try ;-) >> >> >> >> >> > >> >> > How about this? Although a little messy, I separated the asm block into >> >> > several parts and implemented each part in a arch-diagnose way. >> >> >> >> I find it rather hard to follow the per-arch assembly with this approach. >> >> It might prove to be troublesome if we want to do arch-specific optimizations >> >> in the future. >> >> >> > >> > It might be, but I was just trying to kill as much duplicate code as >> > possible, because the more duplicate we have, the more maintain effort >> > we need. >> > >> > For example, PPC32 and PPC64 may have the same asm code to check the >> > event counter, but different code to do the final store. Having the >> > same RSEQ_CHECK_COUNTER() for PPC32 and PPC64 actually makes it easy if >> > we come up a way to optimize the counter check code on PPC. >> > >> > And if some arch wants to have some very specifical optimizations, >> > it could always write the whole asm block again rather than use the >> > helpers macros. >> >> Creating macros for each assembly "operation" done in the restartable >> sequence ends up requiring that people learn a new custom mini-language, >> and implement those macros for each architecture. >> >> I'd rather prefer to let each architecture maintainer express the >> restartable sequence directly in assembly, which is already known to >> them, than require them to learn a new small macro-based language. >> >> Eliminating duplicated code is a goal I agree with, but there are >> ways to achieve this which don't end up creating a macro-based custom >> mini-language (such as what I proposed below). >> > > Fair point ;-) > > One more thing, do we want to use arch-specific header files to put > arch-specific assembly code? For example, rseq-x86.h, rseq-powerpc.h, > etc. This may save readers a lot of time if he or she is only interested > in a particular arch, and also make maintaining a little easier(no need > to worry about breaking other archs accidentally) > > [...] Good point. I wanted to wait until we had enough architectures before doing this, but now that we have x86 32/64, ppc 32/64 and arm 32, it appears to be the right time. Done and pushed. >> >> In terms of fast-path, you would be trading: >> >> (1) >> "ldr r0, %[current_event_counter]\n\t" \ >> "mov r1, #0\n\t" >> "cmp %[start_event_counter], r0\n\t" \ >> "bne %l[failure]\n\t" \ >> "str %[to_write_final], [%[target_final]]\n\t" \ >> "2:\n\t" \ >> "str r1, [%[rseq_cs]]\n\t" \ >> for >> >> (2) >> "ldr r0, %[current_event_counter]\n\t" \ >> "cmp %[start_event_counter], r0\n\t" \ >> "bne %l[failure]\n\t" \ >> "str %[to_write_final], [%[target_final]]\n\t" \ >> "2:\n\t" \ >> "mov r0, #0\n\t" >> "str r0, [%[rseq_cs]]\n\t" \ >> >> Your proposal (2) saves a register (does not clobber r1), but this >> is at the expense of a slower fast-path. In (1), loading the constant >> 0 is done while the processor is stalled on the current_event_counter >> load, which is needed by a following comparison. Therefore, we can >> increase instruction-level parallelism by placing the immediate value >> 0 load right after the ldr instruction. This, however, requires that >> we use a different register than r0, because r0 is already used by the >> ldr/cmp instructions. >> >> Since this is a fast-path, achieving higher instruction throughput >> is more important than saving a register. >> >> I came up with this as an optimization while doing benchmarking >> on a ARM32 Cubietruck as a reference architecture. >> > > Nice ;-) Better to put a comment there? Done. > > I should try to investigate something similar for powerpc. > Yes, you could try clobbering one extra register to move the "li %%r17, 0\n\t" right after the lwz instruction. Depending on the architecture characteristics, it may speed it up a bit. I would expect that benchmarks on older architectures (e.g. old ppc32) might be more affected by such tweak than newer POWER8. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com -- To unsubscribe from this list: send the line "unsubscribe linux-api" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html