On Fri, Jan 13, 2017 at 10:12:36AM -0500, Christopher Covington wrote: > On 01/12/2017 11:58 AM, Will Deacon wrote: > > On Wed, Jan 11, 2017 at 09:41:16AM -0500, Christopher Covington wrote: > >> +#define __tlbi_asm_dsb(as, op, attr, ...) do { \ > >> + __TLBI_FOR(op, ##__VA_ARGS__) \ > >> + asm (__TLBI_INSTR(op, ##__VA_ARGS__) \ > >> + __TLBI_IO(op, ##__VA_ARGS__)); \ > >> + asm volatile ( as "\ndsb " #attr "\n" \ > >> + : : : "memory"); } while (0) > >> + > >> +#define __tlbi_dsb(...) __tlbi_asm_dsb("", ##__VA_ARGS__) > > > > I can't deny that this is cool, but ultimately it's completely unreadable. > > What I was thinking you'd do would be make __tlbi expand to: > > > > tlbi > > dsb > > tlbi > > dsb > > > > for Falkor, and: > > > > tlbi > > nop > > nop > > nop > > > > for everybody else. > > Thanks for the suggestion. So would __tlbi take a dsb sharability argument in > your proposal? Or would it be communicated in some other fashion, maybe inferred > from the tlbi argument? Or would the workaround dsbs all be the worst/broadest > case? I think always using inner-shareable should be ok. If you wanted to optimise this, you'd want to avoid the workaround altogether for non-shareable invalidation, but that's fairly rare and I doubt you'd be able to measure the impact. > > Wouldn't that localise this change sufficiently that you wouldn't need > > to change all the callers and encode the looping in your cpp macros? > > > > I realise you get an extra dsb in some places with that change, but I'd > > like to see numbers for the impact of that on top of the workaround. If > > it's an issue, then an alternative sequence would be: > > > > tlbi > > dsb > > tlbi > > > > and you'd rely on the existing dsb to complete that. > > > > Having said that, I don't understand how your current loop code works > > when the workaround is applied. AFAICT, you end up emitting something > > like: > > > > dsb ishst > > for i in 0 to n > > tlbi va+i > > dsb > > tlbi va+n > > dsb > > > > which looks wrong to me. Am I misreading something here? > > You're right, I am off by 1 << (PAGE_SHIFT - 12) here. I would need to > increment, compare, not take the loop branch (regular for loop stuff), > then decrement (missing) and perform TLB invalidation again (present but > using incorrect value). It also strikes me as odd that you only need one extra TLBI after the loop has finished, as opposed to a tlbi; dsb; tlbi loop body (which is what you'd get if you modified __tlbi as I suggest). Is it sufficient to have one extra TLBI after the loop and, if so, is the performance impact of my suggestion therefore unacceptable? Will _______________________________________________ kvmarm mailing list kvmarm@xxxxxxxxxxxxxxxxxxxxx https://lists.cs.columbia.edu/mailman/listinfo/kvmarm