On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote:
> On Fri, Feb 20, 2015 at 02:37:25PM +0000, Ard Biesheuvel wrote:
> > On 20 February 2015 at 14:29, Andrew Jones <drjones@xxxxxxxxxx> wrote:
> > > So looks like the 3 orders of magnitude greater number of traps
> > > (only to el2) don't impact kernel compiles.
> > >
> >
> > OK, good! That was what I was hoping for, obviously.
> >
> > > Then I thought I'd be able to quick measure the number of cycles
> > > a trap to el2 takes with this kvm-unit-tests test
> > >
> > > int main(void)
> > > {
> > > 	unsigned long start, end;
> > > 	unsigned int sctlr;
> > >
> > > 	asm volatile(
> > > 	"	mrs %0, sctlr_el1\n"
> > > 	"	msr pmcr_el0, %1\n"
> > > 	: "=&r" (sctlr) : "r" (5));
> > >
> > > 	asm volatile(
> > > 	"	mrs %0, pmccntr_el0\n"
> > > 	"	msr sctlr_el1, %2\n"
> > > 	"	mrs %1, pmccntr_el0\n"
> > > 	: "=&r" (start), "=&r" (end) : "r" (sctlr));
> > >
> > > 	printf("%llx\n", end - start);
> > > 	return 0;
> > > }
> > >
> > > after applying this patch to kvm
> > >
> > > diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
> > > index bb91b6fc63861..5de39d740aa58 100644
> > > --- a/arch/arm64/kvm/hyp.S
> > > +++ b/arch/arm64/kvm/hyp.S
> > > @@ -770,7 +770,7 @@
> > >
> > >  	mrs	x2, mdcr_el2
> > >  	and	x2, x2, #MDCR_EL2_HPMN_MASK
> > > -	orr	x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
> > > +//	orr	x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
> > >  	orr	x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA)
> > >
> > >  	// Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap
> > >
> > > But I get zero for the cycle count. Not sure what I'm missing.
> > >
> >
> > No clue tbh. Does the counter work as expected in the host?
> >
>
> Guess not. I dropped the test into a module_init and inserted
> it on the host. Always get zero for pmccntr_el0 reads. Or, if
> I set it to something non-zero with a write, then I always get
> that back - no increments. pmcr_el0 looks OK... I had forgotten
> to set bit 31 of pmcntenset_el0, but doing that still doesn't
> help. Anyway, I assume the problem is me. I'll keep looking to
> see what I'm missing.
>

I returned to this and see that the problem was indeed me. I needed
yet another enable bit set (the filter register needed to be
instructed to count cycles while in el2). I've attached the code
for the curious. The numbers are mean=6999, std_dev=242. Run on
the host, or in a guest running on a host without this patch series
(after TVM traps have been disabled), I get a pretty consistent 40.

I checked how many vm-sysreg traps we do during the kernel compile
benchmark. It's 124924. So it's a bit strange that we don't see the
benchmark taking 10 to 20 seconds longer on average. I should
probably double check my runs. In any case, while I like the
approach of this series, the overhead is looking non-negligible.

drew
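
The overhead estimate above can be redone as plain arithmetic. The sketch below is only illustrative: the three constants are the figures quoted in the message, while the function names and the `assumed_hz` parameter are hypothetical, since the message does not state the clock rate of the test machine.

```c
#include <stdint.h>

/* Figures quoted in the message. */
#define TRAP_COUNT           124924ULL /* vm-sysreg traps during the kernel compile */
#define TRAPPED_WRITE_CYCLES 6999ULL   /* mean cycles for a trapped sctlr_el1 write */
#define NATIVE_WRITE_CYCLES  40ULL     /* cycles for the same write without the trap */

/* Extra cycles spent over the whole benchmark because the writes trap. */
static uint64_t extra_trap_cycles(uint64_t traps, uint64_t trapped,
				  uint64_t native)
{
	return traps * (trapped - native);
}

/*
 * Convert a cycle count to seconds. assumed_hz is an ASSUMPTION supplied
 * by the caller; it is not a value taken from the message.
 */
static double cycles_to_seconds(uint64_t cycles, double assumed_hz)
{
	return (double)cycles / assumed_hz;
}
```

With the quoted figures, `extra_trap_cycles(TRAP_COUNT, TRAPPED_WRITE_CYCLES, NATIVE_WRITE_CYCLES)` comes to roughly 8.7e8 cycles. How much wall-clock time that represents depends entirely on the clock rate passed to `cycles_to_seconds()`.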
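#include <libcflat.h>

/*
 * Reset and enable the PMU cycle counter:
 *   pmovsclr_el0   bit 31: clear the cycle counter overflow flag
 *   pmccfiltr_el0  bit 27: also count cycles spent in el2
 *   pmcntenset_el0 bit 31: enable the cycle counter
 *   pmcr_el0       bit 6 (long cycle counter), bit 2 (reset the
 *                  cycle counter), bit 0 (enable)
 */
static void prep_cc(void)
{
	asm volatile(
	"	msr	pmovsclr_el0, %0\n"
	"	msr	pmccfiltr_el0, %1\n"
	"	msr	pmcntenset_el0, %2\n"
	"	msr	pmcr_el0, %3\n"
	"	isb\n"
	:
	: "r" (1UL << 31), "r" (1UL << 27), "r" (1UL << 31),
	  "r" (1UL << 6 | 1UL << 2 | 1UL << 0));
}

int main(void)
{
	unsigned long start, end;
	unsigned long sctlr;	/* mrs/msr operate on 64-bit registers */
	int i, zeros = 0;

	asm volatile("mrs %0, sctlr_el1" : "=&r" (sctlr));

	prep_cc();

	for (i = 0; i < 100000; ++i) {
		/* Time a write of the unchanged value back to sctlr_el1. */
		asm volatile(
		"	mrs	%0, pmccntr_el0\n"
		"	msr	sctlr_el1, %2\n"
		"	mrs	%1, pmccntr_el0\n"
		"	isb\n"
		: "=&r" (start), "=&r" (end) : "r" (sctlr));

		if ((i % 10) == 0)
			printf("\n");
		printf(" %ld", (long)(end - start));

		/* A zero delta suggests the counter was not counting; re-arm it. */
		if ((end - start) == 0) {
			++zeros;
			prep_cc();
		}
	}

	printf("\nnum zero counts = %d\n", zeros);
	return 0;
}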
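
The attached test prints each delta rather than summarizing them, and the message does not say how the mean and std_dev were obtained. As a rough sketch of one way to do it, the helper below assumes the deltas have been collected into an array (the `samples` buffer, `struct cycle_stats`, `isqrt64()` and `summarize()` are all hypothetical names, not part of the original test). It uses integer arithmetic only, so results are rounded down to whole cycles, and it computes the population standard deviation (dividing by n).

```c
#include <stddef.h>
#include <stdint.h>

struct cycle_stats {
	uint64_t mean;		/* rounded down to whole cycles */
	uint64_t std_dev;	/* population std dev, rounded down */
};

/* Integer square root: largest r such that r * r <= v. */
static uint64_t isqrt64(uint64_t v)
{
	uint64_t r = 0;
	uint64_t bit = (uint64_t)1 << 62;

	while (bit > v)
		bit >>= 2;
	while (bit != 0) {
		if (v >= r + bit) {
			v -= r + bit;
			r = (r >> 1) + bit;
		} else {
			r >>= 1;
		}
		bit >>= 2;
	}
	return r;
}

/* Two passes: first the mean, then the sum of squared deviations. */
static struct cycle_stats summarize(const unsigned long *samples, size_t n)
{
	struct cycle_stats s = { 0, 0 };
	uint64_t sum = 0, sq_dev = 0;
	size_t i;

	if (n == 0)
		return s;

	for (i = 0; i < n; ++i)
		sum += samples[i];
	s.mean = sum / n;

	for (i = 0; i < n; ++i) {
		int64_t d = (int64_t)samples[i] - (int64_t)s.mean;

		sq_dev += (uint64_t)(d * d);
	}
	s.std_dev = isqrt64(sq_dev / n);
	return s;
}
```

In the test above this would mean storing `end - start` into an array inside the loop and calling `summarize()` once after it. Whether zero deltas should be included in the summary is a judgment call that the message leaves open.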