Re: [kvm-arm] Big penalty on 2nd stage TLB misses.

On Wed, Jan 23, 2013 at 6:11 AM, Sundaram, Senthilkumar
<ssundara@xxxxxxxxxxxxxxxx> wrote:
> I was interested in measuring the impact of second stage TLB misses for the
> guest, so I ran an experiment, described below, which leads me to conclude
> that second stage TLB misses are very expensive on KVM-ARM.
>
>

Hi,

First off, thanks for sharing your results. It's really good to see
people running experiments on this system.

>
> Summary
>
> If I have a TLB miss on the host, the penalty is about 15 cycles per miss.
> However, if I have the same TLB miss in the guest, the penalty is about 450
> cycles. The only difference between the host and the guest is the 2nd stage
> translation table walk, so this leads me to believe that 2nd stage TLB
> misses / the 2nd stage table walk are a very expensive affair. I appreciate
> any insights on my analysis.
>
>
>
> Experiment setup:
>
> I have a kernel module that accesses elements of an array in a loop. The
> array elements are accessed with a stride of 4k so that each access hits a
> different page and therefore requires a different TLB entry. I then use
> performance counters to measure the cycles consumed by this loop.
>
>
>
> Before the execution of the test loop, I have a warm-up loop to warm the
> cache (and TLB). Then I optionally flush the TLB and execute my test loop.
> I run the test twice, once with a flushed TLB and once with no flush. The
> delta between these two runs should give me the effect of the TLB misses.
> Note that the cache is still warm. The code is included below for your
> reference.
>
>
>
> Results:
>
> On the host, running it with a warm TLB: ~2300 cycles.
>
> On the host, running it with a cold TLB: ~3800 cycles. There are 100 TLB
> misses due to the 100 accesses, so each TLB miss penalty is about 15
> cycles. This seems reasonable.
>

this only seems reasonable if we hit the cache for the page table walks
though, right?
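
Quick back-of-the-envelope check (assuming a stage-1 walk costs two or three
memory accesses per miss and a DRAM access costs on the order of 100+ cycles;
both numbers are my guesses, not measurements):

    100 misses * 15 cycles/miss           ~  1,500 extra cycles (what you measured)
    100 misses * 2-3 accesses * ~100 cyc  ~ 20,000-30,000 extra cycles
                                            (if the walks all went to DRAM)

so the host numbers only add up if the walker is hitting in the L1/L2 caches.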

>
>
>
>
> On the guest, running it with a warm TLB: ~2300 cycles.
>
> On the guest, running it with a cold TLB: ~48000 cycles. There are 100 TLB
> misses, so each TLB miss penalty works out to ~450 cycles.
>
>
>
> There must be something seriously wrong. One possible place to look is the
> VTCR configuration regarding the cacheability of translation table entries.
> Can someone more familiar with that part of the code check its settings to
> see if everything is configured correctly?
>

I checked VTCR, and we set the cacheability fields to those of the
TTBCR on the host.
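
For the archives, here is my shorthand for the VTCR fields in question (the
names below are mine, and the exact bit positions should be double-checked
against the ARM ARM and the defines in the code):

    /* stage-2 walk attributes in VTCR (LPAE-style layout) -- my shorthand */
    #define S2_IRGN0_SHIFT    8   /* inner cacheability for stage-2 walks, bits [9:8]   */
    #define S2_ORGN0_SHIFT   10   /* outer cacheability for stage-2 walks, bits [11:10] */
    #define S2_SH0_SHIFT     12   /* shareability for stage-2 walks,       bits [13:12] */
    #define S2_RGN_WBWA     0x1   /* normal memory, write-back write-allocate cacheable */
    #define S2_SH_INNER     0x3   /* inner shareable */

As mentioned, these get programmed to match the host's TTBCR settings.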

While your conclusion that something is seriously wrong may be true,
it may also not be.

For reference, check out this ASPLOS article on the subject:
http://www.cs.columbia.edu/~cdall/p26-bhargava.pdf

They show that a single TLB miss can take up to 24 page table entry lookups,
which is surprising, but nevertheless true (the numbers look slightly
different on ARM with only 3 levels of stage-1 page tables, but still).
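
To put a number on it: with an n-level stage-1 table and an m-level stage-2
table, a worst-case nested walk costs roughly

    n * (m + 1) + m  memory accesses,

because every stage-1 descriptor fetch is itself a guest physical address
that needs its own stage-2 walk, plus the final stage-2 walk of the resulting
address. For x86-64 with 4+4 levels that gives 4*5 + 4 = 24; for something
like a 3-level LPAE stage-1 with a 3-level stage-2 it would be 3*4 + 3 = 15.
So a guest TLB miss costing an order of magnitude more than a host one is not
crazy by itself, as long as most of those accesses hit in the caches.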

Now, I don't remember if the ARM ARM states anything about the
organization of TLB entries wrt. stage-2 translations. Also, I don't
know the underlying hardware of your experiments.

One thing that could be interesting would be to look at the hardware
perf event counters to see if we can verify that the slowdown comes from
more L1/L2 cache misses... I'm not sure, however, whether the counters
would count e.g. a non-cacheable page table walk access as a cache miss.
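
Something along these lines (a rough, untested userspace sketch using
perf_event_open() with the generic cache events; it is not your module, and
I haven't checked which of these events the PMU on your hardware actually
implements) could be used to compare cycle and L1D-refill counts for the
same strided loop on the host and in the guest:

/* strided-access test with perf counters -- rough sketch, not tested */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

#define PAGES  1024                     /* enough pages to overflow a typical TLB */
#define STRIDE (4096 / sizeof(int))     /* one access per 4K page */

static int open_counter(uint32_t type, uint64_t config)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    static volatile int buf[PAGES * STRIDE];
    uint64_t cycles, l1d_misses;
    int i, fd_cyc, fd_l1m;

    fd_cyc = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
    fd_l1m = open_counter(PERF_TYPE_HW_CACHE,
                          PERF_COUNT_HW_CACHE_L1D |
                          (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                          (PERF_COUNT_HW_CACHE_RESULT_MISS << 16));
    if (fd_cyc < 0 || fd_l1m < 0) {
        perror("perf_event_open");
        return 1;
    }

    /* touch every page once to fault it in and warm the caches */
    for (i = 0; i < PAGES; i++)
        buf[i * STRIDE] = 1;

    ioctl(fd_cyc, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(fd_l1m, PERF_EVENT_IOC_ENABLE, 0);

    for (i = 0; i < PAGES; i++)         /* measured loop: one load per page */
        (void)buf[i * STRIDE];

    ioctl(fd_cyc, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd_l1m, PERF_EVENT_IOC_DISABLE, 0);

    if (read(fd_cyc, &cycles, sizeof(cycles)) != sizeof(cycles) ||
        read(fd_l1m, &l1d_misses, sizeof(l1d_misses)) != sizeof(l1d_misses)) {
        perror("read");
        return 1;
    }

    printf("cycles=%llu l1d-read-misses=%llu\n",
           (unsigned long long)cycles, (unsigned long long)l1d_misses);
    return 0;
}

If the guest shows the same cycle blow-up without a corresponding jump in
L1D refills, that would hint at the walk accesses themselves (or at them
being treated as non-cacheable) rather than at ordinary data cache misses,
though as noted I'm not sure how the PMU attributes walk accesses.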

>
>
> I also wrote a user space test program (attached) that accesses memory with
> long strides, and the results indicate very similar behavior: the program
> runs about 10x slower in the guest than on the host. For example, if you
> run the attached program with an argument of 1 (long stride size), then
> you will notice a 10x time difference between the host and the guest. If
> you run the program with an argument of 0 (short stride size), then there
> will be no difference. That again points to an inefficient 2nd stage
> address translation.
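
For anyone reading this in the archive without the attachment: I imagine the
test looks roughly like the sketch below, with an argument of 1 selecting a
page-sized stride and 0 a cache-line-sized one. This is my own reconstruction,
not the attached program.

/* stride test -- my reconstruction, not the attached program */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_INTS (64 * 1024 * 1024 / sizeof(int))   /* 64 MB buffer */

int main(int argc, char *argv[])
{
    static volatile int buf[BUF_INTS];
    /* argv[1] == "1": stride of one 4K page; otherwise one cache line */
    size_t stride = (argc > 1 && atoi(argv[1]) == 1) ? 4096 / sizeof(int)
                                                     : 64 / sizeof(int);
    struct timespec t0, t1;
    size_t i;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < BUF_INTS; i += stride)
        buf[i] = 1;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("%.3f ms\n", (t1.tv_sec - t0.tv_sec) * 1e3 +
                        (t1.tv_nsec - t0.tv_nsec) / 1e6);
    return 0;
}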
>
>
>
> Code
>
>
>
> void mytestfun(int flt)   /* flt = flush-TLB flag */
> {
>     #define ROWS 100
>     #define COLS 1024
>     #define WARM_CT 10
>
>     static volatile int t[ROWS][COLS] __attribute__((aligned(4*1024)));
>     volatile int j = 0;
>     volatile int i = 0;
>     struct counters_t report;
>
>     /* flush cache & TLB to begin with */
>     flush_cache(t, t + ROWS);
>     flush_tlb();
>
>     /* warm cache & TLB - WARM LOOP */
>     for (j = 0; j < WARM_CT; j++) {
>         for (i = 0; i < ROWS; i++) {
>             t[i][25] = 10;
>         }
>     }
>
>     /* based on the param, keep the TLB warm or flush it */
>     if (flt == 1)
>         flush_tlb();
>
>     start_perfmon();
>     /* start the test - TEST LOOP */
>     for (i = 0; i < ROWS; i++) {
>         t[i][25] = 10;
>     }
>     stop_perfmon();
>     read_counters(&report);
> }

_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm


