Re: [kvm-arm] Big penalty on 2nd stage TLB misses.

> -----Original Message-----
> From: Christoffer Dall [mailto:c.dall@xxxxxxxxxxxxxxxxxxxxxx]
> Sent: Wednesday, January 23, 2013 10:00 PM
> To: Sundaram, Senthilkumar
> Cc: kvmarm@xxxxxxxxxxxxxxxxxxxxx; Alexander Spyridakis
> (a.spyridakis@xxxxxxxxxxxxxxxxxxxxxx); Valiaparambil, Arun
> Subject: Re: [kvm-arm] Big penalty on 2nd stage TLB misses.
> 
> On Wed, Jan 23, 2013 at 6:11 AM, Sundaram, Senthilkumar
> <ssundara@xxxxxxxxxxxxxxxx> wrote:
> > I was interested in measuring the impact of second stage TLB misses
> > for the guest. So I ran an experiment, described below, that lets me
> > conclude that second stage TLB misses are very expensive on KVM-ARM.
> >
> >
> 
> Hi,
> 
> First off thanks for sharing your results. It's really good to see that people are
> running experiments on this system.
> 
> >
> > Summary
> >
> > If I have a TLB miss on the host, the penalty is about 15 cycles per miss.
> > However, if I have the same TLB miss on the guest, the penalty is about
> > 450 cycles. The only difference between the host & guest is the 2nd
> > stage translation table walk, so it leads me to believe that the 2nd
> > stage TLB miss / 2nd stage table walk is a very expensive affair. I
> > appreciate any insights on my analysis.
> >
> >
> >
> > Experiment setup:
> >
> > I have a kernel module that accesses elements of an array in a loop. The
> > array elements are accessed with a stride of 4k so that each access is
> > in a different page and therefore requires a different TLB entry. I
> > then use performance counters to measure the cycles consumed by this
> > loop.
> >
> >
> >
> > Before the execution of the test loop, I have a warm-up loop to warm
> > the cache (and TLB). Then I optionally flush the TLB and execute my
> > test loop. I run the test twice, once with a flushed TLB and once
> > without. The delta between these two runs should give me the cost of
> > the TLB misses. Note that the cache is still warm. The code is
> > included below for your reference.
> >
> >
> >
> > Results:
> >
> > On the host, running it with a warm TLB takes ~2300 cycles.
> >
> > On the host, running it with a cold TLB takes ~3800 cycles. There are 100
> > TLB misses due to the 100 accesses, so each TLB miss penalty is about
> > 15 cycles. This seems reasonable.
> >
> 
> this only seems reasonable if we hit the cache for the page table walks
> though, right?
> 
> >
> >
> >
> >
> > On the guest, running it with a warm TLB takes ~2300 cycles.
> >
> > On the guest, running it with a cold TLB takes ~48000 cycles. There are 100
> > TLB misses, so each TLB miss penalty works out to ~450 cycles.
> >
> >
> >
> > There must be something seriously wrong. One possible place to look is
> > the VTCR configuration regarding the cacheability of translation table
> > entries. Can someone more familiar with that part of the code check
> > its settings to see if everything is turned on correctly?
> >
> 
> I checked VTCR, and we set the cacheability fields to those of the TTBCR on
> the host.
> 
> While your conclusion that something is seriously wrong may be true, it may
> also not be.
> 
> For reference, check out this ASPLOS article on the subject:
> http://www.cs.columbia.edu/~cdall/p26-bhargava.pdf
> 
> They argue that a single miss can take 24 page table entry lookups, which is
> surprising, but nevertheless true (the numbers look slightly different on ARM
> with only 3 levels of stage-1 page tables, but still).
> 
> Now, I don't remember if the ARM ARM states anything about the
> organization of TLB entries wrt. stage-2 translations. Also, I don't know the
> underlying hardware of your experiments.
> 
> One thing that could be interesting would be to look at the hardware perf
> event counters to see if we can verify whether the slowdown is due to more
> l1/l2 cache misses... I'm not sure, however, if the counters would count e.g. a
> non-cacheable page walk lookup as a cache miss.

[[ss]]  Hi Chris,

We are using the Versatile Express with A15 & A7 - the same setup that Virtual Open Systems published a guide on.

We are in fact collecting other performance event counters along with cycle counts as part of this experiment. These include L1 data misses, L2 data misses & L1 TLB misses. Because of the warm-up stage there are no L2 data misses, and there is one L1 data miss & one L1 TLB miss for every data access. I am almost certain there is one L2 TLB miss for every access as well, but there is no counter to measure that. We flush the TLB before we run the test loop.
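
For reference, a minimal sketch of how such event counters can be programmed directly on an ARMv7 core (A15/A7) from kernel code via cp15. The start_perfmon()/read_counters() helpers and struct counters_t used in the module quoted further down are not shown in this thread, so the following is only an assumption of what they might build on, not the actual implementation; the event numbers are the ARMv7 common events (0x03 = L1D cache refill, 0x05 = L1D TLB refill, 0x17 = L2D cache refill).

#include <linux/types.h>

/* Hypothetical PMU helpers; assumes the PMU has already been enabled by
 * setting PMCR.E, and that this code runs at PL1 (e.g. in a kernel module). */
static inline u32 read_pmccntr(void)
{
        u32 val;

        asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r" (val)); /* PMCCNTR: cycle count */
        return val;
}

static void setup_event_counter(u32 idx, u32 event)
{
        asm volatile("mcr p15, 0, %0, c9, c12, 5" : : "r" (idx));       /* PMSELR: select counter */
        asm volatile("mcr p15, 0, %0, c9, c13, 1" : : "r" (event));     /* PMXEVTYPER: event to count */
        asm volatile("mcr p15, 0, %0, c9, c12, 1" : : "r" (1u << idx)); /* PMCNTENSET: enable it */
}

static inline u32 read_event_counter(u32 idx)
{
        u32 val;

        asm volatile("mcr p15, 0, %0, c9, c12, 5" : : "r" (idx)); /* PMSELR: select counter */
        asm volatile("mrc p15, 0, %0, c9, c13, 2" : "=r" (val));  /* PMXEVCNTR: its value */
        return val;
}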

It seems to me that most of the penalty comes from the stage 2 translation overhead, because all the performance counter numbers (L1/L2 cache misses, L1 TLB misses) are identical between the host & the guest. However, when the TLB is flushed the host slows down very little, whereas the guest slows down a lot with the same flush. If the slowdown were due to cache misses, the host would have slowed down as well; recall that in our experiment we keep the cache warm and only flush the TLB. The only thing different between the host & guest that could cause the additional slowdown is the stage 2 translation.
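
For context, a rough count of the memory accesses in a combined stage 1 / stage 2 walk backs this up (it is also the calculation behind the 24-lookup figure in the ASPLOS paper referenced above): with n stage-1 levels and m stage-2 levels, each of the n stage-1 descriptor fetches is an IPA that itself needs an m-level stage-2 walk, and the final data address needs one more stage-2 walk. A small sketch of that arithmetic, assuming every descriptor fetch misses the TLB and any walk caches:

/* Worst-case memory accesses for a nested (2-D) page table walk,
 * assuming a completely cold TLB and no walk caches. */
static int nested_walk_accesses(int s1_levels, int s2_levels)
{
        /* one stage-2 walk per stage-1 descriptor fetch, the stage-1
         * fetches themselves, and a final stage-2 walk for the data IPA */
        return s1_levels * s2_levels + s1_levels + s2_levels;
}

/* nested_walk_accesses(4, 4) == 24  -- the x86 figure from the paper
 * nested_walk_accesses(3, 3) == 15  -- 3-level LPAE stage 1 and stage 2 */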

To your other query: as per the ARM ARM, a page-walk lookup is not counted as a cache miss; it is counted only as a TLB miss.

I will go through the paper that you suggested as well. 
> 
> >
> >
> > I also wrote a user space test program (attached) that accesses memory
> > with long strides, and the results indicate very similar behavior.
> > Running the program is 10x slower in the guest compared to the host. For
> > example, if you run the attached program with an argument of 1 (long
> > stride size), you will notice a 10x time difference between the
> > host & guest. If you run the program with an argument of 0 (short
> > stride size), there will be no difference. That again points to
> > an inefficient 2nd stage address translation.
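
The attachment is not reproduced in the archive; below is a minimal sketch of that kind of user space stride test. The array size, timing method and exact strides are assumptions for illustration, not the poster's actual program.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define NELEMS (16 * 1024 * 1024)               /* 64 MB of ints */

static int buf[NELEMS];

int main(int argc, char **argv)
{
        /* argv[1] == "1": touch one int per 4 KB page; "0": touch consecutive ints */
        size_t stride = (argc > 1 && atoi(argv[1]) == 1) ? 1024 : 1;
        struct timespec t0, t1;
        volatile int sink = 0;
        size_t i;

        memset(buf, 1, sizeof(buf));            /* fault the pages in up front */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < NELEMS; i += stride)
                sink += buf[i];                 /* one load per iteration */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("%.3f ms\n", (t1.tv_sec - t0.tv_sec) * 1e3 +
                            (t1.tv_nsec - t0.tv_nsec) / 1e6);
        return sink & 0;                        /* keep the loop from being optimized out */
}

Running the same binary with the same argument on the host and in the guest, and comparing the two times, is the comparison described above.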
> >
> >
> >
> > Code
> >
> >
> >
> > void mytestfun(int flt)  // flt = flush-tlb flag
> > {
> >     #define ROWS 100
> >     #define COLS 1024     // 1024 ints = 4 KB, so each row sits in its own page
> >     #define WARM_CT 10
> >
> >     static volatile int t[ROWS][COLS] __attribute__((aligned (4*1024)));
> >     volatile int j = 0;
> >     volatile int i = 0;
> >     struct counters_t report;
> >
> >     // flush cache & tlb to begin with
> >     flush_cache(t, t + ROWS);
> >     flush_tlb();
> >
> >     // warm cache & tlb - WARM LOOP
> >     for (j = 0; j < WARM_CT; j++) {
> >         for (i = 0; i < ROWS; i++) {
> >             t[i][25] = 10;      // one write per 4 KB page
> >         }
> >     }
> >
> >     // based on the param, keep the tlb warm or flush it
> >     if (flt == 1)
> >         flush_tlb();
> >
> >     start_perfmon();
> >     // start the test - TEST LOOP
> >     for (i = 0; i < ROWS; i++) {
> >         t[i][25] = 10;
> >     }
> >     stop_perfmon();
> >     read_counters(&report);
> > }
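
The flush_cache()/flush_tlb() helpers used in the module above are not included in the thread. A minimal sketch of what they might map to inside an ARM kernel module, using the kernel's own primitives (an assumption, not the poster's actual code):

#include <asm/cacheflush.h>
#include <asm/tlbflush.h>

/* Hypothetical stand-ins for the helpers referenced in mytestfun(). */
static void flush_tlb(void)
{
        local_flush_tlb_all();          /* invalidate all TLB entries on this CPU */
}

static void flush_cache(void *start, void *end)
{
        /* flush_cache_all() is coarser than the (start, end) range suggests,
         * but it is the simplest way to make sure the array lines are evicted */
        flush_cache_all();
}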

_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm

