[kvm-arm] Big penalty on 2nd stage TLB misses.

I was interested in measuring the impact of second-stage TLB misses on the guest, so I ran the experiment described below. The results lead me to conclude that second-stage TLB misses are very expensive on KVM-ARM.

 

Summary

If I take a TLB miss on the host, the penalty is about 15 cycles per miss. However, the same TLB miss in the guest costs about 450 cycles. The only difference between the host and the guest is the second-stage translation table walk, which leads me to believe that second-stage TLB misses, i.e. the second-stage table walk, are a very expensive affair. I would appreciate any insights on my analysis.
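To put a rough number on why the nested walk is heavier (my own back-of-the-envelope reasoning, assuming 3-level LPAE-style tables at both stages; the real level counts depend on the configuration): every table descriptor fetched by the stage-1 walk is an IPA and must itself be translated through stage 2, so a fully cold combined walk touches many more memory locations than a host-only walk:

/*
 * Back-of-the-envelope only: worst-case memory accesses for a fully
 * cold combined stage-1/stage-2 walk, assuming 3 levels per stage.
 * Each stage-1 descriptor fetch needs its own stage-2 walk plus the
 * fetch itself, and the final output IPA needs one more stage-2 walk.
 */
#include <stdio.h>

int main(void)
{
    int s1 = 3, s2 = 3;                     /* levels per stage (assumed) */
    int host_walk  = s1;                    /* host: stage-1 only         */
    int guest_walk = s1 * (s2 + 1) + s2;    /* guest: nested walk         */

    printf("host walk:  %d memory accesses\n", host_walk);   /* 3  */
    printf("guest walk: %d memory accesses\n", guest_walk);  /* 15 */
    return 0;
}

Even that worst case is only about 5x the memory traffic of the host walk, so extra traffic alone does not explain a ~30x per-miss penalty, which is part of why I suspect the walk attributes (see the VTCR note below).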

 

Experiment setup:

I have a kernel module that accesses elements of an array in a loop. The array elements are accessed with a stride of 4 KB so that each access touches a different page and therefore requires a different TLB entry. I then use performance counters to measure the cycles consumed by this loop.

 

Before the test loop executes, I run a warm-up loop to warm the cache (and TLB). Then I optionally flush the TLB and execute the test loop. I run the test twice, once with a flushed TLB and once without, and the delta between the two runs should give me the cost of the TLB misses. Note that the cache is still warm in both cases. The code is included below for reference.

 

Results:

On the host, running with a warm TLB: ~2300 cycles.

On the host, running with a cold TLB: ~3800 cycles. There are 100 TLB misses for the 100 accesses, so the per-miss penalty is (3800 - 2300) / 100 = 15 cycles. This seems reasonable.

 

 

On the guest, running with a warm TLB: ~2300 cycles.

On the guest, running with a cold TLB: ~48000 cycles. Again there are 100 TLB misses, so the per-miss penalty works out to (48000 - 2300) / 100 ≈ 450 cycles.

 

There must be something seriously wrong. One possible place to look is the VTCR configuration regarding the cacheability of translation table walks. Can someone more familiar with that part of the code check its settings to see that everything is turned on correctly?
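To be concrete about which settings I mean: the IRGN0/ORGN0/SH0 fields of VTCR control the cacheability and shareability of the stage-2 table walk, and if they are left at 0 (non-cacheable) every walk step goes all the way to memory. The snippet below is only meant to show which bits to look at; the bit positions are taken from the ARMv7 ARM, the register is only accessible at PL2 (HYP mode), and this is not the actual kvm-arm code:

#include <linux/kernel.h>

/* Illustrative only: dump the VTCR stage-2 walk attributes.
 * VTCR is CP15 c2/c1, opc1=4, opc2=2 and is only accessible at PL2,
 * so treat this as a pointer to what to check, not as something a
 * host module or guest can execute. */
static inline unsigned long read_vtcr(void)
{
	unsigned long vtcr;
	asm volatile("mrc p15, 4, %0, c2, c1, 2" : "=r" (vtcr));
	return vtcr;
}

static void check_vtcr_walk_attrs(void)
{
	unsigned long vtcr = read_vtcr();
	unsigned int irgn0 = (vtcr >> 8)  & 0x3;   /* inner cacheability */
	unsigned int orgn0 = (vtcr >> 10) & 0x3;   /* outer cacheability */
	unsigned int sh0   = (vtcr >> 12) & 0x3;   /* shareability       */

	/* 0b01 = write-back write-allocate; 0b00 = non-cacheable, which
	 * would make every stage-2 walk step a memory access. */
	printk(KERN_INFO "VTCR: IRGN0=%u ORGN0=%u SH0=%u\n", irgn0, orgn0, sh0);
}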

 

I also wrote a user-space test program (attached; included below) that accesses memory with long strides, and the results indicate very similar behaviour: the program runs about 10x slower in the guest than on the host. For example, if you run it with an argument of 1 (long stride), you will notice a 10x time difference between the host and the guest. If you run it with an argument of 0 (short stride), there is no difference. That again points to an inefficient second-stage address translation.

 

Code

 

void mytestfun(int flt)  // flt = flush-TLB flag
{
#define ROWS    100
#define COLS    1024
#define WARM_CT 10

    static volatile int t[ROWS][COLS] __attribute__((aligned(4*1024)));
    volatile int j = 0;
    volatile int i = 0;
    struct counters_t report;

    // flush cache & TLB to begin with
    flush_cache(t, t + ROWS);
    flush_tlb();

    // warm cache & TLB - WARM LOOP
    for (j = 0; j < WARM_CT; j++)
    {
        for (i = 0; i < ROWS; i++)
        {
            t[i][25] = 10;
        }
    }

    // based on the param, keep the TLB warm or flush it
    if (flt == 1)
        flush_tlb();

    start_perfmon();
    // start the test - TEST LOOP
    for (i = 0; i < ROWS; i++)
    {
        t[i][25] = 10;
    }
    stop_perfmon();
    read_counters(&report);
}
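(The helpers above - flush_tlb(), start_perfmon(), stop_perfmon(), read_counters() - are my own wrappers and are not shown. For anyone who wants to reproduce the kernel-side numbers, a minimal sketch of how such helpers can be written on ARMv7 is below: a full TLB invalidate via TLBIALL and the PMU cycle counter via CP15. This is illustrative and simplified - it reads a single raw cycle count instead of filling a counters_t struct - and is not the exact code I ran.)

// Minimal ARMv7 sketch (kernel context, PL1), illustrative only.
static inline void flush_tlb_sketch(void)
{
    asm volatile("mcr p15, 0, %0, c8, c7, 0" : : "r" (0) : "memory");  // TLBIALL
    asm volatile("dsb" ::: "memory");
    asm volatile("isb" ::: "memory");
}

static inline void start_perfmon_sketch(void)
{
    // PMCR: enable counters (bit 0) and reset the cycle counter (bit 2)
    asm volatile("mcr p15, 0, %0, c9, c12, 0" : : "r" (0x5));
    // PMCNTENSET: enable the cycle counter (bit 31)
    asm volatile("mcr p15, 0, %0, c9, c12, 1" : : "r" (1u << 31));
}

static inline unsigned int read_cycles_sketch(void)
{
    unsigned int cycles;
    asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r" (cycles));  // PMCCNTR
    return cycles;
}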

User-space test program:

#include <stdio.h>
#include <stdlib.h>

#define ROWS  4096
#define COLS  4096
#define LOOPS 16384

char *usage_string =
	" Program accesses different elements of a two-d array of size 4096x4096\n"
	" Usage: <program> order\n"
	" order: [0,1]. 0 is row-order: processing is done for the first row; in that row, all elements are processed sequentially\n"
	"               1 is col-order: processing is done for the first col; in that col, all elements are processed sequentially\n"
	"               row-order processes elements in the order of their storage (assuming row-major storage)\n"
	"               col-order processing introduces a stride in accessing elements; the stride is equal to the number of cols in the array\n"
	"               Obviously row-order is cache & TLB friendly and expected to be more efficient\n"
	"               The greater the number of COLS, the greater the stride and therefore the more inefficient in terms of cache / TLB\n";

int main(int argc, char *argv[])
{
	static volatile int data[ROWS][COLS] __attribute__((aligned(4*1024)));
	volatile int i, j, k, temp;

	if (argc != 2)
	{
		printf("Incorrect number of args\n%s\n", usage_string);
		return 1;
	}

	int order = atoi(argv[1]);
	if (order != 0 && order != 1)
	{
		printf("Incorrect first arg %d\n%s\n", order, usage_string);
		return 1;
	}

	int loops = LOOPS;

	if (order == 0)  // row-order processing
	{
		i = 0;
		for (k = 0; k < loops; k++)
		{
			for (j = 0; j < COLS; j++)
			{
				temp = j + k;
				data[i][j] = temp;
			}
		}
	}
	else  // col-order processing
	{
		j = 0;
		for (k = 0; k < loops; k++)
		{
			for (i = 0; i < ROWS; i++)
			{
				temp = i + k;
				data[i][j] = temp;
			}
		}
	}

	return 0;
}
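For reference, plain gcc with no special flags should be enough to reproduce the user-space numbers (the volatile qualifiers keep the stores from being optimized away): e.g. build with "gcc stride.c -o stride" and compare "time ./stride 1" against "time ./stride 0" on the host and in the guest (the file name here is only for illustration).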

_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
