How much faster? I looked for a bug in the code in case one version was
doing more or less work than you expected, but I didn't find one. The
size of the speedup might give a better hint at the nature of the bug,
assuming there is one.
Assuming no bug, there is one subtle effect that might make the pointer
version a little faster:
First, you should be aware that by accessing 8 million integers
sequentially you are flushing the cache, and the execution time will be
dominated by the memory accesses needed to fill it. Every int in cache
will be updated just once before being written back. That happens per
cache line rather than per individual int, but the ints are still read
from main memory and written to main memory for every update, never
reused while in cache.
The actual instructions used to do the access and update don't matter
much, if at all. Nor does the overhead of the pointers. Anything
happening within the CPU or cache is essentially free compared to the
time needed to read and write main memory.
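
The original listing isn't shown above, so here is a minimal sketch of
the two variants presumably being compared; the identifiers, dimensions,
and loop body are my own guesses rather than the original code, chosen
only so that 200*200*200 = 8 million ints are touched:

    // Hypothetical dimensions giving 8 million ints in total.
    const int X = 200, Y = 200, Z = 200;

    // Contiguous version: one allocation, index arithmetic per element.
    void test_array() {
        int* a = new int[X * Y * Z]();
        for (int i = 0; i < X; ++i)
            for (int j = 0; j < Y; ++j)
                for (int k = 0; k < Z; ++k)
                    a[(i * Y + j) * Z + k] += 1;
        delete[] a;
    }

    // Pointer version: an int*** built from many separate allocations,
    // costing two extra dereferences per element.
    void test_pointer() {
        int*** a = new int**[X];
        for (int i = 0; i < X; ++i) {
            a[i] = new int*[Y];
            for (int j = 0; j < Y; ++j)
                a[i][j] = new int[Z]();
        }
        for (int i = 0; i < X; ++i)
            for (int j = 0; j < Y; ++j)
                for (int k = 0; k < Z; ++k)
                    a[i][j][k] += 1;
        for (int i = 0; i < X; ++i) {
            for (int j = 0; j < Y; ++j)
                delete[] a[i][j];
            delete[] a[i];
        }
        delete[] a;
    }

    int main() {
        test_array();
        test_pointer();
        return 0;
    }

Either way, every one of the 8 million ints is pulled from main memory
and pushed back once per pass, and that traffic is what dominates.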
A real cache is only set-associative and its replacement policy is only
approximately LRU. With the highly regular access pattern over your
array, that won't matter: it will still discard the oldest contents and
keep the newest. Those newest contents will never be reused before they
become the oldest, so the cache's LRU decisions will be 100% wrong for
predicting the actual access pattern.
But when you allocate many chunks with new, the addresses won't be as
consistent, so the limited associativity will cause the cache to make
discard decisions that deviate substantially from LRU. It is possible
that some very old cache contents will be kept while newer contents are
discarded. A few cache lines might then still be valid from the last
pass when they come around again in the next pass, saving those memory
accesses and letting the whole loop run a little faster.
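
To illustrate how that can happen, here is a small sketch of how
addresses map to sets in a set-associative cache; the line size and set
count are assumptions (roughly a 2 MB, 8-way cache), not measurements of
any particular CPU:

    #include <cstdio>
    #include <cstdint>

    // Assumed cache geometry: 64-byte lines, 4096 sets (2 MB, 8-way).
    constexpr std::uintptr_t kLineSize = 64;
    constexpr std::uintptr_t kNumSets  = 4096;

    // Which set a given address falls into.
    std::uintptr_t set_index(const void* p) {
        return (reinterpret_cast<std::uintptr_t>(p) / kLineSize) % kNumSets;
    }

    int main() {
        // A contiguous array marches through the sets in strict rotation,
        // so the line evicted really is the oldest one in its set.
        static int big[1 << 20];
        std::printf("contiguous: first line in set %ju, last in set %ju\n",
                    (std::uintmax_t)set_index(&big[0]),
                    (std::uintmax_t)set_index(&big[(1 << 20) - 1]));

        // Separately new'ed chunks land wherever the allocator puts them,
        // so several fresh lines can collide in one set while old lines
        // parked in other sets survive into the next pass.
        int* a = new int[1024];
        int* b = new int[1024];
        std::printf("chunk a in set %ju, chunk b in set %ju\n",
                    (std::uintmax_t)set_index(a), (std::uintmax_t)set_index(b));
        delete[] a;
        delete[] b;
        return 0;
    }

With eight ways per set, nine distinct lines mapping to one set force an
eviction there even while lines parked in other sets stay resident,
which is exactly the deviation from global LRU described above.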
If we assume a generous number of ints fit in cache, say half a million,
and assume absurdly high disruption of LRU causing half of those cache
lines to be preserved between passes, then a quarter million of the
8 million accesses could be saved on nine out of ten passes through the
loop: 0.9/32 = 2.8% of the accesses, or nearly 2.8% of total time.
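
Spelled out as a quick sanity check, using the same back-of-envelope
numbers as above:

    #include <cstdio>

    int main() {
        // Assumptions from the estimate above: 8 million accesses per
        // pass, half a million ints resident in cache, half of those
        // lines surviving between passes, reuse on 9 of 10 passes.
        const double accesses_per_pass = 8e6;
        const double surviving_ints    = 0.5e6 / 2;  // a quarter million
        const double saved = 0.9 * surviving_ints / accesses_per_pass;
        std::printf("fraction of accesses saved: %.4f\n", saved);  // 0.0281
        return 0;
    }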
So for a speedup of up to 2.8%, this might be the reason (though that
would require much luckier LRU disruption than I'd expect). If the
speedup is more than 2.8%, it must be something else entirely.
Christoph Bartoschek wrote:
Hi,
compiling the following code with g++ -O2 gives me unexpected results. The
test_pointer() function seems to be faster although it uses more memory and
has to perform lots of pointer dereferences. Replacing line 19 by lines 15-18
has no effect on the runtime.
Why is simulating a 3D array with pointers faster than using a contiguous
memory array?