How much faster? I looked for a bug in the code in case one version was
doing more or less work than you expected, but I didn't find one. The
size of the speedup might give a better hint at the nature of the bug,
assuming there is one.
Assuming no bug, there is one subtle effect that might make the pointer
version a little faster:
First, you should be aware that by accessing 8 million integers
sequentially you are flushing the cache, and the execution time will be
dominated by the memory accesses needed to fill it. Every int in cache
will be updated just once before being written back. That happens per
cache line rather than per individual int, but the ints are still read
from main memory and written to main memory for every update, never
reused while in cache.
The actual instructions used to do the access and update don't matter
much, if at all. Nor does the overhead of the pointers. Anything
happening within the CPU or cache is essentially free compared to the
time needed to read and write main memory.
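
The original listing isn't shown above, so here is a minimal sketch of
the two variants presumably being compared; the identifiers, dimensions,
and loop body are my own guesses rather than the original code, chosen
only so that 200*200*200 = 8 million ints are touched:

    // Hypothetical dimensions giving 8 million ints in total.
    const int X = 200, Y = 200, Z = 200;

    // Contiguous version: one allocation, index arithmetic per element.
    void test_array() {
        int* a = new int[X * Y * Z]();
        for (int i = 0; i < X; ++i)
            for (int j = 0; j < Y; ++j)
                for (int k = 0; k < Z; ++k)
                    a[(i * Y + j) * Z + k] += 1;
        delete[] a;
    }

    // Pointer version: an int*** built from many separate allocations,
    // costing two extra dereferences per element.
    void test_pointer() {
        int*** a = new int**[X];
        for (int i = 0; i < X; ++i) {
            a[i] = new int*[Y];
            for (int j = 0; j < Y; ++j)
                a[i][j] = new int[Z]();
        }
        for (int i = 0; i < X; ++i)
            for (int j = 0; j < Y; ++j)
                for (int k = 0; k < Z; ++k)
                    a[i][j][k] += 1;
        for (int i = 0; i < X; ++i) {
            for (int j = 0; j < Y; ++j)
                delete[] a[i][j];
            delete[] a[i];
        }
        delete[] a;
    }

    int main() {
        test_array();
        test_pointer();
        return 0;
    }

Either way, every one of the 8 million ints is pulled from main memory
and pushed back once per pass, and that traffic is what dominates.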
A real cache is only set-associative and its replacement policy is only
approximately LRU. With the highly regular access pattern over your
array, that won't matter: it will still discard the oldest contents and
keep the newest. Those newest contents will never be reused before they
become the oldest, so the cache's LRU decisions will be 100% wrong for
predicting the actual access pattern.
But when you allocate many chunks with new, the addresses won't be as
consistent, so the limited associativity will cause the cache to make
discard decisions that deviate substantially from LRU. It is possible
that some very old cache contents will be kept while newer contents are
discarded. A few cache lines might then still be valid from the last
pass when they come around again in the next pass, saving those memory
accesses and letting the whole loop run a little faster.
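
To illustrate how that can happen, here is a small sketch of how
addresses map to sets in a set-associative cache; the line size and set
count are assumptions (roughly a 2 MB, 8-way cache), not measurements of
any particular CPU:

    #include <cstdio>
    #include <cstdint>

    // Assumed cache geometry: 64-byte lines, 4096 sets (2 MB, 8-way).
    constexpr std::uintptr_t kLineSize = 64;
    constexpr std::uintptr_t kNumSets  = 4096;

    // Which set a given address falls into.
    std::uintptr_t set_index(const void* p) {
        return (reinterpret_cast<std::uintptr_t>(p) / kLineSize) % kNumSets;
    }

    int main() {
        // A contiguous array marches through the sets in strict rotation,
        // so the line evicted really is the oldest one in its set.
        static int big[1 << 20];
        std::printf("contiguous: first line in set %ju, last in set %ju\n",
                    (std::uintmax_t)set_index(&big[0]),
                    (std::uintmax_t)set_index(&big[(1 << 20) - 1]));

        // Separately new'ed chunks land wherever the allocator puts them,
        // so several fresh lines can collide in one set while old lines
        // parked in other sets survive into the next pass.
        int* a = new int[1024];
        int* b = new int[1024];
        std::printf("chunk a in set %ju, chunk b in set %ju\n",
                    (std::uintmax_t)set_index(a), (std::uintmax_t)set_index(b));
        delete[] a;
        delete[] b;
        return 0;
    }

With eight ways per set, nine distinct lines mapping to one set force an
eviction there even while lines parked in other sets stay resident,
which is exactly the deviation from global LRU described above.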
If we assume a generous number of ints fit in cache, say half a million,
and assume absurdly high disruption of LRU causing half of those cache
lines to be preserved between passes, then a quarter million of the
8 million accesses could be saved on nine out of ten passes through the
loop: 0.9/32 = 2.8% of the accesses, or nearly 2.8% of total time.
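
Spelled out as a quick sanity check, using the same back-of-envelope
numbers as above:

    #include <cstdio>

    int main() {
        // Assumptions from the estimate above: 8 million accesses per
        // pass, half a million ints resident in cache, half of those
        // lines surviving between passes, reuse on 9 of 10 passes.
        const double accesses_per_pass = 8e6;
        const double surviving_ints    = 0.5e6 / 2;  // a quarter million
        const double saved = 0.9 * surviving_ints / accesses_per_pass;
        std::printf("fraction of accesses saved: %.4f\n", saved);  // 0.0281
        return 0;
    }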
So for a speedup of up to 2.8%, this might be the reason (though that
would require much luckier LRU disruption than I'd expect). If the
speedup is more than 2.8%, it must be something else entirely.
Christoph Bartoschek wrote:
Hi,
compiling the following code with g++ -O2 gives me unexpected results. The
test_pointer() function seems to be faster although it uses more memory and
has to perform lots of pointer dereferences. Replacing line 19 by lines 15-18
has no effect on the runtime.
Why is simulating a 3D array with pointers faster than using a contiguous
memory array?