On 20/05/11 07:28, Ingo Molnar wrote:
>
> * Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
>> On Thu, May 19, 2011 at 10:12 AM, Linus Torvalds
>> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> Now, notice that right now I'm *only* talking about removing it for
>>> the "hlist" cases (patch attached). I suspect we should do the same
>>> thing for all the list helpers.
>>
>> Actually, it's the "rcu" versions of the hlist helpers that need this
>> most, since those are the performance-critical ones and the ones used
>> in avc traversal. So the previous patch did nothing.
>>
>> So here's the actual patch I think I should commit.
>>
>> Added davem, benh and rmk explicitly - I think you're on linux-arch,
>> but still.. You may have machines that like prefetch more,

OK, I tested this on an Alpha, since the Alpha Arch. Manual and the
Alpha Compiler Writers' Guide give examples of software prefetching and
say it's a good thing to do in certain situations.

I ran a 2.6.39 kernel and then a kernel with Linus' patch to remove the
prefetches. I ran the tests on a pretty quiescent system. The Alpha
(with EV67 CPU) can only count four types of hardware events. The raw
event 4 below is the Mbox Replay Trap: while reordering instruction
execution, the CPU realises it has cocked up and lost track of data
relationships with memory, so it has to completely toss the pipeline and
restart it. I also did a 'make -j2' since I have far fewer CPUs.

With prefetching in the hlist:

 Performance counter stats for 'make -j2' (100 runs):

    24,410,837,606  cycles                             ( +-  0.042% )  (scaled from 50.45%)
    23,667,556,717  instructions     #   0.970 IPC     ( +-  0.021% )  (scaled from 50.48%)
       108,215,598  cache-misses                       ( +-  1.300% )  (scaled from 50.46%)
       115,401,247  raw 0x4                            ( +-  0.099% )  (scaled from 25.66%)

      38.221901985  seconds time elapsed               ( +-  0.676% )

Without prefetching:

 Performance counter stats for 'make -j2' (100 runs):

    24,344,492,146  cycles                             ( +-  0.051% )  (scaled from 50.45%)
    23,669,009,135  instructions     #   0.972 IPC     ( +-  0.023% )  (scaled from 50.46%)
       106,124,519  cache-misses                       ( +-  1.233% )  (scaled from 50.46%)
       115,385,694  raw 0x4                            ( +-  0.105% )  (scaled from 25.67%)

      38.232319169  seconds time elapsed               ( +-  0.956% )

The execution time and the number of instructions executed are the same
to within measurement uncertainty. Prefetching increases the cycle count
by 0.27%, which is statistically significant but smaller than the
increase reported for x86. The differences in cache misses and Mbox
replay traps are not statistically significant.

Thus there is no harm to the Alpha architecture in removing the
prefetches, and possibly a very small advantage.

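As a rough check on the significance claims above, the comparison can be sketched in C. This is only an illustration, not how the figures in this mail were derived: the helper names `rel_diff_percent` and `exceeds_noise` are made up for the example, and comparing the difference against the plain sum of the two quoted "+-" percentages is an assumed, deliberately conservative criterion.

```c
#include <assert.h>

/* Relative difference of `with` against `without`, in percent. */
static double rel_diff_percent(double with, double without)
{
    assert(without != 0.0);
    return (with - without) / without * 100.0;
}

/*
 * Assumed criterion (not from the original mail): call a difference
 * "outside the noise" when its magnitude exceeds the sum of the two
 * runs' quoted relative uncertainties, all in percent.
 */
static int exceeds_noise(double diff_pct, double err_a_pct, double err_b_pct)
{
    double magnitude = diff_pct < 0.0 ? -diff_pct : diff_pct;

    return magnitude > err_a_pct + err_b_pct;
}
```

Fed the cycle counts from the two runs (24,410,837,606 and 24,344,492,146, with uncertainties 0.042% and 0.051%), `rel_diff_percent` gives about 0.27%, which is above the 0.093% combined bound. The cache-miss difference of about 1.97% stays below its combined bound of 2.533%, in line with the conclusions drawn above.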
I also ran Ingo's user space test:

> #include <stdlib.h>
> #include <stdio.h>
> #include <time.h>
>
> static inline void prefetch(const void *x)
> {
>         asm volatile ("prefetchnta (%0)":: "r" (x));

Well, I replaced that line with:

        __builtin_prefetch(x, 0, 3);

> }
>
> #define BILLION (1000*1000*1000)
>
> int main (void)
> {
>         int i;
>
>         for (i = 0; i < BILLION; i++) {
>                 prefetch(NULL);
>                 prefetch(&i);
>         }
>
>         return 0;
> }

and I measured the following:

 Performance counter stats for './prefetch_1b' (3 runs):

   181,838,732,972  cycles                             ( +-  0.251% )  (scaled from 50.01%)
    74,235,333,145  instructions     #   0.408 IPC     ( +-  0.282% )  (scaled from 50.01%)
     5,137,103,532  cache-misses                       ( +- 94.820% )  (scaled from 49.99%)
     1,003,419,087  raw 0x4                            ( +-  0.018% )  (scaled from 25.01%)

     292.441356154  seconds time elapsed               ( +-  2.925% )

What a shocker: only 0.4 IPC and an apparent 74 instructions per loop
iteration!

Running it again with the prefetch(NULL) changed to prefetch(&i):

 Performance counter stats for './prefetch_1a' (3 runs):

     2,013,886,830  cycles                             ( +-  0.054% )  (scaled from 49.97%)
     5,999,675,015  instructions     #   2.979 IPC     ( +-  0.037% )  (scaled from 50.02%)
         4,902,846  cache-misses                       ( +- 14.257% )  (scaled from 50.04%)
            54,498  raw 0x4                            ( +-  2.503% )  (scaled from 24.98%)

       3.080415963  seconds time elapsed               ( +-  0.560% )

Ah nice; an obvious 6 instructions per loop iteration, taking 2 cycles
to run, for almost exactly 3 IPC.

I think the problem with prefetching NULL (it is relevant that I have
the kernel config option CONFIG_DEFAULT_MMAP_MIN_ADDR=8192, so address 0
is not mapped in user space) is that the Alpha hardware does not
necessarily dismiss a prefetch to an unmapped memory address. Instead it
may trap through to the PALcode, which is then required to dismiss the
prefetch without passing control to the kernel. The user space prefetch
example is therefore illustrating the quite substantial cost of CPU
traps through to PALcode.

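For user-space code that still wants a prefetch in a list walk, one way to avoid the NULL case measured above is to guard the prefetch. The following is a minimal sketch under stated assumptions, not kernel code: `struct node`, `prefetch_nonnull` and `sum_list` are hypothetical names, `__builtin_prefetch` assumes GCC or a compatible compiler, and whether the extra branch is cheaper than simply dropping the prefetch has not been measured here.

```c
#include <stddef.h>

/* Hypothetical singly linked list node, for illustration only. */
struct node {
    struct node *next;
    long value;
};

/* Issue a read prefetch only for non-NULL pointers, so the end of the
 * list never triggers a prefetch of address 0. */
static inline void prefetch_nonnull(const void *p)
{
    if (p)
        __builtin_prefetch(p, 0, 3);
}

/* Walk the list, prefetching the next node while summing the values. */
static long sum_list(const struct node *head)
{
    long sum = 0;

    for (const struct node *n = head; n; n = n->next) {
        prefetch_nonnull(n->next);
        sum += n->value;
    }
    return sum;
}
```

The guard only removes the prefetch of the terminating NULL; it says nothing about whether prefetching the remaining nodes helps, which is the question the kernel measurements in this thread address.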
While this is not a concern for the kernel (it obviously has access to
location 0), I can imagine some neophyte thinking (as I have done
myself) "I'll look to see how the kernel implements lists and use that,
because it will be both clever and well tested code" without realising
that the prefetch of a NULL is very inefficient for userspace!

Cheers
Michael.