So I spent some time the last few days looking at kernel performance on a load that I personally see all the time: "make" on a tree that is pretty much fully built. So we don't pay the actual cost of compilation, just the cost of _checking_ what needs to be compiled.

It turns out that "make" is pretty piggish (what else is new), but this is actually a fairly kernel-intensive load: there's a fair amount of time spent in filename lookup. And we're doing very well in the filename lookup, except now that I can enable SELinux and not lose the advantage of RCU lookup, I see selinux - and particularly AVC - sucking dead donkey fetuses through a straw.

However, the top offender in the top function (avc_has_perm_noaudit()) was actually our generic hlist software prefetching. On both my Atom laptop (which I was compiling kernels on because I was off getting my master diver certification) and on my Core i5-670 workstation, the profiles clearly showed the prefetch instruction to be the top offender by far.

And it's not because it takes all the cache misses. Removing the prefetch actually IMPROVES PERFORMANCE. It turns out that the prefetch apparently stalls the pipeline when it doesn't hit in the L1 dTLB. And that pretty much *always* happens, for a very simple reason: the hlist walk will eventually encounter the final NULL. In fact, since hlists are used for hash tables, it usually encounters that NULL almost immediately.

In fact, from what I can tell, at least in the avc_has_perm_noaudit() case the NULL is the *common* case: the hash chains are usually short, and if a chain has a single entry (which is common), the *only* prefetch we do is for that NULL pointer. I'm not kidding. In hlist_for_each[_entry](), we don't actually prefetch the first entry ("head->first"), but we *will* prefetch the NULL ("pos->next").

That's just incredible crap. It's incredible crap that costs us 0.5% on the simple "make a fully built kernel" benchmark on some of the most common modern hardware out there. Seriously.

Now, Ingo did some more testing, and on his hardware it looks like even if you special-case NULL, prefetching at all still just hurts, because it causes more L1 dcache activity. For a hash chain, if you actually hit the entry you wanted, you do NOT want to prefetch the next one - even if it isn't NULL. So quite apart from the fact that "prefetch(NULL)" is apparently a bad thing on some x86 microarchitectures, it really looks like the prefetching is simply bad - full stop.

So I've got a few choices:

 (a) Just stop using prefetch on x86, because the hardware implementation of the software prefetch is clearly broken on common hardware, and the hardware prefetchers do a better job *without* us messing around. Plus it gets rid of stupid cache pollution.

 (b) Make the x86 prefetch conditional and avoid prefetching NULL. But see above: it improves things, but not as much as not prefetching at all.

 (c) Just stop the insane prefetching in the hlist cases.

And quite frankly, right now I think (c) is the right choice. The fact is, with a hash table it might make some sense to prefetch the first entry - but prefetching every entry *except* the first one, and then the terminating NULL on top of that? That's just crazy.

I'm not going to do (b). Making the prefetch conditional is stupid: the conditional isn't going to predict well (think hash chains again - even when they aren't empty or single-entry, we're talking *short* chains, or you have bigger problems).
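Just to spell out the single-entry case, here's a sketch of what the pre-patch hlist_for_each (quoted from the diff below) actually does on a one-entry hash chain:

	/* Pre-patch iterator, as removed by the diff below: */
	#define hlist_for_each(pos, head) \
		for (pos = (head)->first; pos && ({ prefetch(pos->next); 1; }); \
		     pos = pos->next)

	/*
	 * Walking a one-entry chain: head->first = E, E->next = NULL.
	 *
	 *   1st test: pos = E, non-NULL -> prefetch(E->next) == prefetch(NULL)
	 *   2nd test: pos = NULL        -> loop exits, nothing prefetched
	 *
	 * So the only prefetch ever issued is for NULL, while head->first -
	 * the one pointer we are guaranteed to dereference - is never
	 * prefetched at all.
	 */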
And while I could just tell the x86 people to stop generating the prefetch instruction at all, I do feel that that makes it totally pointless to ever have software prefetch hints in the kernel source code, if the major architecture is known to simply ignore them because they are useless. But before I commit this, I'll give the architecture people a heads-up.

Now, notice that right now I'm *only* talking about removing it for the "hlist" cases (patch attached). I suspect we should do the same thing for all the list helpers. If some particular piece of code actually knows its lists are long and the prefetch is worth it, maybe we could do it explicitly, on a case-by-case basis (see the sketch after the patch for what that might look like). But this "prefetch the next entry in the hlist loop" default just seems like utter crap.

Comments?

Linus
 include/linux/list.h |    9 ++++-----
 1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index 3a54266a1e85..9ac11148e037 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -664,8 +664,7 @@ static inline void hlist_move_list(struct hlist_head *old,
 #define hlist_entry(ptr, type, member) container_of(ptr,type,member)
 
 #define hlist_for_each(pos, head) \
-	for (pos = (head)->first; pos && ({ prefetch(pos->next); 1; }); \
-	     pos = pos->next)
+	for (pos = (head)->first; pos ; pos = pos->next)
 
 #define hlist_for_each_safe(pos, n, head) \
 	for (pos = (head)->first; pos && ({ n = pos->next; 1; }); \
@@ -680,7 +679,7 @@ static inline void hlist_move_list(struct hlist_head *old,
  */
 #define hlist_for_each_entry(tpos, pos, head, member)			 \
 	for (pos = (head)->first;					 \
-	     pos && ({ prefetch(pos->next); 1;}) &&			 \
+	     pos &&							 \
 	     ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
 	     pos = pos->next)
 
@@ -692,7 +691,7 @@ static inline void hlist_move_list(struct hlist_head *old,
  */
 #define hlist_for_each_entry_continue(tpos, pos, member)		 \
 	for (pos = (pos)->next;						 \
-	     pos && ({ prefetch(pos->next); 1;}) &&			 \
+	     pos &&							 \
 	     ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
 	     pos = pos->next)
 
@@ -703,7 +702,7 @@ static inline void hlist_move_list(struct hlist_head *old,
  * @member: the name of the hlist_node within the struct.
  */
 #define hlist_for_each_entry_from(tpos, pos, member)			 \
-	for (; pos && ({ prefetch(pos->next); 1;}) &&			 \
+	for (; pos &&							 \
 	     ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1;}); \
 	     pos = pos->next)
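For completeness, here's what the case-by-case opt-in mentioned above might look like once the iterator macros stop prefetching. This is a hypothetical caller - struct my_entry, my_lookup() and the bucket layout are all invented for illustration, not actual kernel code:

	#include <linux/list.h>
	#include <linux/prefetch.h>
	#include <linux/types.h>

	/* Hypothetical entry type hanging off a hash bucket. */
	struct my_entry {
		struct hlist_node node;
		u32 key;
	};

	/*
	 * A caller that somehow knows its chains are long could opt in to
	 * prefetching explicitly, now that the macro no longer does it
	 * behind our back. Note that this still issues the pointless
	 * prefetch(NULL) on the last iteration - which is exactly why it
	 * only makes sense for genuinely long chains, if at all.
	 */
	static struct my_entry *my_lookup(struct hlist_head *bucket, u32 key)
	{
		struct my_entry *e;
		struct hlist_node *pos;

		hlist_for_each_entry(e, pos, bucket, node) {
			prefetch(pos->next);	/* explicit, opt-in prefetch */
			if (e->key == key)
				return e;
		}
		return NULL;
	}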