[RFC][PATCH] spin loop arch primitives for busy waiting

Nicholas Piggin <npiggin@xxxxxxxxx> · Mon, 3 Apr 2017 18:13:28 +1000

Hi,

I would like to revisit this again and see if people are opposed to this
arch primitive. We have attributed cases of suboptimal performance on
real customer workloads to this, so I'd like to find a solution.

Since last posting, I promised the s390 people I'd consider hypervisor
yield additions. I'd like to punt that until after getting the basic
primitives in. HV yielding can still be done within these loops, but
there's no real simple recipe that's useful to add, that I've found yet.
We have cpu_relax_yield, but it's barely used so I prefer to avoid
over-engineering something that's not well tested.

Thanks,
Nick


Current busy-wait loops are implemented by once calling cpu_relax() to
give a low latency arch option for improving power and/or SMT resource
consumption.

This poses some difficulties for powerpc, which has SMT priority setting
instructions where relative priorities between threads determine how
ifetch cycles are apportioned. cpu_relax() is implemented by setting a
low priority then setting normal priority. This has several problems:

 - Changing thread priority can have some execution cost and potential
   impact to other threads in the core. It's inefficient to execute them
   every time around a busy-wait loop.

 - Depending on implementation details, a `low ; medium` sequence may
   not have much if any affect. Some software with similar pattern
   actually inserts a lot of nops between, in order to cause a few
   fetch cycles with the low priority.

 - The rest of the busy-wait loop runs with regular priority. This
   might only be a few fetch cycles, but if there are several threads
   running such loops, they could cause a noticable impact on a non-idle
   thread.

Implement spin_do {} spin_while(), and spin_do {} spin_until()
primitives ("until" tends to be quite readable for busy-wait loops),
which allows powerpc to enter low SMT priority, run the loop, then enter
normal SMT priority.

The loops have some restrictions on what can be used, but they are
intended to be small and simple so it's not generally a problem:
 - Don't use cpu_relax.
 - Don't use return or goto.
 - Don't use sleeping or spinning primitives.

---
 include/linux/processor.h            | 43 ++++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+)
 create mode 100644 include/linux/processor.h

diff --git a/include/linux/processor.h b/include/linux/processor.h
new file mode 100644
index 000000000000..282457cd9b67
--- /dev/null
+++ b/include/linux/processor.h
@@ -0,0 +1,43 @@
+/* Misc low level processor primitives */
+#ifndef _LINUX_PROCESSOR_H
+#define _LINUX_PROCESSOR_H
+
+#include <asm/processor.h>
+
+/*
+ * Begin a busy-wait loop, terminated with spin_while() or spin_until().
+ * This can be used in place of cpu_relax, and should be optimized to be
+ * used where wait times are expected to be less than the cost of a context
+ * switch.
+ *
+ * These loops should be very simple, and avoid calling cpu_relax, or
+ * any "spin" or sleep type of primitive. That should not cause a bug, just
+ * possible suboptimal behaviour on some implementations.
+ *
+ * The most common / fastpath exit case(s) should be tested in the
+ * spin_while / spin_until conditions. Uncommon cases can use break from
+ * within the loop. Return and goto must not be used to exit from the
+ * loop.
+ *
+ * Guest yielding and such techniques are to be implemented by the caller.
+ */
+#ifndef spin_do
+#define spin_do								\
+do {									\
+	do {								\
+		cpu_relax();						\
+#endif
+
+#ifndef spin_while
+#define spin_while(cond)						\
+	} while (cond);							\
+} while (0)
+#endif
+
+#ifndef spin_until
+#define spin_until(cond)						\
+	} while (!(cond));						\
+} while (0)
+#endif
+
+#endif /* _LINUX_PROCESSOR_H */
-- 
2.11.0