From c7255fa8b6fc7835c0eb6ab524aed3349cea1dca Mon Sep 17 00:00:00 2001 From: Akira Yokosawa <akiyks@xxxxxxxxx> Date: Sun, 1 Oct 2017 12:40:18 +0900 Subject: [PATCH 06/10] treewide: Use \Power{} macro for POWER CPU family Signed-off-by: Akira Yokosawa <akiyks@xxxxxxxxx> --- appendix/toyrcu/toyrcu.tex | 26 +++++++++++++------------- count/count.tex | 6 +++--- intro/intro.tex | 2 +- memorder/memorder.tex | 30 +++++++++++++++--------------- perfbook.tex | 1 + toolsoftrade/toolsoftrade.tex | 4 ++-- 6 files changed, 35 insertions(+), 34 deletions(-) diff --git a/appendix/toyrcu/toyrcu.tex b/appendix/toyrcu/toyrcu.tex index db45fad..2c65f74 100644 --- a/appendix/toyrcu/toyrcu.tex +++ b/appendix/toyrcu/toyrcu.tex @@ -73,7 +73,7 @@ Of course, only one RCU reader may be in its read-side critical section at a time, which almost entirely defeats the purpose of RCU. In addition, the lock operations in \co{rcu_read_lock()} and \co{rcu_read_unlock()} are extremely heavyweight, -with read-side overhead ranging from about 100~nanoseconds on a single Power5 +with read-side overhead ranging from about 100~nanoseconds on a single \Power{5} CPU up to more than 17~\emph{microseconds} on a 64-CPU system. Worse yet, these same lock operations permit \co{rcu_read_lock()} @@ -216,7 +216,7 @@ with a single global lock. Furthermore, the read-side overhead, though high at roughly 140 nanoseconds, remains at about 140 nanoseconds regardless of the number of CPUs. However, the update-side overhead ranges from about 600 nanoseconds -on a single Power5 CPU +on a single \Power{5} CPU up to more than 100 \emph{microseconds} on 64 CPUs. \QuickQuiz{} @@ -368,7 +368,7 @@ However, this implementations still has some serious shortcomings.
First, the atomic operations in \co{rcu_read_lock()} and \co{rcu_read_unlock()} are still quite heavyweight, with read-side overhead ranging from about 100~nanoseconds on -a single Power5 CPU up to almost 40~\emph{microseconds} +a single \Power{5} CPU up to almost 40~\emph{microseconds} on a 64-CPU system. This means that the RCU read-side critical sections have to be extremely long in order to get any real @@ -718,9 +718,9 @@ In fact, they are more complex than those of the single-counter variant shown in Figure~\ref{fig:app:toyrcu:RCU Implementation Using Single Global Reference Counter}, with the read-side primitives consuming about 150~nanoseconds on a single -Power5 CPU and almost 40~\emph{microseconds} on a 64-CPU system. +\Power{5} CPU and almost 40~\emph{microseconds} on a 64-CPU system. The update-side \co{synchronize_rcu()} primitive is more costly as -well, ranging from about 200~nanoseconds on a single Power5 CPU to +well, ranging from about 200~nanoseconds on a single \Power{5} CPU to more than 40~\emph{microseconds} on a 64-CPU system. This means that the RCU read-side critical sections have to be extremely long in order to get any real @@ -963,9 +963,9 @@ environments. That said, the read-side primitives scale very nicely, requiring about 115~nanoseconds regardless of whether running on a single-CPU or a 64-CPU -Power5 system. +\Power{5} system. As noted above, the \co{synchronize_rcu()} primitive does not scale, -ranging in overhead from almost a microsecond on a single Power5 CPU +ranging in overhead from almost a microsecond on a single \Power{5} CPU up to almost 200~microseconds on a 64-CPU system. This implementation could conceivably form the basis for a production-quality user-level RCU implementation. @@ -1340,9 +1340,9 @@ destruction will not be reordered into the preceding loop. This approach achieves much better read-side performance, incurring roughly 63~nanoseconds of overhead regardless of the number of -Power5 CPUs. +\Power{5} CPUs. 
Updates incur more overhead, ranging from about 500~nanoseconds on -a single Power5 CPU to more than 100~\emph{microseconds} on 64 +a single \Power{5} CPU to more than 100~\emph{microseconds} on 64 such CPUs. \QuickQuiz{} @@ -1542,9 +1542,9 @@ This approach achieves read-side performance almost equal to that shown in Section~\ref{sec:app:toyrcu:RCU Based on Free-Running Counter}, incurring roughly 65~nanoseconds of overhead regardless of the number of -Power5 CPUs. +\Power{5} CPUs. Updates again incur more overhead, ranging from about 600~nanoseconds on -a single Power5 CPU to more than 100~\emph{microseconds} on 64 +a single \Power{5} CPU to more than 100~\emph{microseconds} on 64 such CPUs. \QuickQuiz{} @@ -1866,11 +1866,11 @@ This implementation has blazingly fast read-side primitives, with an \co{rcu_read_lock()}-\co{rcu_read_unlock()} round trip incurring an overhead of roughly 50~\emph{picoseconds}. The \co{synchronize_rcu()} overhead ranges from about 600~nanoseconds -on a single-CPU Power5 system up to more than 100~microseconds on +on a single-CPU \Power{5} system up to more than 100~microseconds on a 64-CPU system. \QuickQuiz{} - To be sure, the clock frequencies of Power + To be sure, the clock frequencies of \Power{} systems in 2008 were quite high, but even a 5\,GHz clock frequency is insufficient to allow loops to be executed in 50~picoseconds! diff --git a/count/count.tex b/count/count.tex index 73b6866..a38aba1 100644 --- a/count/count.tex +++ b/count/count.tex @@ -3330,7 +3330,7 @@ will expand on these lessons. \path{count_end_rcu.c} & \ref{sec:together:RCU and Per-Thread-Variable-Based Statistical Counters} & 5.7 ns & 354 ns & 501 ns \\ \end{tabular} -\caption{Statistical Counter Performance on Power-6} +\caption{Statistical Counter Performance on \Power{6}} \label{tab:count:Statistical Counter Performance on Power-6} \end{table*} @@ -3410,14 +3410,14 @@ courtesy of eventual consistency. 
\path{count_lim_sig.c} & \ref{sec:count:Signal-Theft Limit Counter Implementation} & Y & 10.2 ns & 370 ns & 54,000 ns \\ \end{tabular} -\caption{Limit Counter Performance on Power-6} +\caption{Limit Counter Performance on \Power{6}} \label{tab:count:Limit Counter Performance on Power-6} \end{table*} Figure~\ref{tab:count:Limit Counter Performance on Power-6} shows the performance of the parallel limit-counting algorithms. Exact enforcement of the limits incurs a substantial performance -penalty, although on this 4.7\,GHz Power-6 system that penalty can be reduced +penalty, although on this 4.7\,GHz \Power{6} system that penalty can be reduced by substituting signals for atomic operations. All of these implementations suffer from read-side lock contention in the face of concurrent readers. diff --git a/intro/intro.tex b/intro/intro.tex index 8bed518..293a02f 100644 --- a/intro/intro.tex +++ b/intro/intro.tex @@ -77,7 +77,7 @@ that of a bicycle, courtesy of Moore's Law. Papers calling out the advantages of multicore CPUs were published as early as 1996~\cite{Olukotun96}. IBM introduced simultaneous multi-threading -into its high-end POWER family in 2000, and multicore in 2001. +into its high-end \Power{} family in 2000, and multicore in 2001. Intel introduced hyperthreading into its commodity Pentium line in November 2000, and both AMD and Intel introduced dual-core CPUs in 2005. diff --git a/memorder/memorder.tex b/memorder/memorder.tex index 7dc3fb4..944c17a 100644 --- a/memorder/memorder.tex +++ b/memorder/memorder.tex @@ -314,7 +314,7 @@ synchronization primitives (such as locking and RCU) that are responsible for maintaining the illusion of ordering through use of \emph{memory barriers} (for example, \co{smp_mb()} in the Linux kernel). 
These memory barriers can be explicit instructions, as they are on -ARM, POWER, Itanium, and Alpha, or they can be implied by other instructions, +ARM, \Power{}, Itanium, and Alpha, or they can be implied by other instructions, as they often are on x86. Since these standard synchronization primitives preserve the illusion of ordering, your path of least resistance is to simply use these primitives, @@ -827,7 +827,7 @@ if the shared variable had changed before entry into the loop. This allows us to plot each CPU's view of the value of \co{state.variable} over a 532-nanosecond time period, as shown in Figure~\ref{fig:memorder:A Variable With Multiple Simultaneous Values}. -This data was collected in 2006 on 1.5\,GHz POWER5 system with 8 cores, +This data was collected in 2006 on 1.5\,GHz \Power{5} system with 8 cores, each containing a pair of hardware threads. CPUs~1, 2, 3, and~4 recorded the values, while CPU~0 controlled the test. The timebase counter period was about 5.32\,ns, sufficiently fine-grained @@ -2043,7 +2043,7 @@ communicated to \co{P1()} long before it was communicated to \co{P2()}. \QuickQuizAnswer{ You need to face the fact that it really can trigger. Akira Yokosawa used the \co{litmus7} tool to run this litmus test - on a Power8 system. + on a \Power{8} system. Out of 1,000,000,000 runs, 4 triggered the \co{exists} clause. Thus, triggering the \co{exists} clause is not merely a one-in-a-million occurrence, but rather a one-in-a-hundred-million occurrence. @@ -3707,7 +3707,7 @@ dependencies. \rotatebox{90}{PA-RISC CPUs} \end{picture} & \begin{picture}(6,60)(0,0) - \rotatebox{90}{POWER} + \rotatebox{90}{\Power{}} \end{picture} & \begin{picture}(6,60)(0,0) \rotatebox{90}{SPARC TSO} @@ -4134,7 +4134,7 @@ For more on Alpha, see its reference manual~\cite{ALPHA2002}. The ARM family of CPUs is extremely popular in embedded applications, particularly for power-constrained applications such as cellphones. 
-Its memory model is similar to that of Power +Its memory model is similar to that of \Power{} (see Section~\ref{sec:memorder:POWER / PowerPC}, but ARM uses a different set of memory-barrier instructions~\cite{ARMv7A:2010}: @@ -4144,7 +4144,7 @@ different set of memory-barrier instructions~\cite{ARMv7A:2010}: subsequent operations of the same type. The ``type'' of operations can be all operations or can be restricted to only writes (similar to the Alpha \co{wmb} - and the POWER \co{eieio} instructions). + and the \Power{} \co{eieio} instructions). In addition, ARM allows cache coherence to have one of three scopes: single processor, a subset of the processors (``inner'') and global (``outer''). @@ -4168,7 +4168,7 @@ None of these instructions exactly match the semantics of Linux's \co{DMB}. The \co{DMB} and \co{DSB} instructions have a recursive definition of accesses ordered before and after the barrier, which has an effect -similar to that of POWER's cumulativity. +similar to that of \Power{}'s cumulativity. ARM also implements control dependencies, so that if a conditional branch depends on a load, then any store executed after that conditional @@ -4292,7 +4292,7 @@ memory barriers. \subsection{MIPS} The MIPS memory model~\cite[Table 6.6]{MIPSvII-A-2015} -appears to resemble that of ARM, Itanium, and Power, +appears to resemble that of ARM, Itanium, and \Power{}, being weakly ordered by default, but respecting dependencies. MIPS has a wide variety of memory-barrier instructions, but ties them not to hardware considerations, but rather to the use cases provided @@ -4325,7 +4325,7 @@ in a manner similar to the ARM64 additions: Informal discussions with MIPS architects indicates that MIPS has a definition of transitivity or cumulativity similar to that of -ARM and Power. +ARM and \Power{}. 
However, it appears that different MIPS implementations can have different memory-ordering properties, so it is important to consult the documentation for the specific MIPS implementation you are using. @@ -4339,10 +4339,10 @@ no code, however, they do use the gcc {\tt memory} attribute to disable compiler optimizations that would reorder code across the memory barrier. -\subsection{POWER / PowerPC} +\subsection{\Power{} / PowerPC} \label{sec:memorder:POWER / PowerPC} -The POWER and PowerPC\textsuperscript{\textregistered} +The \Power{} and PowerPC\textsuperscript{\textregistered} CPU families have a wide variety of memory-barrier instructions~\cite{PowerPC94,MichaelLyons05a}: \begin{description} @@ -4388,7 +4388,7 @@ The \co{smp_mb()} instruction is also defined to be the {\tt sync} instruction, but both \co{smp_rmb()} and \co{rmb()} are defined to be the lighter-weight {\tt lwsync} instruction. -Power features ``cumulativity'', which can be used to obtain +\Power{} features ``cumulativity'', which can be used to obtain transitivity. When used properly, any code seeing the results of an earlier code fragment will also see the accesses that this earlier code @@ -4396,11 +4396,11 @@ fragment itself saw. Much more detail is available from McKenney and Silvera~\cite{PaulEMcKenneyN2745r2009}. -Power respects control dependencies in much the same way that ARM -does, with the exception that the Power \co{isync} instruction +\Power{} respects control dependencies in much the same way that ARM +does, with the exception that the \Power{} \co{isync} instruction is substituted for the ARM \co{ISB} instruction. -Many members of the POWER architecture have incoherent instruction +Many members of the \Power{} architecture have incoherent instruction caches, so that a store to memory will not necessarily be reflected in the instruction cache. 
Thankfully, few people write self-modifying code these days, but JITs diff --git a/perfbook.tex b/perfbook.tex index da9cfa8..cc4f4b0 100644 --- a/perfbook.tex +++ b/perfbook.tex @@ -138,6 +138,7 @@ \newcommand{\qop}[1]{{\sffamily #1}} % QC operator such as H, T, S, etc. \DeclareRobustCommand{\euler}{\ensuremath{\mathrm{e}}} +\newcommand{\Power}[1]{POWER#1} \newcommand{\Epigraph}[2]{\epigraphhead[65]{\rmfamily\epigraph{#1}{#2}}} diff --git a/toolsoftrade/toolsoftrade.tex b/toolsoftrade/toolsoftrade.tex index 9cf3312..97a37d3 100644 @@ -1038,7 +1038,7 @@ Line~39 moves the lock-acquisition count to this thread's element of the \end{figure} Figure~\ref{fig:toolsoftrade:Reader-Writer Lock Scalability} -shows the results of running this test on a 64-core Power-5 system +shows the results of running this test on a 64-core \Power{5} system with two hardware threads per core for a total of 128 software-visible CPUs. The \co{thinktime} parameter was zero for all these tests, and the @@ -1137,7 +1137,7 @@ This situation will only get worse as you add CPUs. } \QuickQuizEnd \QuickQuiz{} - \Power{5} is several years old, and new hardware should be faster. So why should anyone worry about reader-writer locks being slow? \QuickQuizAnswer{ -- 2.7.4
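[Editor's note, not part of the patch: for readers unfamiliar with the macro this series introduces, the following minimal standalone sketch shows how the `\Power{}` command added to perfbook.tex expands. The document body text here is illustrative only; the macro definition itself is taken verbatim from the hunk against perfbook.tex.]

```latex
\documentclass{article}
% Definition exactly as added by this patch to perfbook.tex:
% \Power{5} expands to "POWER5", \Power{} to just "POWER".
\newcommand{\Power}[1]{POWER#1}
\begin{document}
% Empty braces delimit the macro name, so the following space survives:
The \Power{} family gained multicore in 2001.   % -> "The POWER family ..."
Measured on a single \Power{5} CPU.             % -> "... single POWER5 CPU."
This 4.7\,GHz \Power{6} system.                 % -> "This 4.7 GHz POWER6 system."
\end{document}
```

Centralizing the spelling in one macro is what lets this treewide patch replace the inconsistent "Power5", "POWER5", and "Power-5" forms with a single canonical rendering.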