Re: [PATCH 0/2] Minor updates

On Sat, 7 Dec 2019 17:15:50 -0800, Paul E. McKenney wrote:
> On Sun, Dec 08, 2019 at 08:41:45AM +0900, Akira Yokosawa wrote:
>> On Sat, 7 Dec 2019 08:43:08 -0800, Paul E. McKenney wrote:
>>> On Sat, Dec 07, 2019 at 01:05:17PM +0900, Akira Yokosawa wrote:
>>>> Hi Paul,
>>>>
>>>> This patch set fixes minor issues I noticed while reading your
>>>> recent updates.
>>>
>>> Queued and pushed, along with a fix to another of my typos, thank
>>> you very much!
>>>
>>>> Apart from the changes, I'd like you to mention in the answer to
>>>> Quick Quiz 9.43 that modern Intel CPUs don't execute x86_64
>>>> instructions directly, but decode them into uOPs (via MOP) and
>>>> keep them in a uOP cache [1].
>>>> So the execution time does not necessarily correspond to the
>>>> instruction count, but heavily depends on the behavior of the
>>>> microarchitecture, which is not predictable without actually
>>>> running the code.
>>>>
>>>> [1]: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)
>>>
>>> My thought is that I should review the "Hardware and its Habits" chapter,
>>> add this information if it is not already present, and then make the
>>> answer to this Quick Quiz refer back to that.  Does that seem reasonable?
>>
>> Yes, it sounds quite reasonable!
>>
>> (Skimming through the chapter...)
>>
>> So Section 3.1.1 touches lightly on pipelining.  Section 3.2 mostly discusses
>> memory sub-systems.
>>
>> Modern Intel architectures can be thought of as superscalar RISC
>> processors which emulate x86 ISA. The transformation of x86 instructions
>> into uOPs can be thought of as another layer of optimization
>> (sometimes "de-optimization" from a compiler writer's POV) ;-).
>>
>> But a deep dive into this topic would cost you another chapter/appendix.
>> I'm not sure it's worthwhile for perfbook.
>> Maybe it would suffice to lightly touch the difficulty of
>> predicting execution cycles of particular instruction streams
>> on modern microprocessors (not limited to Intel's), and put
>> a few citations of textbooks/reference manuals.
> 
> What I did was to add a rough diagram and a paragraph or two of
> explanation to Section 3.1.1, then add a reference to that section
> in the Quick Quiz.

I'd like to see a couple more keywords mentioned here besides
"pipeline".  "Super-scalar" is present in the Glossary, but
"superscalar" looks much more common these days.  Appended below is
a tentative patch I made to show you my idea.  Please feel free
to edit it as you'd like before applying.

Another point I'd like to suggest:
Figure 9.23 and the following figures still show the results on
a 16-CPU system.  It looks like it would be difficult to make
corresponding plots for the 448-thread system.  Could you add info
on the hardware system where those 16-CPU results were obtained at
the beginning of Section 9.5.4.2?

        Thanks, Akira

-------------8<-------------------
From 7e5c17a2e816a23bb12aa23f18ed96d5e820fc3a Mon Sep 17 00:00:00 2001
From: Akira Yokosawa <akiyks@xxxxxxxxx>
Date: Mon, 9 Dec 2019 00:23:59 +0900
Subject: [TENTATIVE PATCH] cpu/overview: Mention approaches other than 'pipelining'

Also remove "-" from "Super-scalar" in Glossary.

Signed-off-by: Akira Yokosawa <akiyks@xxxxxxxxx>
---
 cpu/overview.tex   | 22 ++++++++++++++--------
 defer/rcuusage.tex |  2 +-
 glossary.tex       |  4 ++--
 3 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/cpu/overview.tex b/cpu/overview.tex
index b80f47c1..191c1c68 100644
--- a/cpu/overview.tex
+++ b/cpu/overview.tex
@@ -42,11 +42,13 @@ In the early 1980s, the typical microprocessor fetched an instruction,
 decoded it, and executed it, typically taking \emph{at least} three
 clock cycles to complete one instruction before proceeding to the next.
 In contrast, the CPU of the late 1990s and of the 2000s execute
-many instructions simultaneously, using a deep \emph{pipeline} to control
+many instructions simultaneously, using a combination of approaches
+including \emph{pipeline}, \emph{superscalar}, \emph{out-of-order},
+and \emph{speculative} execution, to control
 the flow of instructions internally to the CPU.
 Some cores have more than one hardware thread, which is variously called
 \emph{simultaneous multithreading} (SMT) or \emph{hyperthreading}
-(HT)~\cite{JFennel1973SMT}.
+(HT)~\cite{JFennel1973SMT},
 each of which appears as
 an independent CPU to software, at least from a functional viewpoint.
 These modern hardware features can greatly improve performance, as
@@ -96,14 +98,17 @@ Figure~\ref{fig:cpu:CPU Meets a Pipeline Flush}.
 \end{figure}
 
 This gets even worse in the increasingly common case of hyperthreading
-(or SMT, if you prefer).
+(or SMT, if you prefer).\footnote{
+	Superscalar is involved in most cases, too.
+}
 In this case, all the hardware threads sharing a core also share that
 core's resources, including registers, cache, execution units, and so on.
-The instruction streams are decoded into micro-operations, and use of the
-shared execution units and the hundreds of hardware registers is coordinated
+The instruction streams might be decoded into micro-operations,
+and use of the shared execution units and the hundreds of hardware
+registers can be coordinated
 by a micro-operation scheduler.
-A rough diagram of a two-threaded core is shown in
-\Cref{fig:cpu:Rough View of Modern Micro-Architecture},
+A rough diagram of such a two-threaded core is shown in
+\cref{fig:cpu:Rough View of Modern Micro-Architecture},
 and more accurate (and thus more complex) diagrams are available in
 textbooks and scholarly papers.\footnote{
 	Here is one example for a late-2010s Intel CPU:
@@ -123,7 +128,8 @@ of clairvoyance.
 In particular, adding an instruction to a tight loop can sometimes
 actually speed up execution, counterintuitive though that might be.
 
-Unfortunately, pipeline flushes are not the only hazards in the obstacle
+Unfortunately, pipeline flushes and shared-resource contentions
+are not the only hazards in the obstacle
 course that modern CPUs must run.
 The next section covers the hazards of referencing memory.
 
diff --git a/defer/rcuusage.tex b/defer/rcuusage.tex
index 7fe633c3..fa04ddb6 100644
--- a/defer/rcuusage.tex
+++ b/defer/rcuusage.tex
@@ -139,7 +139,7 @@ that of the ideal synchronization-free workload.
 	are long gone.
 
 	But those of you who read
-	\Cref{sec:cpu:Pipelined CPUs}
+	\cref{sec:cpu:Pipelined CPUs}
 	carefully already knew all of this!
 
 	These counter-intuitive results of course means that any
diff --git a/glossary.tex b/glossary.tex
index c10ffe4e..4a3aa796 100644
--- a/glossary.tex
+++ b/glossary.tex
@@ -382,11 +382,11 @@
 	as well as its cache so as to ensure that the software sees
 	the memory operations performed by this CPU as if they
 	were carried out in program order.
-\item[Super-Scalar CPU:]
+\item[Superscalar CPU:]
 	A scalar (non-vector) CPU capable of executing multiple instructions
 	concurrently.
 	This is a step up from a pipelined CPU that executes multiple
-	instructions in an assembly-line fashion---in a super-scalar
+	instructions in an assembly-line fashion---in a superscalar
 	CPU, each stage of the pipeline would be capable of handling
 	more than one instruction.
 	For example, if the conditions were exactly right,
-- 
2.17.1



