On Mon, Dec 09, 2019 at 09:50:56PM +0900, Akira Yokosawa wrote:
> On Sun, 8 Dec 2019 10:11:20 -0800, Paul E. McKenney wrote:
> > On Mon, Dec 09, 2019 at 12:54:46AM +0900, Akira Yokosawa wrote:
> >> On Sat, 7 Dec 2019 17:15:50 -0800, Paul E. McKenney wrote:
> >>> On Sun, Dec 08, 2019 at 08:41:45AM +0900, Akira Yokosawa wrote:
> >>>> On Sat, 7 Dec 2019 08:43:08 -0800, Paul E. McKenney wrote:
> >>>>> On Sat, Dec 07, 2019 at 01:05:17PM +0900, Akira Yokosawa wrote:
> >>>>>> Hi Paul,
> >>>>>>
> >>>>>> This patch set fixes minor issues I noticed while reading your
> >>>>>> recent updates.
> >>>>>
> >>>>> Queued and pushed, along with a fix to another of my typos, thank
> >>>>> you very much!
> >>>>>
> >>>>>> Apart from the changes, I'd like you to mention in the answer to
> >>>>>> Quick Quiz 9.43 that modern Intel CPUs don't execute x86_64
> >>>>>> instructions directly, but decode them into uOPs (via MOPs) and
> >>>>>> keep them in a uOP cache [1].
> >>>>>> So the execution cycle count does not necessarily correspond to
> >>>>>> the instruction count, but depends heavily on the behavior of the
> >>>>>> microarch, which is not predictable without actually running the
> >>>>>> code.
> >>>>>>
> >>>>>> [1]: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)
> >>>>>
> >>>>> My thought is that I should review the "Hardware and its Habits"
> >>>>> chapter, add this information if it is not already present, and
> >>>>> then make the answer to this Quick Quiz refer back to that.
> >>>>> Does that seem reasonable?
> >>>>
> >>>> Yes, it sounds quite reasonable!
> >>>>
> >>>> (Skimming through the chapter...)
> >>>>
> >>>> So Section 3.1.1 lightly touches on pipelining, and Section 3.2
> >>>> mostly discusses memory sub-systems.
> >>>>
> >>>> Modern Intel architectures can be thought of as superscalar RISC
> >>>> processors which emulate the x86 ISA.
> >>>> The transformation of x86 instructions
> >>>> into uOPs can be thought of as another layer of optimization
> >>>> (sometimes "de-optimization" from a compiler writer's POV) ;-).
> >>>>
> >>>> But deep-diving into this topic would cost you another
> >>>> chapter/appendix, and I'm not sure it's worthwhile for perfbook.
> >>>> Maybe it would suffice to lightly touch on the difficulty of
> >>>> predicting execution cycles of particular instruction streams
> >>>> on modern microprocessors (not limited to Intel's), and put in
> >>>> a few citations of textbooks/reference manuals.
> >>>
> >>> What I did was to add a rough diagram and a paragraph or two of
> >>> explanation to Section 3.1.1, then add a reference to that section
> >>> in the Quick Quiz.
> >>
> >> I'd like to see a couple more keywords mentioned here other than
> >> "pipeline". "Super-scalar" is present in the Glossary, but
> >> "Superscalar" looks much more common these days. Appended below is
> >> a tentative patch I made to show you my idea. Please feel free
> >> to edit as you'd like before applying it.
> >>
> >> Another point I'd like to suggest.
> >> Figure 9.23 and the following figures still show results from a
> >> 16-CPU system, and it looks difficult to make corresponding plots
> >> for the 448-thread system. Can you add info on the HW system
> >> where those 16-CPU results were obtained at the beginning of
> >> Section 9.5.4.2?
> >>
> >> Thanks, Akira
> >>
> >> -------------8<-------------------
> >> >From 7e5c17a2e816a23bb12aa23f18ed96d5e820fc3a Mon Sep 17 00:00:00 2001
> >> From: Akira Yokosawa <akiyks@xxxxxxxxx>
> >> Date: Mon, 9 Dec 2019 00:23:59 +0900
> >> Subject: [TENTATIVE PATCH] cpu/overview: Mention approaches other than 'pipelining'
> >>
> >> Also remove "-" from "Super-scalar" in Glossary.
> >>
> >> Signed-off-by: Akira Yokosawa <akiyks@xxxxxxxxx>
> >
> > Good points, thank you!
> >
> > Applied with a few inevitable edits. ;-)
>
> Quite a few edits! Thank you.
>
> Let me reiterate my earlier suggestion:
>
> >> Another point I'd like to suggest.
> >> Figure 9.23 and the following figures still show results from a
> >> 16-CPU system, and it looks difficult to make corresponding plots
> >> for the 448-thread system. Can you add info on the HW system
> >> where those 16-CPU results were obtained at the beginning of
> >> Section 9.5.4.2?
>
> Can you look into this as well?

There are a few build issues, but the main problem has been that I have
needed to use that system to verify Linux-kernel fixes. The intent is to
regenerate most and maybe all of the results on the large system over
time. But I added the system's info in the meantime. ;-)

							Thanx, Paul

> Thanks, Akira
>
> >
> > 							Thanx, Paul
> >
> >> ---
> >>  cpu/overview.tex   | 22 ++++++++++++++--------
> >>  defer/rcuusage.tex |  2 +-
> >>  glossary.tex       |  4 ++--
> >>  3 files changed, 17 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/cpu/overview.tex b/cpu/overview.tex
> >> index b80f47c1..191c1c68 100644
> >> --- a/cpu/overview.tex
> >> +++ b/cpu/overview.tex
> >> @@ -42,11 +42,13 @@ In the early 1980s, the typical microprocessor fetched an instruction,
> >>  decoded it, and executed it, typically taking \emph{at least} three
> >>  clock cycles to complete one instruction before proceeding to the next.
> >>  In contrast, the CPU of the late 1990s and of the 2000s execute
> >> -many instructions simultaneously, using a deep \emph{pipeline} to control
> >> +many instructions simultaneously, using a combination of approaches
> >> +including \emph{pipeline}, \emph{superscalar}, \emph{out-of-order},
> >> +and \emph{speculative} execution, to control
> >>  the flow of instructions internally to the CPU.
> >>  Some cores have more than one hardware thread, which is variously called
> >>  \emph{simultaneous multithreading} (SMT) or \emph{hyperthreading}
> >> -(HT)~\cite{JFennel1973SMT}.
> >> +(HT)~\cite{JFennel1973SMT},
> >>  each of which appears as
> >>  an independent CPU to software, at least from a functional viewpoint.
> >>  These modern hardware features can greatly improve performance, as
> >> @@ -96,14 +98,17 @@ Figure~\ref{fig:cpu:CPU Meets a Pipeline Flush}.
> >>  \end{figure}
> >>
> >>  This gets even worse in the increasingly common case of hyperthreading
> >> -(or SMT, if you prefer).
> >> +(or SMT, if you prefer).\footnote{
> >> +	Superscalar is involved in most cases, too.
> >> +}
> >>  In this case, all the hardware threads sharing a core also share that
> >>  core's resources, including registers, cache, execution units, and so on.
> >> -The instruction streams are decoded into micro-operations, and use of the
> >> -shared execution units and the hundreds of hardware registers is coordinated
> >> +The instruction streams might be decoded into micro-operations,
> >> +and use of the shared execution units and the hundreds of hardware
> >> +registers can be coordinated
> >>  by a micro-operation scheduler.
> >> -A rough diagram of a two-threaded core is shown in
> >> -\Cref{fig:cpu:Rough View of Modern Micro-Architecture},
> >> +A rough diagram of such a two-threaded core is shown in
> >> +\cref{fig:cpu:Rough View of Modern Micro-Architecture},
> >>  and more accurate (and thus more complex) diagrams are available in
> >>  textbooks and scholarly papers.\footnote{
> >>  	Here is one example for a late-2010s Intel CPU:
> >> @@ -123,7 +128,8 @@ of clairvoyance.
> >>  In particular, adding an instruction to a tight loop can sometimes
> >>  actually speed up execution, counterintuitive though that might be.
> >>
> >> -Unfortunately, pipeline flushes are not the only hazards in the obstacle
> >> +Unfortunately, pipeline flushes and shared-resource contentions
> >> +are not the only hazards in the obstacle
> >>  course that modern CPUs must run.
> >>  The next section covers the hazards of referencing memory.
> >>
> >> diff --git a/defer/rcuusage.tex b/defer/rcuusage.tex
> >> index 7fe633c3..fa04ddb6 100644
> >> --- a/defer/rcuusage.tex
> >> +++ b/defer/rcuusage.tex
> >> @@ -139,7 +139,7 @@ that of the ideal synchronization-free workload.
> >>  	are long gone.
> >>
> >>  	But those of you who read
> >> -	\Cref{sec:cpu:Pipelined CPUs}
> >> +	\cref{sec:cpu:Pipelined CPUs}
> >>  	carefully already knew all of this!
> >>
> >>  	These counter-intuitive results of course means that any
> >> diff --git a/glossary.tex b/glossary.tex
> >> index c10ffe4e..4a3aa796 100644
> >> --- a/glossary.tex
> >> +++ b/glossary.tex
> >> @@ -382,11 +382,11 @@
> >>  	as well as its cache so as to ensure that the software sees
> >>  	the memory operations performed by this CPU as if they
> >>  	were carried out in program order.
> >> -\item[Super-Scalar CPU:]
> >> +\item[Superscalar CPU:]
> >>  	A scalar (non-vector) CPU capable of executing multiple instructions
> >>  	concurrently.
> >>  	This is a step up from a pipelined CPU that executes multiple
> >> -	instructions in an assembly-line fashion---in a super-scalar
> >> +	instructions in an assembly-line fashion---in a superscalar
> >>  	CPU, each stage of the pipeline would be capable of handling
> >>  	more than one instruction.
> >>  	For example, if the conditions were exactly right,
> >> --
> >> 2.17.1
> >>
>