Re: [PATCH 0/2] Minor updates

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Mon, 9 Dec 2019 16:08:12 -0800

On Tue, Dec 10, 2019 at 07:11:10AM +0900, Akira Yokosawa wrote:
> On 2019/12/10 3:06, Paul E. McKenney wrote:
> > On Mon, Dec 09, 2019 at 09:50:56PM +0900, Akira Yokosawa wrote:
> >> On Sun, 8 Dec 2019 10:11:20 -0800, Paul E. McKenney wrote:
> >>> On Mon, Dec 09, 2019 at 12:54:46AM +0900, Akira Yokosawa wrote:
> >>>> On Sat, 7 Dec 2019 17:15:50 -0800, Paul E. McKenney wrote:
> >>>>> On Sun, Dec 08, 2019 at 08:41:45AM +0900, Akira Yokosawa wrote:
> >>>>>> On Sat, 7 Dec 2019 08:43:08 -0800, Paul E. McKenney wrote:
> >>>>>>> On Sat, Dec 07, 2019 at 01:05:17PM +0900, Akira Yokosawa wrote:
> >>>>>>>> Hi Paul,
> >>>>>>>>
> >>>>>>>> This patch set fixes minor issues I noticed while reading your
> >>>>>>>> recent updates.
> >>>>>>>
> >>>>>>> Queued and pushed, along with a fix to another of my typos, thank
> >>>>>>> you very much!
> >>>>>>>
> >>>>>>>> Apart from the changes, I'd like you to mention in the answer to
> >>>>>>>> Quick Quiz 9.43 that modern Intel CPUs don't execute x86_64
> >>>>>>>> instructions directly, but decode them into uOPs (via MOP) and
> >>>>>>>> keep them in a uOP cache [1].
> >>>>>>>> So the execution cycle count does not necessarily correspond to the
> >>>>>>>> instruction count, but depends heavily on the behavior of the microarch,
> >>>>>>>> which is not predictable without actually running the code.
> >>>>>>>>
> >>>>>>>> [1]: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)
> >>>>>>>
> >>>>>>> My thought is that I should review the "Hardware and its Habits" chapter,
> >>>>>>> add this information if it is not already present, and then make the
> >>>>>>> answer to this Quick Quiz refer back to that.  Does that seem reasonable?
> >>>>>>
> >>>>>> Yes, it sounds quite reasonable!
> >>>>>>
> >>>>>> (Skimming through the chapter...)
> >>>>>>
> >>>>>> So Section 3.1.1 lightly touches pipelining. Section 3.2 mostly discusses
> >>>>>> memory sub-systems.
> >>>>>>
> >>>>>> Modern Intel architectures can be thought of as superscalar RISC
> >>>>>> processors which emulate x86 ISA. The transformation of x86 instructions
> >>>>>> into uOPs can be thought of as another layer of optimization
> >>>>>> (sometimes "de-optimization" from compiler writer's POV) ;-).
> >>>>>>
> >>>>>> But deep-diving this topic would cost you another chapter/appendix.
> >>>>>> I'm not sure if it's worthwhile for perfbook.
> >>>>>> Maybe it would suffice to lightly touch the difficulty of
> >>>>>> predicting execution cycles of particular instruction streams
> >>>>>> on modern microprocessors (not limited to Intel's), and put
> >>>>>> a few citations of textbooks/reference manuals.
> >>>>>
> >>>>> What I did was to add a rough diagram and a paragraph or two of
> >>>>> explanation to Section 3.1.1, then add a reference to that section
> >>>>> in the Quick Quiz.
> >>>>
> >>>> I'd like to see a couple more keywords mentioned here besides
> >>>> "pipeline".  "Super-scalar" is present in the Glossary, but
> >>>> "Superscalar" looks much more common these days. Appended below is
> >>>> a tentative patch I made to show you my idea. Please feel free
> >>>> to edit as you'd like before applying it.
> >>>>
> >>>> Another point I'd like to suggest.
> >>>> Figure 9.23 and the following figures still show the result on
> >>>> a 16 CPU system. Looks like it is difficult to make plots of 448-thread
> >>>> system corresponding to them. Can you add info on the HW system
> >>>> where those 16 CPU results were obtained in the beginning of
> >>>> Section 9.5.4.2?
> >>>>
> >>>>         Thanks, Akira
> >>>>
> >>>> -------------8<-------------------
> >>>> From 7e5c17a2e816a23bb12aa23f18ed96d5e820fc3a Mon Sep 17 00:00:00 2001
> >>>> From: Akira Yokosawa <akiyks@xxxxxxxxx>
> >>>> Date: Mon, 9 Dec 2019 00:23:59 +0900
> >>>> Subject: [TENTATIVE PATCH] cpu/overview: Mention approaches other than 'pipelining'
> >>>>
> >>>> Also remove "-" from "Super-Scalar" in Glossary.
> >>>>
> >>>> Signed-off-by: Akira Yokosawa <akiyks@xxxxxxxxx>
> >>>
> >>> Good points, thank you!
> >>>
> >>> Applied with a few inevitable edits.  ;-)
> >>
> >> Quite a few edits!  Thank you.
> >>
> >> Let me reiterate my earlier suggestion:
> >>
> >>>> Another point I'd like to suggest.
> >>>> Figure 9.23 and the following figures still show the result on
> >>>> a 16 CPU system. Looks like it is difficult to make plots of 448-thread
> >>>> system corresponding to them. Can you add info on the HW system
> >>>> where those 16 CPU results were obtained in the beginning of
> >>>> Section 9.5.4.2?
> >>
> >> Can you look into this as well?
> > 
> > There are a few build issues, but the main problem has been that I have
> > needed to use that system to verify Linux-kernel fixes.  The intent
> > is to regenerate most and maybe all of the results on the large system
> > over time.
> 
> I said "difficult" because of the counterintuitive variation in cycle
> counts that you encountered with the additional "lea" instruction.
> You will need to eliminate such variations to evaluate the cost
> of RCU, I suppose.
> Looks like Intel processors are sensitive to alignment of branch targets.
> (I think you know the matter better than I do, but I could not help mentioning it.)
> For example: https://stackoverflow.com/questions/18113995/

It does indeed get complicated.  ;-)

Another experiment on the todo list is to move the rcu_head structure to
the end, which should eliminate that extra lea instruction.  I am planning
to introduce that to the answer to the more-than-ideal quick quiz.
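
To make the layout point concrete, here is a minimal sketch that uses
Python's ctypes purely as a calculator for C structure offsets.  The
structure and field names (RouteEntryRcuFirst, RouteEntryRcuLast, re_next,
and so on) are hypothetical stand-ins rather than the actual perfbook code,
and the idea that the extra lea comes from the list linkage sitting at a
nonzero offset is an assumption that the planned experiment would need
to confirm.

```python
import ctypes


class RcuHead(ctypes.Structure):
    # Hypothetical stand-in for struct rcu_head: two pointer-sized fields.
    _fields_ = [("next", ctypes.c_void_p), ("func", ctypes.c_void_p)]


class ListHead(ctypes.Structure):
    # Hypothetical stand-in for a doubly linked list head.
    _fields_ = [("next", ctypes.c_void_p), ("prev", ctypes.c_void_p)]


class RouteEntryRcuFirst(ctypes.Structure):
    # Layout with the rcu_head first, so the list linkage is at a nonzero offset.
    _fields_ = [
        ("rh", RcuHead),
        ("re_next", ListHead),
        ("addr", ctypes.c_ulong),
        ("iface", ctypes.c_ulong),
    ]


class RouteEntryRcuLast(ctypes.Structure):
    # Layout with the rcu_head moved to the end, so the list linkage is at offset 0.
    _fields_ = [
        ("re_next", ListHead),
        ("addr", ctypes.c_ulong),
        ("iface", ctypes.c_ulong),
        ("rh", RcuHead),
    ]


def container_adjustment(struct_type, member):
    """Bytes to subtract from a pointer to `member` to reach the enclosing struct."""
    return getattr(struct_type, member).offset


if __name__ == "__main__":
    for layout in (RouteEntryRcuFirst, RouteEntryRcuLast):
        print(layout.__name__, "re_next offset:",
              container_adjustment(layout, "re_next"))
```

With the linkage at offset zero, a container_of()-style conversion from
list pointer to enclosing structure needs no address arithmetic, which is
one plausible way for an lea to disappear.  Whether the compiler actually
emits one fewer instruction still has to be checked in the generated code.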

> > But I added the system's info in the meantime.  ;-)
> 
> Which generation of Intel x86 system was it?

I don't know, as that was before I got smart and started capturing
/proc/cpuinfo.  It was quite old, probably produced in 2010 or so.
Maybe even earlier.

Which is another good reason to rerun those results, but I don't see
this as blocking the release.
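
As for capturing /proc/cpuinfo alongside results, it need not be elaborate.
Here is a minimal sketch, assuming the x86 Linux layout of blank-line-separated
stanzas of "key : value" lines.  The function names are made up for
illustration, and other architectures use different keys, so "model name"
might be absent there.

```python
def parse_cpuinfo(text):
    """Return one dict per processor stanza in /proc/cpuinfo-style text."""
    cpus = []
    for stanza in text.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            key, sep, value = line.partition(":")
            if sep:  # Skip lines that lack a "key : value" shape.
                fields[key.strip()] = value.strip()
        if fields:
            cpus.append(fields)
    return cpus


def describe_system(path="/proc/cpuinfo"):
    """Summarize CPU count and distinct model names for a results log."""
    with open(path) as f:
        cpus = parse_cpuinfo(f.read())
    models = sorted({cpu.get("model name", "unknown") for cpu in cpus})
    return {"cpus": len(cpus), "models": models}
```

Calling describe_system() at the start of a benchmark run and writing the
returned dictionary next to the raw data would be enough to answer the
"which generation?" question later.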

							Thanx, Paul

>         Thanks, Akira
> 
> > 
> > 							Thanx, Paul
> > 
> >>         Thanks, Akira
> >>
> >>>
> >>> 							Thanx, Paul
> >>>
> >>>> ---
> >>>>  cpu/overview.tex   | 22 ++++++++++++++--------
> >>>>  defer/rcuusage.tex |  2 +-
> >>>>  glossary.tex       |  4 ++--
> >>>>  3 files changed, 17 insertions(+), 11 deletions(-)
> >>>>
> >>>> diff --git a/cpu/overview.tex b/cpu/overview.tex
> >>>> index b80f47c1..191c1c68 100644
> >>>> --- a/cpu/overview.tex
> >>>> +++ b/cpu/overview.tex
> >>>> @@ -42,11 +42,13 @@ In the early 1980s, the typical microprocessor fetched an instruction,
> >>>>  decoded it, and executed it, typically taking \emph{at least} three
> >>>>  clock cycles to complete one instruction before proceeding to the next.
> >>>>  In contrast, the CPU of the late 1990s and of the 2000s execute
> >>>> -many instructions simultaneously, using a deep \emph{pipeline} to control
> >>>> +many instructions simultaneously, using a combination of approaches
> >>>> +including \emph{pipeline}, \emph{superscalar}, \emph{out-of-order},
> >>>> +and \emph{speculative} execution, to control
> >>>>  the flow of instructions internally to the CPU.
> >>>>  Some cores have more than one hardware thread, which is variously called
> >>>>  \emph{simultaneous multithreading} (SMT) or \emph{hyperthreading}
> >>>> -(HT)~\cite{JFennel1973SMT}.
> >>>> +(HT)~\cite{JFennel1973SMT},
> >>>>  each of which appears as
> >>>>  an independent CPU to software, at least from a functional viewpoint.
> >>>>  These modern hardware features can greatly improve performance, as
> >>>> @@ -96,14 +98,17 @@ Figure~\ref{fig:cpu:CPU Meets a Pipeline Flush}.
> >>>>  \end{figure}
> >>>>  
> >>>>  This gets even worse in the increasingly common case of hyperthreading
> >>>> -(or SMT, if you prefer).
> >>>> +(or SMT, if you prefer).\footnote{
> >>>> +	Superscalar is involved in most cases, too.
> >>>> +}
> >>>>  In this case, all the hardware threads sharing a core also share that
> >>>>  core's resources, including registers, cache, execution units, and so on.
> >>>> -The instruction streams are decoded into micro-operations, and use of the
> >>>> -shared execution units and the hundreds of hardware registers is coordinated
> >>>> +The instruction streams might be decoded into micro-operations,
> >>>> +and use of the shared execution units and the hundreds of hardware
> >>>> +registers can be coordinated
> >>>>  by a micro-operation scheduler.
> >>>> -A rough diagram of a two-threaded core is shown in
> >>>> -\Cref{fig:cpu:Rough View of Modern Micro-Architecture},
> >>>> +A rough diagram of such a two-threaded core is shown in
> >>>> +\cref{fig:cpu:Rough View of Modern Micro-Architecture},
> >>>>  and more accurate (and thus more complex) diagrams are available in
> >>>>  textbooks and scholarly papers.\footnote{
> >>>>  	Here is one example for a late-2010s Intel CPU:
> >>>> @@ -123,7 +128,8 @@ of clairvoyance.
> >>>>  In particular, adding an instruction to a tight loop can sometimes
> >>>>  actually speed up execution, counterintuitive though that might be.
> >>>>  
> >>>> -Unfortunately, pipeline flushes are not the only hazards in the obstacle
> >>>> +Unfortunately, pipeline flushes and shared-resource contentions
> >>>> +are not the only hazards in the obstacle
> >>>>  course that modern CPUs must run.
> >>>>  The next section covers the hazards of referencing memory.
> >>>>  
> >>>> diff --git a/defer/rcuusage.tex b/defer/rcuusage.tex
> >>>> index 7fe633c3..fa04ddb6 100644
> >>>> --- a/defer/rcuusage.tex
> >>>> +++ b/defer/rcuusage.tex
> >>>> @@ -139,7 +139,7 @@ that of the ideal synchronization-free workload.
> >>>>  	are long gone.
> >>>>  
> >>>>  	But those of you who read
> >>>> -	\Cref{sec:cpu:Pipelined CPUs}
> >>>> +	\cref{sec:cpu:Pipelined CPUs}
> >>>>  	carefully already knew all of this!
> >>>>  
> >>>>  	These counter-intuitive results of course means that any
> >>>> diff --git a/glossary.tex b/glossary.tex
> >>>> index c10ffe4e..4a3aa796 100644
> >>>> --- a/glossary.tex
> >>>> +++ b/glossary.tex
> >>>> @@ -382,11 +382,11 @@
> >>>>  	as well as its cache so as to ensure that the software sees
> >>>>  	the memory operations performed by this CPU as if they
> >>>>  	were carried out in program order.
> >>>> -\item[Super-Scalar CPU:]
> >>>> +\item[Superscalar CPU:]
> >>>>  	A scalar (non-vector) CPU capable of executing multiple instructions
> >>>>  	concurrently.
> >>>>  	This is a step up from a pipelined CPU that executes multiple
> >>>> -	instructions in an assembly-line fashion---in a super-scalar
> >>>> +	instructions in an assembly-line fashion---in a superscalar
> >>>>  	CPU, each stage of the pipeline would be capable of handling
> >>>>  	more than one instruction.
> >>>>  	For example, if the conditions were exactly right,
> >>>> -- 
> >>>> 2.17.1
> >>>>
> >>
>