On 2019/12/10 9:08, Paul E. McKenney wrote:
> On Tue, Dec 10, 2019 at 07:11:10AM +0900, Akira Yokosawa wrote:
>> On 2019/12/10 3:06, Paul E. McKenney wrote:
>>> On Mon, Dec 09, 2019 at 09:50:56PM +0900, Akira Yokosawa wrote:
>>>> On Sun, 8 Dec 2019 10:11:20 -0800, Paul E. McKenney wrote:
>>>>> On Mon, Dec 09, 2019 at 12:54:46AM +0900, Akira Yokosawa wrote:
>>>>>> On Sat, 7 Dec 2019 17:15:50 -0800, Paul E. McKenney wrote:
>>>>>>> On Sun, Dec 08, 2019 at 08:41:45AM +0900, Akira Yokosawa wrote:
>>>>>>>> On Sat, 7 Dec 2019 08:43:08 -0800, Paul E. McKenney wrote:
>>>>>>>>> On Sat, Dec 07, 2019 at 01:05:17PM +0900, Akira Yokosawa wrote:
>>>>>>>>>> Hi Paul,
>>>>>>>>>>
>>>>>>>>>> This patch set fixes minor issues I noticed while reading your
>>>>>>>>>> recent updates.
>>>>>>>>>
>>>>>>>>> Queued and pushed, along with a fix to another of my typos, thank
>>>>>>>>> you very much!
>>>>>>>>>
>>>>>>>>>> Apart from those changes, I'd like you to mention in the answer
>>>>>>>>>> to Quick Quiz 9.43 that modern Intel CPUs don't execute x86_64
>>>>>>>>>> instructions directly, but decode them into uOPs (via MOP) and
>>>>>>>>>> keep them in a uOP cache [1].
>>>>>>>>>> So the execution cycle count does not necessarily correspond to
>>>>>>>>>> the instruction count, but depends heavily on the behavior of
>>>>>>>>>> the microarchitecture, which is not predictable without actually
>>>>>>>>>> running the code.
>>>>>>>>>>
>>>>>>>>>> [1]: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)
>>>>>>>>>
>>>>>>>>> My thought is that I should review the "Hardware and its Habits"
>>>>>>>>> chapter, add this information if it is not already present, and
>>>>>>>>> then make the answer to this Quick Quiz refer back to that.  Does
>>>>>>>>> that seem reasonable?
>>>>>>>>
>>>>>>>> Yes, it sounds quite reasonable!
>>>>>>>>
>>>>>>>> (Skimming through the chapter...)
>>>>>>>>
>>>>>>>> So Section 3.1.1 lightly touches on pipelining.  Section 3.2 mostly
>>>>>>>> discusses memory sub-systems.
>>>>>>>>
>>>>>>>> Modern Intel architectures can be thought of as superscalar RISC
>>>>>>>> processors which emulate the x86 ISA.  The transformation of x86
>>>>>>>> instructions into uOPs can be thought of as another layer of
>>>>>>>> optimization (sometimes "de-optimization" from a compiler
>>>>>>>> writer's POV) ;-).
>>>>>>>>
>>>>>>>> But deep-diving into this topic would cost you another
>>>>>>>> chapter/appendix.  I'm not sure it would be worthwhile for
>>>>>>>> perfbook.  Maybe it would suffice to lightly touch on the
>>>>>>>> difficulty of predicting the execution cycles of particular
>>>>>>>> instruction streams on modern microprocessors (not limited to
>>>>>>>> Intel's), and to add a few citations of textbooks/reference
>>>>>>>> manuals.
>>>>>>>
>>>>>>> What I did was to add a rough diagram and a paragraph or two of
>>>>>>> explanation to Section 3.1.1, then add a reference to that section
>>>>>>> in the Quick Quiz.
>>>>>>
>>>>>> I'd like to see a couple more keywords mentioned here other than
>>>>>> "pipeline".  "Super-scalar" is present in the Glossary, but
>>>>>> "superscalar" looks much more common these days.  Appended below is
>>>>>> a tentative patch showing my idea.  Please feel free to edit it as
>>>>>> you'd like before applying.
>>>>>>
>>>>>> Another point I'd like to suggest:
>>>>>> Figure 9.23 and the following figures still show results from a
>>>>>> 16-CPU system.  It looks difficult to make corresponding plots for
>>>>>> the 448-thread system.  Can you add info on the HW system where
>>>>>> those 16-CPU results were obtained at the beginning of
>>>>>> Section 9.5.4.2?
>>>>>>
>>>>>> Thanks, Akira
>>>>>>
>>>>>> -------------8<-------------------
>>>>>> From 7e5c17a2e816a23bb12aa23f18ed96d5e820fc3a Mon Sep 17 00:00:00 2001
>>>>>> From: Akira Yokosawa <akiyks@xxxxxxxxx>
>>>>>> Date: Mon, 9 Dec 2019 00:23:59 +0900
>>>>>> Subject: [TENTATIVE PATCH] cpu/overview: Mention approaches other than 'pipelining'
>>>>>>
>>>>>> Also remove "-" from "Super-scalar" in Glossary.
>>>>>>
>>>>>> Signed-off-by: Akira Yokosawa <akiyks@xxxxxxxxx>
>>>>>
>>>>> Good points, thank you!
>>>>>
>>>>> Applied with a few inevitable edits. ;-)
>>>>
>>>> Quite a few edits!  Thank you.
>>>>
>>>> Let me reiterate my earlier suggestion:
>>>>
>>>>>> Another point I'd like to suggest:
>>>>>> Figure 9.23 and the following figures still show results from a
>>>>>> 16-CPU system.  It looks difficult to make corresponding plots for
>>>>>> the 448-thread system.  Can you add info on the HW system where
>>>>>> those 16-CPU results were obtained at the beginning of
>>>>>> Section 9.5.4.2?
>>>>
>>>> Can you look into this as well?
>>>
>>> There are a few build issues, but the main problem has been that I have
>>> needed to use that system to verify Linux-kernel fixes.  The intent
>>> is to regenerate most and maybe all of the results on the large system
>>> over time.
>>
>> I said "difficult" because of the counterintuitive variation in cycles
>> that you encountered due to the additional "lea" instruction.
>> You will need to eliminate such variations to evaluate the cost of
>> RCU, I suppose.
>> It looks like Intel processors are sensitive to the alignment of
>> branch targets.
>> (I think you know the matter better than I do, but I couldn't help
>> mentioning it.)  For example:
>> https://stackoverflow.com/questions/18113995/
>
> It does indeed get complicated. ;-)
>
> Another experiment on the todo list is to move the rcu_head structure
> to the end, which should eliminate that extra lea instruction.  I am
> planning to add that to the answer to the more-than-ideal quick quiz.

That sounds quite reasonable.

>
>>> But I added the system's info in the meantime. ;-)
>>
>> Which generation of Intel x86 system was it?
>
> I don't know, as that was before I got smart and started capturing
> /proc/cpuinfo.  It was quite old, probably produced in 2010 or so.
> Maybe even earlier.

(Digging up the git history...)  Yes, this plot has existed ever since
the first commit of perfbook.
And I won't blame you if you don't remember exactly what type of machine
you ran the performance tests on.  x86 in 2008 means it was pre-Nehalem,
doesn't it?  There remains a table of data obtained on Nehalem in 2009,
which was added in commit 38fd945ff401 ("Fill out CPU chapter, including
adding Nehalem data.").

>
> Which is another good reason to rerun those results, but I don't see
> this as blocking the release.

Agreed.

Thanks, Akira

>
> Thanx, Paul
>

[...]