On Mon, Apr 18, 2022 at 07:37:21AM +0000, Hao Lee wrote:
> On Sun, Apr 17, 2022 at 10:34:06AM -0700, Paul E. McKenney wrote:
> > On Sun, Apr 17, 2022 at 11:17:26AM +0000, Hao Lee wrote:
> > > Hello,
> > >
> > > I think maybe we can make the following contents more clear:
> >
> > Too true, and thank you for spotting this!
> >
> > > Cite from Appendix C.4:
> > >
> > >     when a given CPU executes a memory barrier, it marks all the
> > >     entries currently in its invalidate queue, and forces any
> > >     subsequent load to wait until all marked entries have been
> > >     applied to the CPU’s cache.
> > >
> > > It's obvious that this paragraph means read barrier can flush invalidate
> > > queue.
> >
> > True, it -could- flush the invalidate queue. Or it could just force later
> > reads to wait until the invalidate queue drains of its own accord, which
> > is what is actually described in the above passage. Or it could implement
> > a large number of possible strategies in between these two extremes.
>
> This is quite interesting. Thanks.
>
> >
> > The key point is that C.4 is describing implementation. And implementation
> > of full memory barriers.
> >
> > > Cite from Appendix C.5:
> > >
> > >     The effect of this is that a read memory barrier orders only
> > >     loads on the CPU that executes it, so that all loads preceding
> > >     the read memory barrier will appear to have completed before any
> > >     load following the read memory barrier.
> > >
> > > This paragraph means read barrier can prevent Load-Load memory
> > > reordering which is caused by out-of-order execution.
> >
> > This passage describes the software-visible effects of whatever
> > implementation is actually used for a given system.
>
> This explanation makes sense to me. Thanks.
>
> > Another passage in
> > the preceding paragraph describes what is happening at the implementations
> > level.
> >
> > > If I understand correctly, read memory barrier has _two functions_, one
> > > is flushing invalidate queue to make the loads following the barrier can
> > > load the latest value, and the other is stalling instruction pipeline to
> > > prevent Load-Load memory reordering. I think these are two completely
> > > different functions and we should make such a summary in the book.
> >
> > I would instead say that there are two different ways that memory barriers
> > can interact with invalidate queues. And there are two different
> > levels of abstraction, hardware implementation (buffers and queues)
> > and software-visible effect (ordering).
> >
> > I queued the commit shown below. Thoughts?
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > commit 1389b9da9760040276f8c53215aaa96d964a0892
> > Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > Date: Sun Apr 17 10:32:19 2022 -0700
> >
> >     appendix/whymb: Clarify memory-barrier operation
> >
> >     Reported-by: Hao Lee <haolee.swjtu@xxxxxxxxx>
> >     Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> >
> > diff --git a/appendix/whymb/whymemorybarriers.tex b/appendix/whymb/whymemorybarriers.tex
> > index 8d58483f..8f607e35 100644
> > --- a/appendix/whymb/whymemorybarriers.tex
> > +++ b/appendix/whymb/whymemorybarriers.tex
> > @@ -1233,33 +1233,76 @@ With this change, the sequence of operations might be as follows:
> >  With much passing of MESI messages, the CPUs arrive at the correct answer.
> >  This section illustrates why CPU designers must be extremely careful
> >  with their cache-coherence optimizations.
> > +The key requirement is that the memory barriers provide the appearance
> > +of ordering to the software.
> > +As long as these appearances are maintained, the hardware can carry
> > +out whatever queueing, buffering, marking, stallings, and flushing
> > +optimizations it likes.
>
> I still have a question here.
> For the following example cited from
> C.4.3, we know bar() could see the stale value of "a", which is 0. But
> I'm curious why we regard "reading a stale value" as "an appearance of
> reordering". It seems that the two terms are not the same concept.

They are indeed different concepts, but the software cannot distinguish them.

> void foo(void)
> {
>     a = 1;
>     smp_mb();
>     b = 1;
> }
>
> void bar(void)
> {
>     while (b == 0) continue;
>     assert(a == 1);
> }

Did the bar() function's loads from b and a get reordered? Or did the
bar() function's load from a return a stale value? The bar() function
cannot tell the difference.

Does that help, or am I missing your point?

> > -\section{Read and Write Memory Barriers}
> > -\label{sec:app:whymb:Read and Write Memory Barriers}
> > +\QuickQuiz{
> > +	Instead of all of this marking of invalidation-queue entries
> > +	and stalling of loads, why not simply force an immediate flush
> > +	of the invalidation queue?
> > +}\QuickQuizAnswer{
> > +	An immediate flush of the invalidation queue would do the trick.
> > +	Except that the common-case super-scalar CPU is executing many
> > +	instructions at once, and not necessarily even in the expected
> > +	order.
> > +	So what would ``immediate'' even mean?
> > +	The answer is clearly ``not much''.
> > +
> > +	Nevertheless, for simpler CPUs that execute instructions serially,
> > +	flushing the invalidation queue might be a reasonable implementation
> > +	strategy.
> > +}\QuickQuizEnd
> > +
> > +\section{Read and Write Memory Barriers} \label{sec:app:whymb:Read and
> > +Write Memory Barriers}
> >
> > -In the previous section, memory barriers were used to mark entries in
> > -both the store buffer and the invalidate queue.
> > -But in our code fragment, \co{foo()} had no reason to do anything
> > -with the invalidate queue, and \co{bar()} similarly had no reason
> > -to do anything with the store buffer.
> > +In the previous section, memory barriers were used to mark entries in both
> > +the store buffer and the invalidate queue.
> > +But in our code fragment, \co{foo()} had no reason to do anything with the
> > +invalidate queue, and \co{bar()} similarly had no reason to do anything
> > +with the store buffer.
> >
> >  Many CPU architectures therefore provide weaker memory-barrier
> >  instructions that do only one or the other of these two.
> >  Roughly speaking, a ``read memory barrier'' marks only the invalidate
> > -queue and a ``write memory barrier'' marks only the store buffer,
> > -while a full-fledged memory barrier does both.
> > -
> > -The effect of this is that a read memory barrier orders only loads
> > -on the CPU that executes it, so that all loads preceding the read memory
> > -barrier will appear to have completed before any load following the
> > -read memory barrier.
> > -Similarly, a write memory barrier orders
> > -only stores, again on the CPU that executes it, and again so that
> > -all stores preceding the write memory barrier will appear to have
> > -completed before any store following the write memory barrier.
> > +queue (and snoops entries in the store buffer) and a ``write memory
> > +barrier'' marks only the store buffer, while a full-fledged memory
> > +barrier does all of the above.
> > +
> > +The software-visible effect of these hardware mechanisms is that a read
> > +memory barrier orders only loads on the CPU that executes it, so that
> > +all loads preceding the read memory barrier will appear to have completed
> > +before any load following the read memory barrier.
> > +Similarly, a write memory barrier orders only stores, again on the
> > +CPU that executes it, and again so that all stores preceding the write
> > +memory barrier will appear to have completed before any store following
> > +the write memory barrier.
> >  A full-fledged memory barrier orders both loads and stores, but again
> >  only on the CPU executing the memory barrier.
> >
> > +\QuickQuiz{
> > +	But can't full memory barriers impose global ordering?
> > +	After all, isn't that needed to provide the ordering
> > +	shown in \cref{lst:formal:IRIW Litmus Test}?
> > +}\QuickQuizAnswer{
> > +	Sort of.
> > +
> > +	Note well that this litmus test has not one but two full
> > +	memory-barrier instructions, namely the two \co{sync} instructions
> > +	executed by \co{P2} and \co{P3}.
> > +
> > +	It is the interaction of those two instructions that provides
> > +	the global ordering, not just their individual execution.
> > +	For example, each of those two \co{sync} instructions might stall
> > +	waiting for all CPUs to process their invalidation queues before
> > +	allowing subsequent instructions to execute.\footnote{
> > +		Real-life hardware of course applies many optimizations
> > +		to minimize the resulting stalls.}
> > +}\QuickQuizEnd
> > +
> >  If we update \co{foo} and \co{bar} to use read and write memory
> >  barriers, they appear as follows:
> >
>
> Other changes look good to me.

Very good, and thank you for looking them over!

Thanx, Paul