Re: Clarify what the read memory barrier really does

On Tue, Apr 19, 2022 at 10:31:25AM -0700, Paul E. McKenney wrote:
> On Mon, Apr 18, 2022 at 07:37:21AM +0000, Hao Lee wrote:
> > On Sun, Apr 17, 2022 at 10:34:06AM -0700, Paul E. McKenney wrote:
> > > On Sun, Apr 17, 2022 at 11:17:26AM +0000, Hao Lee wrote:
> > > > Hello,
> > > > 
> > > > I think maybe we can make the following content clearer:
> > > 
> > > Too true, and thank you for spotting this!
> > > 
> > > > Cite from Appendix C.4:
> > > > 
> > > > 	when a given CPU executes a memory barrier, it marks all the
> > > > 	entries currently in its invalidate queue, and forces any
> > > > 	subsequent load to wait until all marked entries have been
> > > > 	applied to the CPU’s cache.
> > > > 
> > > > This paragraph clearly implies that a read barrier can flush the
> > > > invalidate queue.
> > > 
> > > True, it -could- flush the invalidate queue.  Or it could just force later
> > > reads to wait until the invalidate queue drains of its own accord, which
> > > is what is actually described in the above passage.  Or it could implement
> > > a large number of possible strategies in between these two extremes.
> > 
> > This is quite interesting. Thanks.
> > 
> > > 
> > > The key point is that C.4 is describing implementation.  And implementation
> > > of full memory barriers.
> > > 
> > > > Cite from Appendix C.5:
> > > > 
> > > > 	The effect of this is that a read memory barrier orders only
> > > > 	loads on the CPU that executes it, so that all loads preceding
> > > > 	the read memory barrier will appear to have completed before any
> > > > 	load following the read memory barrier.
> > > > 
> > > > This paragraph means a read barrier can prevent the Load-Load
> > > > reordering caused by out-of-order execution.
> > > 
> > > This passage describes the software-visible effects of whatever
> > > implementation is actually used for a given system. 
> > 
> > This explanation makes sense to me. Thanks.
> > 
> > > Another passage in
> > > the preceding paragraph describes what is happening at the
> > > implementation level.
> > > 
> > > > If I understand correctly, a read memory barrier has _two functions_:
> > > > one is flushing the invalidate queue so that loads following the
> > > > barrier observe the latest values, and the other is stalling the
> > > > instruction pipeline to prevent Load-Load memory reordering. I think
> > > > these are two completely different functions and we should make such
> > > > a summary in the book.
> > > 
> > > I would instead say that there are two different ways that memory barriers
> > > can interact with invalidate queues.  And there are two different
> > > levels of abstraction, hardware implementation (buffers and queues)
> > > and software-visible effect (ordering).
> > > 
> > > I queued the commit shown below.  Thoughts?
> > > 
> > > 							Thanx, Paul
> > > 
> > > ------------------------------------------------------------------------
> > > 
> > > commit 1389b9da9760040276f8c53215aaa96d964a0892
> > > Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > > Date:   Sun Apr 17 10:32:19 2022 -0700
> > > 
> > >     appendix/whymb: Clarify memory-barrier operation
> > >     
> > >     Reported-by: Hao Lee <haolee.swjtu@xxxxxxxxx>
> > >     Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > > 
> > > diff --git a/appendix/whymb/whymemorybarriers.tex b/appendix/whymb/whymemorybarriers.tex
> > > index 8d58483f..8f607e35 100644
> > > --- a/appendix/whymb/whymemorybarriers.tex
> > > +++ b/appendix/whymb/whymemorybarriers.tex
> > > @@ -1233,33 +1233,76 @@ With this change, the sequence of operations might be as follows:
> > >  With much passing of MESI messages, the CPUs arrive at the correct answer.
> > >  This section illustrates why CPU designers must be extremely careful
> > >  with their cache-coherence optimizations.
> > > +The key requirement is that the memory barriers provide the appearance
> > > +of ordering to the software.
> > > +As long as these appearances are maintained, the hardware can carry
> > > +out whatever queueing, buffering, marking, stalling, and flushing
> > > +optimizations it likes.
> > 
> > I still have a question here. For the following example cited from
> > C.4.3, we know bar() could see the stale value of "a", which is 0. But
> > I'm curious why we regard "reading a stale value" as "an appearance of
> > reordering". It seems that the two terms are not the same concept.
> 
> They are indeed different concepts, but the software cannot distinguish
> them.

Got it!

> 
> > void foo(void)
> > {
> > 	a = 1;
> > 	smp_mb();
> > 	b = 1;
> > }
> > 
> > void bar(void)
> > {
> > 	while (b == 0) continue;
> > 	assert(a == 1);
> > }
> 
> Did the bar() function's loads from b and a get reordered?
> Or did the bar() function's load from a return a stale value?
> 
> The bar() function cannot tell the difference.

Ah, this is exactly what I wanted!
I once thought of this explanation, but I wasn't sure. Thanks for
confirming it!

Thanks,
Hao Lee

> 
> Does that help, or am I missing your point?
> 
> > > -\section{Read and Write Memory Barriers}
> > > -\label{sec:app:whymb:Read and Write Memory Barriers}
> > > +\QuickQuiz{
> > > +	Instead of all of this marking of invalidation-queue entries
> > > +	and stalling of loads, why not simply force an immediate flush
> > > +	of the invalidation queue?
> > > +}\QuickQuizAnswer{
> > > +	An immediate flush of the invalidation queue would do the trick.
> > > +	Except that the common-case super-scalar CPU is executing many
> > > +	instructions at once, and not necessarily even in the expected
> > > +	order.
> > > +	So what would ``immediate'' even mean?
> > > +	The answer is clearly ``not much''.
> > > +
> > > +	Nevertheless, for simpler CPUs that execute instructions serially,
> > > +	flushing the invalidation queue might be a reasonable implementation
> > > +	strategy.
> > > +}\QuickQuizEnd
> > > +
> > > +\section{Read and Write Memory Barriers}
> > > +\label{sec:app:whymb:Read and Write Memory Barriers}
> > >  
> > > -In the previous section, memory barriers were used to mark entries in
> > > -both the store buffer and the invalidate queue.
> > > -But in our code fragment, \co{foo()} had no reason to do anything
> > > -with the invalidate queue, and \co{bar()} similarly had no reason
> > > -to do anything with the store buffer.
> > > +In the previous section, memory barriers were used to mark entries in both
> > > +the store buffer and the invalidate queue.
> > > +But in our code fragment, \co{foo()} had no reason to do anything with the
> > > +invalidate queue, and \co{bar()} similarly had no reason to do anything
> > > +with the store buffer.
> > >  
> > >  Many CPU architectures therefore provide weaker memory-barrier
> > >  instructions that do only one or the other of these two.
> > >  Roughly speaking, a ``read memory barrier'' marks only the invalidate
> > > -queue and a ``write memory barrier'' marks only the store buffer,
> > > -while a full-fledged memory barrier does both.
> > > -
> > > -The effect of this is that a read memory barrier orders only loads
> > > -on the CPU that executes it, so that all loads preceding the read memory
> > > -barrier will appear to have completed before any load following the
> > > -read memory barrier.
> > > -Similarly, a write memory barrier orders
> > > -only stores, again on the CPU that executes it, and again so that
> > > -all stores preceding the write memory barrier will appear to have
> > > -completed before any store following the write memory barrier.
> > > +queue (and snoops entries in the store buffer) and a ``write memory
> > > +barrier'' marks only the store buffer, while a full-fledged memory
> > > +barrier does all of the above.
> > > +
> > > +The software-visible effect of these hardware mechanisms is that a read
> > > +memory barrier orders only loads on the CPU that executes it, so that
> > > +all loads preceding the read memory barrier will appear to have completed
> > > +before any load following the read memory barrier.
> > > +Similarly, a write memory barrier orders only stores, again on the
> > > +CPU that executes it, and again so that all stores preceding the write
> > > +memory barrier will appear to have completed before any store following
> > > +the write memory barrier.
> > >  A full-fledged memory barrier orders both loads and stores, but again
> > >  only on the CPU executing the memory barrier.
> > >  
> > > +\QuickQuiz{
> > > +	But can't full memory barriers impose global ordering?
> > > +	After all, isn't that needed to provide the ordering
> > > +	shown in \cref{lst:formal:IRIW Litmus Test}?
> > > +}\QuickQuizAnswer{
> > > +	Sort of.
> > > +
> > > +	Note well that this litmus test has not one but two full
> > > +	memory-barrier instructions, namely the two \co{sync} instructions
> > > +	executed by \co{P2} and \co{P3}.
> > > +
> > > +	It is the interaction of those two instructions that provides
> > > +	the global ordering, not just their individual execution.
> > > +	For example, each of those two \co{sync} instructions might stall
> > > +	waiting for all CPUs to process their invalidation queues before
> > > +	allowing subsequent instructions to execute.\footnote{
> > > +		Real-life hardware of course applies many optimizations
> > > +		to minimize the resulting stalls.}
> > > +}\QuickQuizEnd
> > > +
> > >  If we update \co{foo} and \co{bar} to use read and write memory
> > >  barriers, they appear as follows:
> > >  
> > 
> > Other changes look good to me.
> 
> Very good, and thank you for looking them over!
> 
> 							Thanx, Paul


