Re: Clarify what the read memory barrier really does

"Paul E. McKenney" <paulmck@xxxxxxxxxx> · Wed, 20 Apr 2022 20:58:39 -0700

On Wed, Apr 20, 2022 at 06:57:29AM +0000, Hao Lee wrote:
> On Tue, Apr 19, 2022 at 10:31:25AM -0700, Paul E. McKenney wrote:
> > On Mon, Apr 18, 2022 at 07:37:21AM +0000, Hao Lee wrote:
> > > On Sun, Apr 17, 2022 at 10:34:06AM -0700, Paul E. McKenney wrote:
> > > > On Sun, Apr 17, 2022 at 11:17:26AM +0000, Hao Lee wrote:
> > > > > Hello,
> > > > > 
> > > > > I think maybe we can make the following contents clearer:
> > > > 
> > > > Too true, and thank you for spotting this!
> > > > 
> > > > > Cite from Appendix C.4:
> > > > > 
> > > > > 	when a given CPU executes a memory barrier, it marks all the
> > > > > 	entries currently in its invalidate queue, and forces any
> > > > > 	subsequent load to wait until all marked entries have been
> > > > > 	applied to the CPU’s cache.
> > > > > 
> > > > > It's obvious that this paragraph means a read barrier can flush the
> > > > > invalidate queue.
> > > > 
> > > > True, it -could- flush the invalidate queue.  Or it could just force later
> > > > reads to wait until the invalidate queue drains of its own accord, which
> > > > is what is actually described in the above passage.  Or it could implement
> > > > a large number of possible strategies in between these two extremes.
> > > 
> > > This is quite interesting. Thanks.
> > > 
> > > > 
> > > > The key point is that C.4 is describing implementation.  And implementation
> > > > of full memory barriers.
> > > > 
> > > > > Cite from Appendix C.5:
> > > > > 
> > > > > 	The effect of this is that a read memory barrier orders only
> > > > > 	loads on the CPU that executes it, so that all loads preceding
> > > > > 	the read memory barrier will appear to have completed before any
> > > > > 	load following the read memory barrier.
> > > > > 
> > > > > This paragraph means a read barrier can prevent the Load-Load memory
> > > > > reordering that is caused by out-of-order execution.
> > > > 
> > > > This passage describes the software-visible effects of whatever
> > > > implementation is actually used for a given system. 
> > > 
> > > This explanation makes sense to me. Thanks.
> > > 
> > > > Another passage in
> > > > the preceding paragraph describes what is happening at the implementation
> > > > level.
> > > > 
> > > > > If I understand correctly, a read memory barrier has _two functions_: one
> > > > > is flushing the invalidate queue so that loads following the barrier
> > > > > see the latest value, and the other is stalling the instruction pipeline
> > > > > to prevent Load-Load memory reordering. I think these are two completely
> > > > > different functions and we should make such a summary in the book.
> > > > 
> > > > I would instead say that there are two different ways that memory barriers
> > > > can interact with invalidate queues.  And there are two different
> > > > levels of abstraction, hardware implementation (buffers and queues)
> > > > and software-visible effect (ordering).
> > > > 
> > > > I queued the commit shown below.  Thoughts?
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > > > ------------------------------------------------------------------------
> > > > 
> > > > commit 1389b9da9760040276f8c53215aaa96d964a0892
> > > > Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > > > Date:   Sun Apr 17 10:32:19 2022 -0700
> > > > 
> > > >     appendix/whymb: Clarify memory-barrier operation
> > > >     
> > > >     Reported-by: Hao Lee <haolee.swjtu@xxxxxxxxx>
> > > >     Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > > > 
> > > > diff --git a/appendix/whymb/whymemorybarriers.tex b/appendix/whymb/whymemorybarriers.tex
> > > > index 8d58483f..8f607e35 100644
> > > > --- a/appendix/whymb/whymemorybarriers.tex
> > > > +++ b/appendix/whymb/whymemorybarriers.tex
> > > > @@ -1233,33 +1233,76 @@ With this change, the sequence of operations might be as follows:
> > > >  With much passing of MESI messages, the CPUs arrive at the correct answer.
> > > >  This section illustrates why CPU designers must be extremely careful
> > > >  with their cache-coherence optimizations.
> > > > +The key requirement is that the memory barriers provide the appearance
> > > > +of ordering to the software.
> > > > +As long as these appearances are maintained, the hardware can carry
> > > > +out whatever queueing, buffering, marking, stalling, and flushing
> > > > +optimizations it likes.
> > > 
> > > I still have a question here. For the following example cited from
> > > C.4.3, we know bar() could see the stale value of "a", which is 0. But
> > > I'm curious why we regard "reading a stale value" as "an appearance of
> > > reordering". It seems that the two terms are not the same concept.
> > 
> > They are indeed different concepts, but the software cannot distinguish
> > them.
> 
> Got it!
> 
> > 
> > > void foo(void)
> > > {
> > > 	a = 1;
> > > 	smp_mb();
> > > 	b = 1;
> > > }
> > > 
> > > void bar(void)
> > > {
> > > 	while (b == 0) continue;
> > > 	assert(a == 1);
> > > }
> > 
> > Did the bar() function's loads from b and a get reordered?
> > Or did the bar() function's load from a return a stale value?
> > 
> > The bar() function cannot tell the difference.
> 
> Ah, this is exactly what I wanted!
> I once thought of this explanation, but I wasn't sure. Thanks for
> confirming this!
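
For anyone who wants to try this in userspace, here is a minimal sketch
of the foo()/bar() pair using C11 atomics and POSIX threads.  It is an
illustration under stated assumptions, not the book's code: the seq_cst
fence stands in for smp_mb(), the acquire fence in bar() is only an
approximation of the smp_rmb() that bar() would need on the reader side,
and run_once() is a hypothetical helper name.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int a, b;

/* Writer: same shape as foo() in the thread. */
static void *foo(void *unused)
{
	(void)unused;
	atomic_store_explicit(&a, 1, memory_order_relaxed);
	/* Assumption: userspace stand-in for smp_mb(). */
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&b, 1, memory_order_relaxed);
	return NULL;
}

/* Reader: bar() from the thread, plus read-side ordering. */
static void *bar(void *seen_a)
{
	while (atomic_load_explicit(&b, memory_order_relaxed) == 0)
		continue;
	/* Assumption: acquire fence as a rough stand-in for smp_rmb(). */
	atomic_thread_fence(memory_order_acquire);
	*(int *)seen_a = atomic_load_explicit(&a, memory_order_relaxed);
	return NULL;
}

/*
 * Hypothetical helper: run foo() and bar() once on separate threads.
 * Returns the value of "a" that bar() observed, or -1 if a thread
 * could not be created.
 */
int run_once(void)
{
	pthread_t writer, reader;
	int seen_a = -1;

	atomic_store(&a, 0);
	atomic_store(&b, 0);
	if (pthread_create(&writer, NULL, foo, NULL))
		return -1;
	if (pthread_create(&reader, NULL, bar, &seen_a)) {
		pthread_join(writer, NULL);
		return -1;
	}
	pthread_join(writer, NULL);
	pthread_join(reader, NULL);
	/* The thread's assert(a == 1), checked after the join. */
	assert(seen_a == 1);
	return seen_a;
}
```

With both fences in place, the C11 fence rules guarantee that bar()
sees a == 1 once it has seen b == 1.  Remove the acquire fence and the
guarantee is gone, and, as noted above, bar() cannot tell whether it
read a stale value or had its loads reordered.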

I added the following QQ.  Does that help?

							Thanx, Paul

------------------------------------------------------------------------

commit 089f8a025a5ce4adc3a8f97b975ed638e8fb7a95
Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
Date:   Wed Apr 20 20:56:22 2022 -0700

    appendix/whymb: Add stale/reordered QQ
    
    Reported-by: Hao Lee <haolee.swjtu@xxxxxxxxx>
    Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>

diff --git a/appendix/whymb/whymemorybarriers.tex b/appendix/whymb/whymemorybarriers.tex
index 347635a4..2140eb8a 100644
--- a/appendix/whymb/whymemorybarriers.tex
+++ b/appendix/whymb/whymemorybarriers.tex
@@ -857,21 +857,33 @@ Then the sequence of operations might be as follows:
 \item	CPU~0 receives the cache line containing ``a'' and applies
 	the buffered store just in time to fall victim to CPU~1's
 	failed assertion.
+	\label{seq:app:whymb:Store Buffers and Memory Barriers victim}
 \end{sequence}
 
-\QuickQuiz{
+\EQuickQuiz{
 	In \cref{seq:app:whymb:Store Buffers and Memory Barriers} above,
 	why does CPU~0 need to issue a ``read invalidate''
 	rather than a simple ``invalidate''?
 	After all, \co{foo()} will overwrite the variable \co{a} in any
 	case, so why should it care about the old value of \co{a}?
-}\QuickQuizAnswer{
+}\EQuickQuizAnswer{
 	Because the cache line in question contains more data than just the
 	variable \co{a}.
 	Issuing ``invalidate'' instead of the needed ``read invalidate''
 	would cause that other data to be lost, which would constitute
 	a serious bug in the hardware.
-}\QuickQuizEnd
+}\EQuickQuizEnd
+
+\EQuickQuiz{
+	In \cref{seq:app:whymb:Store Buffers and Memory Barriers victim}
+	above, did \co{bar()} read a stale value from \co{a}, or did
+	its reads of \co{b} and \co{a} get reordered?
+}\EQuickQuizAnswer{
+	It could be either, depending on the hardware implementation.
+	And it really does not matter which.
+	After all, the \co{bar()} function's \co{assert()} cannot tell
+	the difference!
+}\EQuickQuizEnd
 
 The hardware designers cannot help directly here, since the CPUs have
 no idea which variables are related, let alone how they might be related.