On Sun, Apr 17, 2022 at 10:34:06AM -0700, Paul E. McKenney wrote:
> On Sun, Apr 17, 2022 at 11:17:26AM +0000, Hao Lee wrote:
> > Hello,
> >
> > I think maybe we can make the following contents more clear:
>
> Too true, and thank you for spotting this!
>
> > Cite from Appendix C.4:
> >
> >     when a given CPU executes a memory barrier, it marks all the
> >     entries currently in its invalidate queue, and forces any
> >     subsequent load to wait until all marked entries have been
> >     applied to the CPU’s cache.
> >
> > It's obvious that this paragraph means read barrier can flush invalidate
> > queue.
>
> True, it -could- flush the invalidate queue.  Or it could just force later
> reads to wait until the invalidate queue drains of its own accord, which
> is what is actually described in the above passage.  Or it could implement
> a large number of possible strategies in between these two extremes.

This is quite interesting.  Thanks.

> The key point is that C.4 is describing implementation.  And implementation
> of full memory barriers.
>
> > Cite from Appendix C.5:
> >
> >     The effect of this is that a read memory barrier orders only
> >     loads on the CPU that executes it, so that all loads preceding
> >     the read memory barrier will appear to have completed before any
> >     load following the read memory barrier.
> >
> > This paragraph means read barrier can prevent Load-Load memory
> > reordering which is caused by out-of-order execution.
>
> This passage describes the software-visible effects of whatever
> implementation is actually used for a given system.

This explanation makes sense to me.  Thanks.

> Another passage in
> the preceding paragraph describes what is happening at the implementation
> level.
>
> > If I understand correctly, read memory barrier has _two functions_, one
> > is flushing invalidate queue to make the loads following the barrier can
> > load the latest value, and the other is stalling instruction pipeline to
> > prevent Load-Load memory reordering.
> > I think these are two completely
> > different functions and we should make such a summary in the book.
>
> I would instead say that there are two different ways that memory barriers
> can interact with invalidate queues.  And there are two different
> levels of abstraction, hardware implementation (buffers and queues)
> and software-visible effect (ordering).
>
> I queued the commit shown below.  Thoughts?
>
> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 1389b9da9760040276f8c53215aaa96d964a0892
> Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
> Date:   Sun Apr 17 10:32:19 2022 -0700
>
>     appendix/whymb: Clarify memory-barrier operation
>
>     Reported-by: Hao Lee <haolee.swjtu@xxxxxxxxx>
>     Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
>
> diff --git a/appendix/whymb/whymemorybarriers.tex b/appendix/whymb/whymemorybarriers.tex
> index 8d58483f..8f607e35 100644
> --- a/appendix/whymb/whymemorybarriers.tex
> +++ b/appendix/whymb/whymemorybarriers.tex
> @@ -1233,33 +1233,76 @@ With this change, the sequence of operations might be as follows:
>  With much passing of MESI messages, the CPUs arrive at the correct answer.
>  This section illustrates why CPU designers must be extremely careful
>  with their cache-coherence optimizations.
> +The key requirement is that the memory barriers provide the appearance
> +of ordering to the software.
> +As long as these appearances are maintained, the hardware can carry
> +out whatever queueing, buffering, marking, stallings, and flushing
> +optimizations it likes.

I still have a question here.  For the following example cited from C.4.3,
we know bar() could see the stale value of "a", which is 0.  But I'm
curious why we regard "reading a stale value" as "an appearance of
reordering".  It seems that the two terms are not the same concept.
	void foo(void)
	{
		a = 1;
		smp_mb();
		b = 1;
	}

	void bar(void)
	{
		while (b == 0)
			continue;
		assert(a == 1);
	}

> -\section{Read and Write Memory Barriers}
> -\label{sec:app:whymb:Read and Write Memory Barriers}
> +\QuickQuiz{
> +	Instead of all of this marking of invalidation-queue entries
> +	and stalling of loads, why not simply force an immediate flush
> +	of the invalidation queue?
> +}\QuickQuizAnswer{
> +	An immediate flush of the invalidation queue would do the trick.
> +	Except that the common-case super-scalar CPU is executing many
> +	instructions at once, and not necessarily even in the expected
> +	order.
> +	So what would ``immediate'' even mean?
> +	The answer is clearly ``not much''.
> +
> +	Nevertheless, for simpler CPUs that execute instructions serially,
> +	flushing the invalidation queue might be a reasonable implementation
> +	strategy.
> +}\QuickQuizEnd
> +
> +\section{Read and Write Memory Barriers}
> +\label{sec:app:whymb:Read and Write Memory Barriers}
>
> -In the previous section, memory barriers were used to mark entries in
> -both the store buffer and the invalidate queue.
> -But in our code fragment, \co{foo()} had no reason to do anything
> -with the invalidate queue, and \co{bar()} similarly had no reason
> -to do anything with the store buffer.
> +In the previous section, memory barriers were used to mark entries in both
> +the store buffer and the invalidate queue.
> +But in our code fragment, \co{foo()} had no reason to do anything with the
> +invalidate queue, and \co{bar()} similarly had no reason to do anything
> +with the store buffer.
>
>  Many CPU architectures therefore provide weaker memory-barrier
>  instructions that do only one or the other of these two.
>  Roughly speaking, a ``read memory barrier'' marks only the invalidate
> -queue and a ``write memory barrier'' marks only the store buffer,
> -while a full-fledged memory barrier does both.
> -
> -The effect of this is that a read memory barrier orders only loads
> -on the CPU that executes it, so that all loads preceding the read memory
> -barrier will appear to have completed before any load following the
> -read memory barrier.
> -Similarly, a write memory barrier orders
> -only stores, again on the CPU that executes it, and again so that
> -all stores preceding the write memory barrier will appear to have
> -completed before any store following the write memory barrier.
> +queue (and snoops entries in the store buffer) and a ``write memory
> +barrier'' marks only the store buffer, while a full-fledged memory
> +barrier does all of the above.
> +
> +The software-visible effect of these hardware mechanisms is that a read
> +memory barrier orders only loads on the CPU that executes it, so that
> +all loads preceding the read memory barrier will appear to have completed
> +before any load following the read memory barrier.
> +Similarly, a write memory barrier orders only stores, again on the
> +CPU that executes it, and again so that all stores preceding the write
> +memory barrier will appear to have completed before any store following
> +the write memory barrier.
>  A full-fledged memory barrier orders both loads and stores, but again
>  only on the CPU executing the memory barrier.
>
> +\QuickQuiz{
> +	But can't full memory barriers impose global ordering?
> +	After all, isn't that needed to provide the ordering
> +	shown in \cref{lst:formal:IRIW Litmus Test}?
> +}\QuickQuizAnswer{
> +	Sort of.
> +
> +	Note well that this litmus test has not one but two full
> +	memory-barrier instructions, namely the two \co{sync} instructions
> +	executed by \co{P2} and \co{P3}.
> +
> +	It is the interaction of those two instructions that provides
> +	the global ordering, not just their individual execution.
> +	For example, each of those two \co{sync} instructions might stall
> +	waiting for all CPUs to process their invalidation queues before
> +	allowing subsequent instructions to execute.\footnote{
> +		Real-life hardware of course applies many optimizations
> +		to minimize the resulting stalls.}
> +}\QuickQuizEnd
> +
>  If we update \co{foo} and \co{bar} to use read and write memory
>  barriers, they appear as follows:

Other changes look good to me.

Thanks,
Hao Lee