On Wed, Apr 20, 2022 at 11:15:53AM -0700, Paul E. McKenney wrote:
> On Wed, Apr 20, 2022 at 06:45:23AM +0000, Hao Lee wrote:
> > On Tue, Apr 19, 2022 at 10:28:50AM -0700, Paul E. McKenney wrote:
> > > On Mon, Apr 18, 2022 at 08:01:17AM +0000, Hao Lee wrote:
> > > > On Sun, Apr 17, 2022 at 10:44:54AM -0700, Paul E. McKenney wrote:
> > > > > On Thu, Apr 14, 2022 at 05:42:25PM +0000, Hao Lee wrote:
> > > > > > Hi,
> > > > > >
> > > > > > At the beginning of C.3.3 we have supposed that the cache line
> > > > > > containing "a" resides _only_ in _CPU1's_ cache. I think this is
> > > > > > why _CPU0_ has to send a "_read_ invalidate" message to _retrieve_
> > > > > > the cache line and invalidate CPU1's copy of it.
> > > > > >
> > > > > > However, the answer says the reason is that the cache line in
> > > > > > question contains more than just the variable a. I can't
> > > > > > understand the logical relationship between this answer and the
> > > > > > question. Am I missing something here? Thanks.
> > > > >
> > > > > I added the commit shown below. Does that help?
> > > > >
> > > > > 							Thanx, Paul
> > > > >
> > > > > ------------------------------------------------------------------------
> > > > >
> > > > > commit 36fe14d5ebe406e331a5d89533fe3187d2019898
> > > > > Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > > > > Date:   Sun Apr 17 10:41:33 2022 -0700
> > > > >
> > > > >     appendix/whymb: Clarify QQ C.8
> > > > >
> > > > >     More clearly note the presence of data other than the variable a.
> > > > >
> > > > >     Reported-by: Hao Lee <haolee.swjtu@xxxxxxxxx>
> > > > >     Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > > > >
> > > > > diff --git a/appendix/whymb/whymemorybarriers.tex b/appendix/whymb/whymemorybarriers.tex
> > > > > index 8f607e35..43f1307b 100644
> > > > > --- a/appendix/whymb/whymemorybarriers.tex
> > > > > +++ b/appendix/whymb/whymemorybarriers.tex
> > > > > @@ -821,9 +821,14 @@ Then the sequence of operations might be as follows:
> > > > >  	In \cref{seq:app:whymb:Store Buffers and Memory Barriers} above,
> > > > >  	why does CPU~0 need to issue a ``read invalidate''
> > > > >  	rather than a simple ``invalidate''?
> > > > > +	After all, \co{foo()} will overwrite \co{a} in any case, so why
> > > > > +	should it care about the old value of \co{a}?
> > > >
> > > > Totally clear!
> > > >
> > > > And we may also need to add some details to C.3.1:
> > > >
> > > >     With the addition of these store buffers, CPU 0 can simply
> > > >     record its write in its store buffer and continue executing.
> > > >     When the cache line does finally make its way from CPU 1 to
> > > >     CPU 0, the data will be moved from the store buffer to the
> > > >     cache line.
> > > >
> > > > This passage explains why we need a store buffer, but I think the
> > > > data in the store buffer won't be moved directly to the cache line.
> > > > Instead, the store buffer must be merged with the cache line returned
> > > > by CPU1, and only after that can it be moved to CPU0's cache line.
> > >
> > > You lost me here.
> > >
> > > Ah, maybe the missing point is that store buffers do not necessarily
> > > maintain full cache lines, but only the data that was actually stored.
> >
> > Yes! This is exactly what I want to say. I haven't found any hardware
> > datasheet that illustrates the details, but I think the following
> > process may be reasonable:
> >
> > The memory data at addresses 0x0~0xf exists only in CPU1's cache line,
> > and now CPU0 wants to write a byte at address 0x0. CPU0 writes the
> > _byte_ into its store buffer and sends a "read invalidate" message to
> > CPU1.
> > When CPU0 receives the whole cache line sent back by CPU1, it needs
> > to overwrite the first byte of that cache line with the byte in the
> > store buffer, leaving the other 15 bytes untouched. And then the
> > "merged" cache line can be moved into CPU0's cache.
>
> How about as in the commit shown below?
>
> > > Or, if the store buffer does contain full cache lines, it also contains
> > > a mask to indicate what portions of the cache line need to be updated.
> >
> > I think this scenario seems impossible because CPU0 doesn't have the
> > content of the target cache line, and it can only record the changed
> > bytes in its store buffer.
>
> Well, there are many ways to record changed bytes. One way would be
> to have each store-buffer entry have double the bits of a cache line,
> so that if each cache line is 64 bits, each store-buffer entry has
> 128 bits. 64 of those bits record the recently stored values, with
> don't-care bits for any portions of that cache line that have not been
> recently stored to by this CPU. The other 64 bits are set to the value
> 1 if the corresponding bit has recently been stored to, and set to the
> value zero otherwise.
>
> The obvious disadvantage of this approach is the larger size of each
> store-buffer entry. The corresponding advantage is that the common
> case of consecutive stores can usually be merged into a single
> store-buffer entry.

Thanks for elaborating on these details! Pretty clear!

> Again, how about the commit shown below?
>
> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 475cc7fa460f60b0e518808c68890c8d63658d1c
> Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
> Date:   Wed Apr 20 10:50:59 2022 -0700
>
>     appendix/whymb: Store buffers and partial cache lines
>
>     Reported-by: Hao Lee <haolee.swjtu@xxxxxxxxx>
>     Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
>
> diff --git a/appendix/whymb/whymemorybarriers.tex b/appendix/whymb/whymemorybarriers.tex
> index aeaa4291..347635a4 100644
> --- a/appendix/whymb/whymemorybarriers.tex
> +++ b/appendix/whymb/whymemorybarriers.tex
> @@ -12,7 +12,10 @@ So what possessed CPU designers to cause them to inflict \IXBpl{memory barrier}
>  on poor unsuspecting SMP software designers?
>  
>  In short, because reordering memory references allows much better performance,
> -and so memory barriers are needed to force ordering in things like
> +courtesy of the finite speed of light and the non-zero size of atoms
> +noted in \cref{sec:cpu:Overheads}, and particularly in the
> +hardware-performance question posed by \QuickQuizRef{\QspeedOfLightAtoms}.
> +Therefore, memory barriers are needed to force ordering in things like
>  synchronization primitives whose correct operation depends on ordered
>  memory references.
>  
> @@ -658,16 +661,55 @@ When the cache line does finally make its way from CPU~1 to CPU~0,
>  the data will be moved from the store buffer to the cache line.
>  
>  \QuickQuiz{
> -	But if the main purpose of store buffers is to hide acknowledgment
> -	latencies in multiprocessor cache-coherence protocols, why
> -	do uniprocessors also have store buffers?
> +	But then why do uniprocessors also have store buffers?
>  }\QuickQuizAnswer{
>  	Because the purpose of store buffers is not just to hide
>  	acknowledgement latencies in multiprocessor cache-coherence protocols,
>  	but to hide memory latencies in general.
>  	Because memory is much slower than is cache on uniprocessors,
>  	store buffers on uniprocessors can help to hide write-miss
> -	latencies.
> +	memory latencies.
> +}\QuickQuizEnd
> +
> +Please note that the store buffer does not necessarily operate on
> +full cache lines.
> +The reason for this is that a given store-buffer entry need only contain
> +the value stored, not the other data contained in the corresponding
> +cache line.
> +Which is a good thing, because the CPU doing the store has no idea
> +what that other data might be!
> +But once the corresponding cache line arrives, any values from the
> +store buffer that update that cache line can be merged into it,
> +and the corresponding entries can then be removed from the store buffer.
> +Any other data in that cache line is of course left intact.
> +
> +\QuickQuiz{
> +	So store-buffer entries are variable length?
> +	Isn't that difficult to implement in hardware?
> +}\QuickQuizAnswer{
> +	Here are two ways for hardware to easily handle variable-length
> +	stores.
> +
> +	First, each store-buffer entry could be a single byte wide.
> +	Then a 64-bit store would consume eight store-buffer entries.
> +	This approach is simple and flexible, but one disadvantage is
> +	that each entry would need to replicate much of the address that
> +	was stored to.
> +
> +	Second, each store-buffer entry could be double the size of a
> +	cache line, with half of the bits containing the values stored,
> +	and the other half indicating which bits had been stored to.
> +	So, assuming a 32-bit cache line, a single-byte store of 0x5a
> +	to the low-order byte of a given cache line would result in
> +	\co{0xXXXXXX5a} for the first half and \co{0x000000ff} for the
> +	second half, where the values labeled \co{X} are arbitrary
> +	because they would be ignored.
> +	This approach allows multiple consecutive stores corresponding to
> +	a given cache line to be merged into a single store-buffer entry,
> +	but is space-inefficient for random stores of single bytes.

This commit and these passages have clarified everything! Thank you for
your hard work!

Regards,
Hao Lee

> +
> +	Much more complex and efficient schemes are of course used
> +	by actual hardware designers.
>  }\QuickQuizEnd
>  
>  \begin{figure}
> diff --git a/cpu/overheads.tex b/cpu/overheads.tex
> index b8a65faa..c9f5f1f7 100644
> --- a/cpu/overheads.tex
> +++ b/cpu/overheads.tex
> @@ -425,6 +425,8 @@ thousand clock cycles.
>  	able to do to ease the plight of parallel programmers.
>  }\QuickQuizEnd
>  
> +\QuickQuizLabel{\QspeedOfLightAtoms}
> +
>  \begin{table}
>  \rowcolors{1}{}{lightgray}
>  \renewcommand*{\arraystretch}{1.1}