On Wed, Apr 20, 2022 at 11:15:53AM -0700, Paul E. McKenney wrote:
> On Wed, Apr 20, 2022 at 06:45:23AM +0000, Hao Lee wrote:
> > On Tue, Apr 19, 2022 at 10:28:50AM -0700, Paul E. McKenney wrote:
> > > On Mon, Apr 18, 2022 at 08:01:17AM +0000, Hao Lee wrote:
> > > > On Sun, Apr 17, 2022 at 10:44:54AM -0700, Paul E. McKenney wrote:
> > > > > On Thu, Apr 14, 2022 at 05:42:25PM +0000, Hao Lee wrote:
> > > > > > Hi,
> > > > > >
> > > > > > At the beginning of C.3.3 we have supposed that the cache line
> > > > > > containing "a" resides _only_ in _CPU1's_ cache. I think this is
> > > > > > why _CPU0_ has to send a "_read_ invalidate" message to _retrieve_
> > > > > > the cache line and invalidate CPU1's copy of it.
> > > > > >
> > > > > > However, the answer says the reason is that the cache line in
> > > > > > question contains more than just the variable a. I can't
> > > > > > understand the logical relationship between this answer and the
> > > > > > question. Am I missing something here? Thanks.
> > > > >
> > > > > I added the commit shown below. Does that help?
> > > > >
> > > > > 							Thanx, Paul
> > > > >
> > > > > ------------------------------------------------------------------------
> > > > >
> > > > > commit 36fe14d5ebe406e331a5d89533fe3187d2019898
> > > > > Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > > > > Date:   Sun Apr 17 10:41:33 2022 -0700
> > > > >
> > > > >     appendix/whymb: Clarify QQ C.8
> > > > >
> > > > >     More clearly note the presence of data other than the variable a.
> > > > >
> > > > >     Reported-by: Hao Lee <haolee.swjtu@xxxxxxxxx>
> > > > >     Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > > > >
> > > > > diff --git a/appendix/whymb/whymemorybarriers.tex b/appendix/whymb/whymemorybarriers.tex
> > > > > index 8f607e35..43f1307b 100644
> > > > > --- a/appendix/whymb/whymemorybarriers.tex
> > > > > +++ b/appendix/whymb/whymemorybarriers.tex
> > > > > @@ -821,9 +821,14 @@ Then the sequence of operations might be as follows:
> > > > >  	In \cref{seq:app:whymb:Store Buffers and Memory Barriers} above,
> > > > >  	why does CPU~0 need to issue a ``read invalidate''
> > > > >  	rather than a simple ``invalidate''?
> > > > > +	After all, \co{foo()} will overwrite \co{a} in any case, so why
> > > > > +	should it care about the old value of \co{a}?
> > > >
> > > > Totally clear!
> > > >
> > > > And we may also need to add some details to C.3.1:
> > > >
> > > >     With the addition of these store buffers, CPU 0 can simply
> > > >     record its write in its store buffer and continue executing.
> > > >     When the cache line does finally make its way from CPU 1 to
> > > >     CPU 0, the data will be moved from the store buffer to the
> > > >     cache line.
> > > >
> > > > This passage explains why we need a store buffer, but I think the
> > > > data in the store buffer won't be moved directly to the cache line.
> > > > Instead, the store buffer must be merged with the cache line returned
> > > > by CPU1, and only after that can it be moved to CPU0's cache line.
> > >
> > > You lost me here.
> > >
> > > Ah, maybe the missing point is that store buffers do not necessarily
> > > maintain full cache lines, but only the data that was actually stored.
> >
> > Yes! This is exactly what I want to say. I haven't found any hardware
> > datasheet that illustrates the details, but I think the following
> > process may be reasonable:
> >
> > The memory data at addresses 0x0~0xf exists only in CPU1's cache line,
> > and now CPU0 wants to write a byte at address 0x0. CPU0 writes the
> > _byte_ into its store buffer and sends a "read invalidate" message to
> > CPU1.
> > When CPU0 receives the whole cache line sent back by CPU1, it needs
> > to overwrite the first byte of that cache line with the byte in the
> > store buffer, leaving the other 15 bytes untouched. And then the
> > "merged" cache line can be moved into CPU0's cache.
>
> How about as in the commit shown below?
>
> > > Or, if the store buffer does contain full cache lines, it also contains
> > > a mask to indicate what portions of the cache line need to be updated.
> >
> > I think this scenario seems impossible because CPU0 doesn't have the
> > content of the target cache line, and it can only record the changed
> > bytes in its store buffer.
>
> Well, there are many ways to record changed bytes. One way would be
> to have each store-buffer entry have double the bits of a cache line,
> so that if each cache line is 64 bits, each store-buffer entry has
> 128 bits. 64 of those bits record the recently stored values, with
> don't-care bits for any portions of that cache line that have not been
> recently stored to by this CPU. The other 64 bits are set to the value
> 1 if the corresponding bit has recently been stored to, and set to the
> value zero otherwise.
>
> The obvious disadvantage of this approach is the larger size of each
> store-buffer entry. The corresponding advantage is that the common
> case of consecutive stores can usually be merged into a single
> store-buffer entry.

Thanks for elaborating on these details! Pretty clear!

> Again, how about the commit shown below?
>
> 							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 475cc7fa460f60b0e518808c68890c8d63658d1c
> Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
> Date:   Wed Apr 20 10:50:59 2022 -0700
>
>     appendix/whymb: Store buffers and partial cache lines
>
>     Reported-by: Hao Lee <haolee.swjtu@xxxxxxxxx>
>     Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
>
> diff --git a/appendix/whymb/whymemorybarriers.tex b/appendix/whymb/whymemorybarriers.tex
> index aeaa4291..347635a4 100644
> --- a/appendix/whymb/whymemorybarriers.tex
> +++ b/appendix/whymb/whymemorybarriers.tex
> @@ -12,7 +12,10 @@ So what possessed CPU designers to cause them to inflict \IXBpl{memory barrier}
>  on poor unsuspecting SMP software designers?
>  
>  In short, because reordering memory references allows much better performance,
> -and so memory barriers are needed to force ordering in things like
> +courtesy of the finite speed of light and the non-zero size of atoms
> +noted in \cref{sec:cpu:Overheads}, and particularly in the
> +hardware-performance question posed by \QuickQuizRef{\QspeedOfLightAtoms}.
> +Therefore, memory barriers are needed to force ordering in things like
>  synchronization primitives whose correct operation depends on ordered
>  memory references.
>  
> @@ -658,16 +661,55 @@ When the cache line does finally make its way from CPU~1 to CPU~0,
>  the data will be moved from the store buffer to the cache line.
>  
>  \QuickQuiz{
> -	But if the main purpose of store buffers is to hide acknowledgment
> -	latencies in multiprocessor cache-coherence protocols, why
> -	do uniprocessors also have store buffers?
> +	But then why do uniprocessors also have store buffers?
>  }\QuickQuizAnswer{
>  	Because the purpose of store buffers is not just to hide
>  	acknowledgement latencies in multiprocessor cache-coherence protocols,
>  	but to hide memory latencies in general.
>  	Because memory is much slower than is cache on uniprocessors,
>  	store buffers on uniprocessors can help to hide write-miss
> -	latencies.
> +	memory latencies.
> +}\QuickQuizEnd
> +
> +Please note that the store buffer does not necessarily operate on
> +full cache lines.
> +The reason for this is that a given store-buffer entry need only contain
> +the value stored, not the other data contained in the corresponding
> +cache line.
> +Which is a good thing, because the CPU doing the store has no idea
> +what that other data might be!
> +But once the corresponding cache line arrives, any values from the
> +store buffer that update that cache line can be merged into it,
> +and the corresponding entries can then be removed from the store buffer.
> +Any other data in that cache line is of course left intact.
> +
> +\QuickQuiz{
> +	So store-buffer entries are variable length?
> +	Isn't that difficult to implement in hardware?
> +}\QuickQuizAnswer{
> +	Here are two ways for hardware to easily handle variable-length
> +	stores.
> +
> +	First, each store-buffer entry could be a single byte wide.
> +	Then a 64-bit store would consume eight store-buffer entries.
> +	This approach is simple and flexible, but one disadvantage is
> +	that each entry would need to replicate much of the address that
> +	was stored to.
> +
> +	Second, each store-buffer entry could be double the size of a
> +	cache line, with half of the bits containing the values stored,
> +	and the other half indicating which bits had been stored to.
> +	So, assuming a 32-bit cache line, a single-byte store of 0x5a
> +	to the low-order byte of a given cache line would result in
> +	\co{0xXXXXXX5a} for the first half and \co{0x000000ff} for the
> +	second half, where the values labeled \co{X} are arbitrary
> +	because they would be ignored.
> +	This approach allows multiple consecutive stores corresponding to
> +	a given cache line to be merged into a single store-buffer entry,
> +	but is space-inefficient for random stores of single bytes.

This commit and these passages have clarified everything! Thank you for
your hard work!

Regards,
Hao Lee

> +
> +	Much more complex and efficient schemes are of course used
> +	by actual hardware designers.
>  }\QuickQuizEnd
>  
>  \begin{figure}
> diff --git a/cpu/overheads.tex b/cpu/overheads.tex
> index b8a65faa..c9f5f1f7 100644
> --- a/cpu/overheads.tex
> +++ b/cpu/overheads.tex
> @@ -425,6 +425,8 @@ thousand clock cycles.
>  	able to do to ease the plight of parallel programmers.
>  }\QuickQuizEnd
>  
> +\QuickQuizLabel{\QspeedOfLightAtoms}
> +
>  \begin{table}
>  \rowcolors{1}{}{lightgray}
>  \renewcommand*{\arraystretch}{1.1}