On Thu, 21 Oct 2010, Gleb O. Raiko wrote: > > I'm not sure what you mean: whether the processor will snoop the value to > > read in the store buffer or will it stall until the buffer has drained and > > issue the load on the external bus? > I meant the latter. OK, I hoped so, but just double-checked to be sure. :) > > Therefore in the context of one or more pending uncached stores I can > > assume one of the three for an uncached load: > > > > 1. If the addresses match, then the value loaded is snooped in (retrieved > > from) the store buffer, no external cycle on the bus is seen. This is > > what the R2020 WB did. > > > > 2. The load bypasses the stores and therefore reaches the external bus > > before the stores. This is what the R3220 MB did and I believe the > > R2020 WB defaulted to in the case of no address match. > > > > 3. The load stalls until the outstanding stores have completed and only > > then appears on the external bus. > > > > There's no hurt from using SYNC here and its semantics make it clear it > > enforces the case #3 above even if not otherwise guaranteed. Otherwise I > > think the case #2 would be a reasonable default (i.e. one I'd recommend to > > a processor designer) as draining the store buffer on any uncached load > > whether needed or not is a waste of performance. > There is no such thing like performance in case of uncached loads. > The case #2 requires: > 1. sync > 2. additional operations (usually just a read) to pull data behind input > buffers on an IO bus. When talking to MMIO you often don't need to force the outstanding writes to complete before you exit some driver's code. They will eventually reach the device and to their things in due course. A notable exception are some kinds of side effects that need to be synchronised to prevent races. For example to avoid wasting processing time for handling spurious interrupts you do want to make sure a write that acknowledges a pending interrupt has been recorded by the handler reaches the respective device's register before the interrupt has been cleared in the interrupt controller. On the other hand you do not need to issue a writeback of a request for the device to look for more data in the outgoing DMA descriptor ring. > While it's ok to put that in MMIO reads/writes as you've done, it's almost > impossible to program X server in that way, for example. This beast considers > a frame buffer as an memory array with strong ordering. That's why I'd vote > for the case #3. Not because it outperforms #2 in the real life (who cares for > 0.0001% gain), but because IO devices requires strong ordering. Ah, framebuffers. The DEC Alpha people somehow managed to get them right. :) What you say is of course true for a dumb framebuffer -- but who cares about dumb framebuffers these days? A half-decent graphics controller will provide a set of typical masked raster operations: STORE, AND, OR, XOR, etc. so that you don't have to issue RMW cycles to framebuffer's memory -- all you need are bulk writes, where the order does not really matter and which can be pipelined (the graphics controller may be able to replicate writes too, such as across the whole scanline -- good for the bandwidth!). You may still have to issue some barriers around accesses to framebuffer's control registers, but that's about it. And the TGA X11 driver undobtedly gets these things right or otherwise nobody could have used it and the adapters it supports with an Alpha (as a side note: that graphics chip/software applies to MIPS-based DECstation systems too). This is all early 1990s' technology, no rocket science anymore. :) There's a technical report on the techniques used somewhere on the web -- look for "Smart Frame Buffer" (and don't forget to check its date ;) ). In general: don't break the CPU because you've got a broken piece of software -- fix the piece instead! I stand by my choice -- inefficiency from unnecessary (implicit) ordering barriers accumulates. These operations are so slow (with latencies possibly counted in hundreds of CPU cycles) it really matters whether you need ten or just one, especially with the speeds of contemporary processors. Maciej