Hi,

Here is a proposal for a software workaround to speculative execution on a non-coherent system such as the i2 R10k and the o2 R10k.

1. Problem:
===========

The R10000 processor can (and will) execute instructions ahead of time. These instructions are cancelled if they turn out not to be on the executed path, e.g. after a mispredicted branch. If a load or store instruction is executed speculatively and the accessed memory is not in the cache, the cache line is fetched from main memory and, on a store, marked dirty. These speculative loads and stores can target any address, since the registers used by a speculative load/store may still hold stale values when the instruction is later cancelled.

The problem is:
- on a speculative load, the fetched cache line remains in the cache even if the speculative load is cancelled;
- on a speculative store, the *dirty* cache line remains in the cache even if the speculative store is cancelled.

On non-coherent systems we need to flush cache lines to main memory before doing DMA to a device, so that the device can see them. We also need to invalidate lines before reading from a DMA'd buffer, to make sure the CPU reads main memory and not the cache. However, if a speculative load or store hits the buffer during the DMA transfer, the cache line is fetched from memory and, on a store, marked dirty. That dirty line can later be evicted and written back to memory, overwriting the data a device has just put in the DMA buffer. Something we really don't want to happen ;)

2. Proposed solution
====================

Speculative execution will not happen in the following conditions:
- the access is to uncached memory;
- the speculated instruction causes an exception: in particular, a speculative load/store will not happen in a mapped memory region which has no TLB entry for it.
This second point means that any mapped space can be made safe by removing the DMA'd buffer's address translations from the TLB, or by marking them 'uncached', during the DMA transfer. The remaining unmapped address spaces are:
- kseg1, which is safe since it is uncached;
- kseg0, which can be turned uncached with the K0 bits of the CP0 Config register;
- xkphys, which causes an address error if the KX bit is not set, thus aborting the speculative load/store before it can do harm ;)

Since we need to turn KX off, xkseg will not be accessible either.. and since we need to have kseg0 uncached, we need to remap the kernel elsewhere if we want performance ;). We could use the xsseg segment, available in Supervisor mode, which is mapped (hence safe) and moreover allows access to all of memory (on o2 it can be up to 2GB I think, whereas in 32-bit mode only 512MB would be accessible).

So the proposed workaround is to permanently map the lower 16MB of memory in xsseg using a wired TLB entry and a page size of 16MB. This memory would not be usable for DMA. Everything else would be, so we could for example reserve the upper 16MB for DMA (and give it to the DMA zoned memory allocator).

On exception or error, the handler (in kseg0) would set CU0 to allow access to CP0, then switch to Supervisor mode, jump to the equivalent xsseg location, and continue execution in Supervisor mode. The code for returning to userland would need to clear the CU0 bit, to prevent user access to CP0.

Before a DMA transfer, the DMA'd buffer's cache lines would be flushed, and the buffer would then be remapped 'uncached', preventing any speculative load or store to this memory during the transfer. After the DMA transfer, the cache would be invalidated to make sure main memory is read, and the DMA buffer would be remapped 'cacheable non-coherent'.

A diagram is attached to illustrate the workaround.

Comments, suggestions (and even flames) are welcome before anyone starts coding the workaround ;)

regards,
Vivien Chappelier.
Attachment:
R10k_coherency_workaround.ps
Description: PostScript document