On Tue, 1 Apr 2014, Maciej W. Rozycki wrote: > > When support for the DECStation is enabled, it will default to use a > > MIPS R3000 class processor. This will cause an intentional build failure > > to popup because MIPS_L1_CACHE_SHIFT and cpu_dcache_line_size() > > disagree. Fix this by selecting MIPS_L1_CACHE_SHIFT_2 when we build > > targetting a MIPS R3000 CPU to fix that build failure and satisfy all > > requirements. > > > > Signed-off-by: Florian Fainelli <florian@xxxxxxxxxxx> > > Acked-by: Maciej W. Rozycki <macro@xxxxxxxxxxxxxx> > > This actually boots -- Ralf, please apply. Having done further investigation I need to withdraw my ack; I see these patches went nowhere so far, so please keep the status quo. The thing is while the size of an individual cache entry (i.e. data+tag) is indeed 4 bytes on the R2000 and R3000 DECstations their cache controllers do not necessarily operate on single entries only. Some models do fills on multiple aligned entries at once. So while the stride of 4 bytes is adequate for invalidation, it is not necessarily so for good performance. Specifically: * in DECstation 2100 and 3100 systems [1]: "The CPU maintains the direct-mapped instruction cache and the direct-mapped, write-through data cache. Each cache is 64 KBytes in capacity with a 4-byte line size." * in DECstation 5000/200 systems [2]: "The instruction and data caches are configured with a four-word line size with loads and stores nominally completing in one cycle. Instruction and data cache fills take advantage of page mode memory cycles to complete a four-word fill in 11 access latency cycles, 4 data transfer cycles, plus miss and memory latency overhead. This results in a peak memory read bandwidth of 21 MBytes/second with a 25 MHz system clock." * in DECstation 5000/120, 5000/125 and Personal DECstation 5000/20 and 5000/125 systems (CPU daughtercards are interchangeable between these systems) [4]: "The CPU subsystem contains 64 KB each of instruction cache and data cache. The caches are direct-mapped, write-through caches, each containing 16K word entries. A cache word entry contains 32 bits of instruction or data, 13 tag bits, a valid flag bit, and byte-parity bits. The tag bits hold the high-order part of the physical address in system memory of the cached word. The low-order bits of the system memory address of the cached word are the same as its address in the cache; they form the cache index. The dual cache is implemented in fast SRAM. The R3000A can fetch one instruction and load one data word in each cycle." * in DECstation 5000/240 systems [3]: "The caches are direct-mapped, write-through caches, each containing 16K word entries. A cache word entry contains 32 bits of instruction or data, 16 tag bits, a valid flag bit, and byte-parity bits. The tag bits hold the high-order part of the physical address of the cached word in system memory. The low-order bits of the system memory address of the cached word are the same as its address in the cache; they form the cache index. (Physically, each cache entry contains a total of 60 bits; the unused bits are additional tag and parity bits needed in implementations with smaller caches.) [...] "A cache load fills eight consecutive cache words on an eight-word boundary. The MB contains dual eight-word buffers -- a read buffer and a prefetch buffer. For a cache load, the MB performs a page-mode read from memory to fill its read buffer, at one word per 40-ns memory system cycle after the 8-cycle page mode read latency. When the read buffer is full, the MB writes the eight locations to cache, in eight 25-ns CPU/cache cycles. When the cache line is on a 16-word boundary, the MB also fills the prefetch buffer, so that the next cache line can be available for a subsequent cache load without referencing system memory (unless one of the prefetched words is invalidated by a processor write to the location)." Our code in r3k_cache_lsize only calculates how many bytes in the cache get invalidated at a time. That's of course useful for optimising cache invalidations (that we don't do at the moment anyway), but has nothing to do with the optimising for cache prefetches. A different sizing algorithm would have to be used -- not that difficult to invent too, and maybe worth adding for informational purposes if nothing else. All in all it looks to me like not only MIPS_L1_CACHE_SHIFT_2 shouldn't be set for R2000 and R3000 DECstations, but MIPS_L1_CACHE_SHIFT_4 shouldn't be either. Instead MIPS_L1_CACHE_SHIFT_6 looks like the right choice for good performance with the DECstation 5000/240 system since we don't handle individual family members with separate configurations (MIPS_L1_CACHE_SHIFT_5 would do for the 5000/200). R4k DECstations would remain using MIPS_L1_CACHE_SHIFT_4, although it is quite possible that the MB chip they also have does similar prefetching for their secondary cache (there's that mysterious PF bit in its control and status register). References: [1] Workstation Systems Engineering: "DECstation 3100 Desktop Workstation Functional Specification", Revision 1.3, August 28, 1990, Digital Equipment Corporation, section 6.1: "Processor", p. 4. [2] Workstation Systems Engineering: "DECstation 5000/200 KN02 System Module Functional Specification", Revision 1.3, August 27, 1990, Digital Equipment Corporation, section 4.3: "Processor Subsystem", p. 4. [3] Worksystems Base Product Marketing: "Personal DECstation Series Technical Overview", Version 1.0, December, 1991, Digital Equipment Corporation, section 2.2.3: "The Personal DECstation 5000 CPU Subsystem", p. 8. [4] Worksystems Base Product Marketing: "DECstation 5000 Model 240 Workstation Technical Overview", Version 1.0, December, 1991, Digital Equipment Corporation, section 2.2.4: "Cache Architecture, Implementation, and Operation", pp. 8-9. Maciej