Re: Performance bug in c-r4k.c cache handling code

"Maciej W. Rozycki" <macro@xxxxxxxxxxxxxx> · Tue, 20 Sep 2005 13:37:33 +0100 (BST)

On Tue, 20 Sep 2005, Dominic Sweetman wrote:

> > > I found a performance bug in c-r4k.c:r4k_dma_cache_inv, where a
> > > Hit_Writeback_Inv instead of Hit_Invalidate is done.
> 
> The MIPS64 spec (which is really all there is to set standards in this
> area) regards Hit_Invalidate as optional.  So it would be nice not to
> use it.  CPUs have no standard "configuration" register you can read
> to establish which cacheops work, so to identify capable CPUs you must
> use a table of CPU attributes indexed by the CPU ID, which encourages
> the crime of building software which can't possibly run on a new CPU...

 Or just using the safe fallback -- that shouldn't be a problem (these 
functions are called indirectly).  Besides, new CPUs more often than not 
require changes to kernel-level software anyway.
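
 To illustrate the fallback idea, here is a minimal sketch in plain C.  
It is not the c-r4k.c code: the table layout, the CPU ID values and the 
name select_dma_inv_op() are all hypothetical.  It only shows that an 
unlisted CPU can quietly get Hit_Writeback_Inv instead of failing.

```c
#include <assert.h>
#include <stddef.h>

/* Which cache operation a DMA-invalidate routine would issue. */
enum dma_inv_op {
    OP_HIT_WRITEBACK_INV,   /* safe fallback, always assumed available */
    OP_HIT_INVALIDATE       /* optional according to the MIPS64 spec */
};

/* Hypothetical table entry: a CPU ID and whether Hit_Invalidate works. */
struct cpu_attr {
    unsigned int cpu_id;
    int has_hit_invalidate;
};

/*
 * Pick the operation for a given CPU.  A CPU that is not in the table
 * (e.g. a new one) gets the safe fallback rather than an error.
 */
enum dma_inv_op select_dma_inv_op(const struct cpu_attr *table, size_t n,
                                  unsigned int cpu_id)
{
    size_t i;

    assert(table != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (table[i].cpu_id == cpu_id)
            return table[i].has_hit_invalidate ? OP_HIT_INVALIDATE
                                               : OP_HIT_WRITEBACK_INV;
    }
    return OP_HIT_WRITEBACK_INV;    /* unknown CPU: stay safe */
}
```

 The result would be used once at setup time to choose which routine the 
indirect call points at, so the lookup cost does not matter.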

> So long as the buffer is in fact clean, then in most implementations a
> Hit_Writeback_Invalidate will be just as efficient.

 I hope so, but who knows what's wired there in all those old systems?...

> I suppose where DMA data subsequently gets decorated by the CPU then
> handed on to some other layer, then the buffer is freed...?

 I don't think the buffer is modified, so cache lines should remain clean. 
In the usual case of IP, the data is used exactly once, by copy_and_csum() 
(more or less), which moves it to another buffer.
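
 For readers unfamiliar with the pattern, a rough sketch of a combined 
copy-and-checksum pass follows.  It is not the kernel's routine: the name 
copy_and_sum16() and its signature are made up for illustration, and it 
returns the folded 16-bit ones' complement sum without inverting it.  The 
point is only that the source buffer is read once and never written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy len bytes from src to dst and return the
 * 16-bit ones' complement sum of the data (big-endian words, odd trailing
 * byte padded with zero).  The source is only read, never modified.
 */
uint16_t copy_and_sum16(uint8_t *dst, const uint8_t *src, size_t len)
{
    uint64_t sum = 0;
    size_t i;

    assert(dst != NULL && src != NULL);

    /* Copy and accumulate two bytes at a time. */
    for (i = 0; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }

    /* Odd trailing byte, treated as the high half of a final word. */
    if (i < len) {
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)sum;
}
```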

> > FYI, for R4k DECstations the need to flush the cache for newly allocated 
> > skbs reduces throughput of FDDI reception by about a half (!), down from 
> > about 90Mbps (that's for the /260)...
> 
> How did you measure the high throughput?  Have you got a
> machine with DMA-coherency you can turn on and off?

 I just disabled invalidations. ;-)  Yes, that resulted in some corrupt 
data, but it was good enough to do benchmarking.  That was an R4400 with 
1MB of S-cache.

 Eventually I should benchmark both invalidation variants against each 
other on the system in question and see if it makes any difference.  
Ironically, this is where the write-back cache of the R4k is a loss rather 
than a gain compared to the write-through cache of the R3k (the system 
supports daughtercards with either CPU, so a useful comparison is 
possible).  For the former I have to invalidate the cache across the whole 
newly-allocated buffer, i.e. ~4.5kB.  For the latter I may invalidate only 
the area actually used, once a frame has been received and its length is 
known; that is quite often much smaller than the maximum (especially if 
the frame has been routed from a network with a smaller frame length 
limit, like Ethernet).
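
 The difference in invalidation work can be put into numbers with a small 
sketch.  The helper names are hypothetical, and the 32-byte line size and 
the 4608-byte buffer (taking ~4.5kB as 4.5 * 1024) are assumptions chosen 
for the example, not figures measured on the DECstation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Number of cache lines touched by the byte range [addr, addr + len).
 * line_size must be a power of two.
 */
size_t lines_spanned(uintptr_t addr, size_t len, size_t line_size)
{
    uintptr_t mask, first, last;

    assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
    if (len == 0)
        return 0;

    mask = ~(uintptr_t)(line_size - 1);
    first = addr & mask;                /* line holding the first byte */
    last = (addr + len - 1) & mask;     /* line holding the last byte */
    return (size_t)((last - first) / line_size) + 1;
}

/* Write-back case: the whole newly allocated buffer is invalidated. */
size_t inv_lines_writeback(uintptr_t buf, size_t buf_size, size_t line_size)
{
    return lines_spanned(buf, buf_size, line_size);
}

/* Write-through case: only the received frame needs to be invalidated. */
size_t inv_lines_writethrough(uintptr_t buf, size_t frame_len,
                              size_t line_size)
{
    return lines_spanned(buf, frame_len, line_size);
}
```

 Under those assumptions, a line-aligned 4608-byte buffer costs 144 line 
invalidations in the write-back case, while a 1500-byte frame in the 
write-through case costs 47.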

  Maciej