On Thu, 2014-09-04 at 11:34 +0200, Daniel Vetter wrote:
> On Thu, Sep 04, 2014 at 09:44:04AM +0200, Thomas Hellstrom wrote:
> > Last time I tested (and it seems like Michel is on the same track),
> > writing with the CPU to write-combined memory was substantially faster
> > than writing to cached memory, with the additional side effect that CPU
> > caches are left unpolluted.
> >
> > Moreover (although only tested on Intel's embedded chipsets), texturing
> > from cpu-cache-coherent PCI memory was a real GPU performance hog
> > compared to texturing from non-snooped memory. Hence, whenever a buffer
> > could be classified as GPU-read-only (or almost, at least), it should be
> > placed in write-combined memory.
>
> Just a quick comment since this explicitly refers to Intel chips: on
> desktop/laptop chips with the big shared L3/L4 caches it's the other way
> around. Cached uploads are substantially faster than WC, and not using
> coherent access is a severe perf hit for texturing. I guess the hw guys
> worked really hard to hide the snooping costs so that the gpu can benefit
> from the massive bandwidth these caches can provide.

This is similar to modern POWER chips as well. We have pretty big L3s (though not technically shared, they sit in a separate quadrant, and we have a shared L4 in the memory buffer) and our fabric is generally optimized for cacheable/coherent access performance. In fact, we only have so many credits for NC accesses on the bus...

What that tells me is that when setting up the desired cacheability attributes for the mapping of a memory object, we need to consider these things:

 - The hard requirements of the HW (non-coherent GPUs require NC, AGP does in some cases, etc.), which I think are basically already handled using the placement attributes set by the GPU driver for the memory type.

 - The optimal attributes (and platform hard requirements) for fast memory accesses to the object by the processor.
From what I read here, this can be NC+WC on older Intel, cacheable on newer, etc.

 - The optimal attributes for fast GPU DMA accesses to the object in system memory. Here too, this is fairly platform/chipset dependent.

Do we have flags in the DRM that tell us whether an object in memory is more likely to be used by the GPU via DMA vs. by the CPU via MMIO? On powerpc (except in the old AGP case), I wouldn't care and would just require cacheable in both cases, but I can see the low-latency crowd wanting the former to be non-cacheable, while dumb GPUs like AST that don't do DMA would benefit greatly from the latter.

Cheers,
Ben.

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel