On Mon, Jun 9, 2014 at 7:41 PM, Alexandre Courbot <gnurou@xxxxxxxxx> wrote:
> On Mon, May 19, 2014 at 6:22 PM, Lucas Stach <l.stach@xxxxxxxxxxxxxx> wrote:
>> On Monday, 19.05.2014 at 11:02 +0200, Thierry Reding wrote:
>>> On Mon, May 19, 2014 at 04:10:58PM +0900, Alexandre Courbot wrote:
>>> > Some architectures (e.g. ARM) need the CPU buffers to be explicitely
>>> > flushed for a memory write to take effect. Not doing so results in
>>> > synchronization issues, especially after writing to BOs.
>>>
>>> It seems to me that the above is generally true for all architectures,
>>> not just ARM.
>>>
>> No, on PCI coherent arches, like x86 and some PowerPCs, the GPU will
>> snoop the CPU caches and therefore an explicit cache flush is not
>> required.
>>
>>> Also: s/explicitely/explicitly/
>>>
>>> > This patch introduces a macro that flushes the caches on ARM and
>>> > translates to a no-op on other architectures, and uses it when
>>> > writing to in-memory BOs. It will also be useful for implementations of
>>> > instmem that access shared memory directly instead of going through
>>> > PRAMIN.
>>>
>>> Presumably instmem can access shared memory on all architectures, so
>>> this doesn't seem like a property of the architecture but rather of the
>>> memory pool backing the instmem.
>>>
>>> In that case I wonder if this shouldn't be moved into an operation that
>>> is implemented by the backing memory pool and be a noop where the cache
>>> doesn't need explicit flushing.
>>>
>>> > diff --git a/drivers/gpu/drm/nouveau/core/os.h b/drivers/gpu/drm/nouveau/core/os.h
>>> > index d0ced94ca54c..274b4460bb03 100644
>>> > --- a/drivers/gpu/drm/nouveau/core/os.h
>>> > +++ b/drivers/gpu/drm/nouveau/core/os.h
>>> > @@ -38,4 +38,21 @@
>>> >  #endif /* def __BIG_ENDIAN else */
>>> >  #endif /* !ioread32_native */
>>> >
>>> > +#if defined(__arm__)
>>> > +
>>> > +#define nv_cpu_cache_flush_area(va, size) \
>>> > +do { \
>>> > +        phys_addr_t pa = virt_to_phys(va); \
>>> > +        __cpuc_flush_dcache_area(va, size); \
>>> > +        outer_flush_range(pa, pa + size); \
>>> > +} while (0)
>>>
>>> Couldn't this be a static inline function?
>>>
>>> > diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
>>> [...]
>>> > index 0886f47e5244..b9c9729c5733 100644
>>> > --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
>>> > +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
>>> > @@ -437,8 +437,10 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
>>> >          mem = &mem[index];
>>> >          if (is_iomem)
>>> >                  iowrite16_native(val, (void __force __iomem *)mem);
>>> > -        else
>>> > +        else {
>>> >                  *mem = val;
>>> > +                nv_cpu_cache_flush_area(mem, 2);
>>> > +        }
>>> >  }
>>> >
>>> >  u32
>>> > @@ -461,8 +463,10 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
>>> >          mem = &mem[index];
>>> >          if (is_iomem)
>>> >                  iowrite32_native(val, (void __force __iomem *)mem);
>>> > -        else
>>> > +        else {
>>> >                  *mem = val;
>>> > +                nv_cpu_cache_flush_area(mem, 4);
>>> > +        }
>>>
>>> This looks rather like a sledgehammer to me. Effectively this turns nvbo
>>> into an uncached buffer. With additional overhead of constantly flushing
>>> caches. Wouldn't it make more sense to locate the places where these are
>>> called and flush the cache after all the writes have completed?
>>>
>> I don't think the explicit flushing for those things makes sense. I
>> think it is a lot more effective to just map the BOs write-combined on
>> PCI non-coherent arches.
>> This way any writes will be buffered. Reads
>> will be slow, but I don't think nouveau is reading back a lot from those
>> buffers.
>> Using the write-combining buffer doesn't need any additional
>> synchronization as it will get flushed on pushbuf kickoff anyways.
>
> I tried to go that way, and something interesting happened.
>
> What I did: remove this patch and instead set the following caching
> parameters for the TTM_PL_TT case in nouveau_bo_init_mem_type():
>
> man->available_caching = TTM_PL_FLAG_UNCACHED | TTM_PL_FLAG_WC;
> man->default_caching = TTM_PL_FLAG_WC;
>
> What happened: no runtime errors as what happened when caching is
> enabled. However, many of the vertex and texture buffers seem to be
> partially corrupted. In glmark2 the 3d models had many vertices (but
> not all) at the wrong position. Note that not all the scenes ended up
> being corrupted - in particular, when two consecutive scenes used the
> same model, the second instance would be uncorrupted.
>
> Forcing the caching to TTM_PL_FLAG_UNCACHED led to the same result.
> What is interesting is that while data like vertices and textures got
> corrupted, pushbuffers and shader programs seem to be just fine, as I
> could not see any runtime error.

An interesting fact: if I change ttm_bo_kmap_ttm() so that kernel
mappings of BOs are always performed write-combined, and leave the
TTM_PL_TT default caching at TTM_PL_FLAG_CACHED so user-space mappings
remain cached, the corruptions just vanish. It seems that setting
user-space mappings to anything non-cached is what leads to this
puzzling behavior.

Certainly some subtlety of ARM mappings is going over my head here. If
we need to implement different policies for kernel and user-space
mappings, this might complicate things a bit, especially since support
needs to be in TTM and not only Nouveau.

I will submit an RFC tomorrow if I don't hear better ideas by then.

Alex.
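
For reference, the alternative Thierry raises in the quoted review (flush
once after all the writes have completed, instead of after every single
write) can be illustrated with a rough user-space sketch. Everything here
is hypothetical and only meant to show the shape of the idea: `fake_bo`,
`fake_flush_area()` and the other `fake_*` helpers are stand-ins invented
for this example, not Nouveau or ARM cache-maintenance APIs, and the
"flush" is a stub that merely counts how often it is called.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a CPU-cached buffer object. */
struct fake_bo {
        uint32_t *map;     /* CPU mapping of the buffer */
        size_t count;      /* number of 32-bit words in the buffer */
        unsigned flushes;  /* how many times the stub flush was called */
};

/* Stub: a real driver would do cache maintenance here; this only counts. */
void fake_flush_area(struct fake_bo *bo, void *va, size_t size)
{
        (void)va;
        (void)size;
        bo->flushes++;
}

/* Plain write with no cache maintenance. */
void fake_bo_wr32(struct fake_bo *bo, size_t index, uint32_t val)
{
        assert(index < bo->count);
        bo->map[index] = val;
}

/* Per-write flush, mirroring the shape of the patch under review. */
void fake_bo_wr32_flush(struct fake_bo *bo, size_t index, uint32_t val)
{
        fake_bo_wr32(bo, index, val);
        fake_flush_area(bo, &bo->map[index], sizeof(val));
}

/* Batched variant: write n words, then flush the whole range once. */
void fake_bo_write_batch(struct fake_bo *bo, size_t first,
                         const uint32_t *vals, size_t n)
{
        size_t i;

        assert(first <= bo->count && n <= bo->count - first);
        for (i = 0; i < n; i++)
                fake_bo_wr32(bo, first + i, vals[i]);
        if (n)
                fake_flush_area(bo, &bo->map[first], n * sizeof(*vals));
}
```

With these stubs, writing four words through fake_bo_wr32_flush() bumps
the flush counter four times, while fake_bo_write_batch() bumps it once
for the same data. Whether the real call sites can be batched this way is
a separate question that this sketch does not answer.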