On Tue, 2022-03-22 at 11:20 +0000, Tvrtko Ursulin wrote:
> 
> On 22/03/2022 10:26, Thomas Hellström wrote:
> > On Tue, 2022-03-22 at 10:13 +0000, Tvrtko Ursulin wrote:
> > > 
> > > On 21/03/2022 15:15, Thomas Hellström wrote:
> > > > On Mon, 2022-03-21 at 14:43 +0000, Tvrtko Ursulin wrote:
> > > > > 
> > > > > On 21/03/2022 13:40, Thomas Hellström wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > On Mon, 2022-03-21 at 13:12 +0000, Tvrtko Ursulin wrote:
> > > > > > > 
> > > > > > > On 21/03/2022 12:33, Thomas Hellström wrote:
> > > > > > > > On Mon, 2022-03-21 at 12:22 +0000, Tvrtko Ursulin wrote:
> > > > > > > > > 
> > > > > > > > > On 21/03/2022 11:03, Thomas Hellström wrote:
> > > > > > > > > > Hi, Tvrtko.
> > > > > > > > > > 
> > > > > > > > > > On 3/21/22 11:27, Tvrtko Ursulin wrote:
> > > > > > > > > > > 
> > > > > > > > > > > On 19/03/2022 19:42, Michael Cheng wrote:
> > > > > > > > > > > > To align with the discussion in [1][2], this patch series drops all usage of wbinvd_on_all_cpus within i915 by either replacing the call with certain drm clflush helpers, or reverting to a previous logic.
> > > > > > > > > > > 
> > > > > > > > > > > AFAIU, the complaint from [1] was that it is wrong to provide non-x86 implementations under the wbinvd_on_all_cpus name. Instead an arch-agnostic helper which achieves the same effect could be created. Does Arm have such a concept?
> > > > > > > > > > 
> > > > > > > > > > I also understand Linus' email like we shouldn't leak incoherent IO to other architectures, meaning any remaining wbinvd()s should be x86 only.
> > > > > > > > > 
> > > > > > > > > The last part is completely obvious since it is an x86 instruction name.
> > > > > > > > 
> > > > > > > > Yeah, I meant the function implementing wbinvd() semantics.
> > > > > > > > 
> > > > > > > > > But I think we can't pick a solution until we know how the concept maps to Arm, and that will also include seeing how drm_clflush_sg for Arm would look. Is there a range-based solution, or just a big hammer there? If the latter, then it is no good to churn all these reverts; instead an arch-agnostic wrapper, with a generic name, would be the way to go.
> > > > > > > > But my impression was that ARM would not need the range-based interface either, because ARM is only for discrete, and with discrete we're always coherent.
> > > > > > > 
> > > > > > > Not sure what you mean here - what about flushing system memory objects on discrete? Those still need flushing on paths like suspend, which this series touches. Am I missing something?
> > > > > > 
> > > > > > System bos on discrete should always have
> > > > > > 
> > > > > > I915_BO_CACHE_COHERENT_FOR_READ | I915_BO_CACHE_COHERENT_FOR_WRITE
> > > > > > 
> > > > > > either by the gpu being fully cache coherent or by us mapping system memory write-combined. Hence no need for cache clflushes or wbinvd() for incoherent IO.
> > > > > 
> > > > > Hmm, so you are talking about the shmem ttm backend. It ends up depending on the result of i915_ttm_cache_level, yes? It cannot end up with I915_CACHE_NONE from that function?
> > > > 
> > > > If the object is allocated with allowable placement in either LMEM or SYSTEM, and it ends up in system, it gets allocated with I915_CACHE_NONE, but then the shmem ttm backend isn't used, but TTM's wc pools, and the object should *always* be mapped wc. Even in system.
> > > 
> > > I am not familiar with either the TTM backend or the wc pools, so maybe a misplaced question - if obj->cache_level can be set to none, and obj->cache_coherency to zero, then during the object's lifetime the helpers which consult those fields (like i915_gem_cpu_write_needs_clflush, __start_cpu_write, etc) are giving out incorrect answers? That is, it is irrelevant that they would say flushes are required, since in actuality those objects can never, from anywhere, be mapped other than WC, so flushes aren't actually required?
> > 
> > If we map other than WC somewhere in these situations, that should be a bug needing a fix. Some of the helpers that you mention might still flag that a clflush is needed, and in that case that's an oversight that also needs fixing.
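To make that concrete, the check those helpers do boils down to something like the below - a simplified sketch only, with details like framebuffer handling omitted, and cpu_write_needs_clflush() a stand-in name rather than the exact i915 code:

	/*
	 * Sketch of the coherency check: a CPU write only needs a
	 * clflush when the GPU cannot snoop the CPU cache for writes.
	 */
	static bool cpu_write_needs_clflush(const struct drm_i915_gem_object *obj)
	{
		return !(obj->cache_coherent & I915_BO_CACHE_COHERENT_FOR_WRITE);
	}

For the discrete system objects discussed above, with obj->cache_coherent zero, such a check would indeed claim a flush is needed even though the WC mapping makes it unnecessary - that's the oversight referred to.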
> > > > > I also found in i915_drm.h:
> > > > > 
> > > > >  * As caching mode when specifying `I915_MMAP_OFFSET_FIXED`, WC or WB will
> > > > >  * be used, depending on the object placement on creation. WB will be used
> > > > >  * when the object can only exist in system memory, WC otherwise.
> > > > > 
> > > > > If what you say is true, that on discrete it is _always_ WC, then that needs updating as well.
> > > > 
> > > > If an object is allocated as system only, then it is mapped WB, and we're relying on the gpu being cache coherent to avoid clflushes. The same is actually currently true if the object happens to be accessed by the cpu while evicted. Might need an update for that.
> > > 
> > > Hmm okay, I think I actually misunderstood something here. The reason for the difference between a smem+lmem object which happens to be in smem and a smem-only object is eluding me.
> > > 
> > > > > > That's adhering to Linus':
> > > > > > 
> > > > > > "And I sincerely hope to the gods that no cache-incoherent i915 mess ever makes it out of the x86 world. Incoherent IO was always a historical mistake and should never ever happen again, so we should not spread that horrific pattern around."
> > > > > 
> > > > > Sure, but I was not talking about IO - just the CPU side access to CPU side objects.
> > > > 
> > > > OK, I was under the impression that clflush()es and wbinvd()s in i915 were only ever used to make data visible to non-snooping GPUs.
> > > > 
> > > > Do you mean that there are other uses as well? Agreed, the wb cache flush on suspend only if the gpu is !I915_BO_CACHE_COHERENT_FOR_READ looks to not fit this pattern completely.
> > > 
> > > Don't know, I was first trying to understand the handling of obj->cache_coherent as discussed in the first quote block - whether the flags are consistently set, and how the Arm low level code would look.
> > > 
> > > > Otherwise, for architectures where memory isn't always fully coherent with the cpu cache, I'd expect them to use the apis in asm/cacheflush.h, like flush_cache_range() and similar, which are nops on x86.
> > > 
> > > Hm, do you know why they are no-ops? Like why wouldn't they map to clflush?
> > 
> > I think it mostly boils down to the PIPT caches on x86. Everything is assumed to be coherent. Whereas some architectures keep different cache entries for different virtual addresses even if the physical page is the same...
> > 
> > clflushes and wbinvds on x86 are for odd arch-specific situations where, for example, we change the caching attributes of the linear kernel map mappings.
> 
> So in summary we have flush_cache_range, which is generic, not implemented on x86, and works with virtual addresses, so it is not directly usable even if an x86 implementation were added.

For the intended flush_cache_range() semantics, "make this range visible to all vms on all cpus", I think the x86 implementation actually is a nop, and correctly implemented.

> There is also the x86 specific clflush_cache_range, which works with virtual addresses as well, so it is no good for drm_clflush_sg.
> 
> The question you implicitly raise, correct me if I got it wrong, is whether we should even be trying to extend drm_clflush_sg for Arm, given how most (all?) call sites are not needed on discrete. Is that right?

Yes, exactly. No need to bother figuring this out for ARM, as we don't do any incoherent IO there.

> Would that mean we could leave most of the code as is and just replace wbinvd_on_all_cpus with something like i915_flush_cpu_caches, which would then legitimately do nothing, at least on Arm if not also on discrete in general?

Yes, with the caveat that we should, at least as a second step, make i915_flush_cpu_caches() range-based if possible, from a performance point of view.
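Something like the below, perhaps - a rough sketch only; the name is the one you suggested, and the exact placement and config guards are of course up for discussion:

	/*
	 * Arch-agnostic stand-in for wbinvd_on_all_cpus(): flush CPU
	 * caches where we may have done incoherent IO. Only x86 ever
	 * does incoherent IO, so everywhere else this legitimately
	 * does nothing.
	 */
	static inline void i915_flush_cpu_caches(void)
	{
	#ifdef CONFIG_X86
		wbinvd_on_all_cpus();
	#endif
	}

A range-based variant could then take a start address and length and use drm_clflush_virt_range() on x86 where a clflush loop is cheaper than the full wbinvd.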
> If that would work it would make a small and easy to review series. I don't think it would collide with what Linus asked, since it is not propagating undesirable things further - if there is no actual need to flush, then there is no need to make it range-based either.
> 
> An exception would be the dmabuf get pages patch, which needs a proper implementation of a new drm flush helper.

I think the dmabuf get_pages (note that that's also only for integrated, I915_CACHE_NONE, x86-only situations) can be done with something like the below, modulo error handling, and assuming the iosys_map-based dma_buf_vmap() interface:

	struct iosys_map map;

	dma_buf_vmap(dma_buf, &map);
	drm_clflush_virt_range(map.vaddr, length);
	dma_buf_vunmap(dma_buf, &map);

/Thomas

> 
> Regards,
> 
> Tvrtko