Re: [PATCH 0/4] Drop wbinvd_on_all_cpus usage

On 3/22/22 13:53, Tvrtko Ursulin wrote:

On 22/03/2022 11:37, Thomas Hellström wrote:
On Tue, 2022-03-22 at 11:20 +0000, Tvrtko Ursulin wrote:

On 22/03/2022 10:26, Thomas Hellström wrote:
On Tue, 2022-03-22 at 10:13 +0000, Tvrtko Ursulin wrote:

On 21/03/2022 15:15, Thomas Hellström wrote:
On Mon, 2022-03-21 at 14:43 +0000, Tvrtko Ursulin wrote:

On 21/03/2022 13:40, Thomas Hellström wrote:
Hi,

On Mon, 2022-03-21 at 13:12 +0000, Tvrtko Ursulin wrote:

On 21/03/2022 12:33, Thomas Hellström wrote:
On Mon, 2022-03-21 at 12:22 +0000, Tvrtko Ursulin wrote:

On 21/03/2022 11:03, Thomas Hellström wrote:
Hi, Tvrtko.

On 3/21/22 11:27, Tvrtko Ursulin wrote:

On 19/03/2022 19:42, Michael Cheng wrote:
To align with the discussion in [1][2], this patch series drops all usage of wbinvd_on_all_cpus within i915 by either replacing the call with certain drm clflush helpers, or reverting to a previous logic.

AFAIU, the complaint from [1] was that it is wrong to provide non-x86 implementations under the wbinvd_on_all_cpus name. Instead an arch-agnostic helper which achieves the same effect could be created. Does Arm have such a concept?

I also understand Linus' email to mean we shouldn't leak incoherent IO to other architectures, meaning any remaining wbinvd()s should be x86 only.

The last part is completely obvious since it is an x86 instruction name.

Yeah, I meant the function implementing wbinvd() semantics.


But I think we can't pick a solution until we know how the concept maps to Arm, and that will also include seeing how drm_clflush_sg for Arm would look. Is there a range-based solution, or just a big hammer there? If the latter, then it is no good to churn all these reverts; instead an arch-agnostic wrapper, with a generic name, would be the way to go.

But my impression was that ARM would not need the range-based interface either, because ARM is only for discrete, and with discrete we're always coherent.

Not sure what you mean here - what about flushing system memory objects on discrete? Those still need flushing on paths like suspend, which this series touches. Am I missing something?

System bos on discrete should always have

I915_BO_CACHE_COHERENT_FOR_READ |
I915_BO_CACHE_COHERENT_FOR_WRITE

either because the gpu is fully cache coherent, or because we map system memory write-combined. Hence no need for clflushes or wbinvd() for incoherent IO.

Hmm, so you are talking about the shmem ttm backend. It ends up depending on the result of i915_ttm_cache_level, yes? It cannot end up with I915_CACHE_NONE from that function?

If the object is allocated with allowable placement in either LMEM or SYSTEM, and it ends up in system, it gets allocated with I915_CACHE_NONE, but then the shmem ttm backend isn't used but TTM's wc pools, and the object should *always* be mapped wc. Even in system.

I am not familiar with either the TTM backend or wc pools, so maybe a missed question - if obj->cache_level can be set to none, and obj->cache_coherency to zero, then during object lifetime are the helpers which consult those fields (like i915_gem_cpu_write_needs_clflush, __start_cpu_write, etc.) giving out incorrect answers? That is, is it irrelevant that they would say flushes are required, since in actuality those objects can never, from anywhere, be mapped other than WC, so flushes aren't actually required?

If we map other than WC somewhere in these situations, that should be a bug needing a fix. It might be that some of these helpers you mention still flag that a clflush is needed, and in that case that's an oversight that also needs fixing.


I also found in i915_drm.h:

 * As caching mode when specifying `I915_MMAP_OFFSET_FIXED`, WC or WB will
 * be used, depending on the object placement on creation. WB will be used
 * when the object can only exist in system memory, WC otherwise.

If what you say is true, that on discrete it is _always_ WC, then that needs updating as well.

If an object is allocated as system-only, then it is mapped WB, and we're relying on the gpu being cache coherent to avoid clflushes. The same is actually currently true if the object happens to be accessed by the cpu while evicted. Might need an update for that.

Hmm okay, I think I actually misunderstood something here. The reason for the difference between an smem+lmem object which happens to be in smem, and an smem-only object, is eluding me.


That's adhering to Linus'

"And I sincerely hope to the gods that no cache-incoherent i915 mess ever makes it out of the x86 world. Incoherent IO was always a historical mistake and should never ever happen again, so we should not spread that horrific pattern around."

Sure, but I was not talking about IO - just the CPU side access to CPU side objects.

OK, I was under the impression that clflushes() and wbinvd()s in i915 were only ever used to make data visible to non-snooping GPUs.

Do you mean that there are other uses as well? Agreed, the wb cache flush on suspend only if the gpu is !I915_BO_CACHE_COHERENT_FOR_READ looks to not fit this pattern completely.

Don't know, I was first trying to understand the handling of obj->cache_coherent as discussed in the first quote block. Are the flags consistently set, and how will the Arm low-level code look?

Otherwise, for architectures where memory isn't always fully coherent with the cpu cache, I'd expect them to use the APIs in asm/cacheflush.h, like flush_cache_range() and similar, which are nops on x86.

Hm, do you know why they are no-ops? Like, why wouldn't they map to clflush?

I think it mostly boils down to the PIPT caches on x86. Everything is assumed to be coherent. Whereas some architectures keep different cache entries for different virtual addresses even if the physical page is the same...

clflushes and wbinvds on x86 are for odd arch-specific situations, for example where we change caching attributes of the linear kernel map mappings.

So in summary we have flush_cache_range, which is generic, not implemented on x86, and works with virtual addresses, so not directly usable even if an x86 implementation was added.

For the intended flush_cache_range() semantics - "make this range visible to all vms on all cpus" - I think the x86 implementation is actually a nop, and correctly implemented.

If that is so then I agree. (I did not spend much time looking for desired semantics, just noticed there was no kerneldoc next to the function and stopped there.)

There is also the x86-specific clflush_cache_range(), which works with virtual addresses as well, so no good for drm_clflush_sg.

The question you implicitly raise, correct me if I got it wrong, is whether we should even be trying to extend drm_clflush_sg for Arm, given how most (all?) call sites are not needed on discrete. Is that right?

Yes, exactly. No need to bother figuring this out for ARM, as we don't do any incoherent IO.


Would that mean we could leave most of the code as is and just replace wbinvd_on_all_cpus with something like i915_flush_cpu_caches, which would then legitimately do nothing, at least on Arm if not also on discrete in general?

Yes, with the caveat that we should, at least as a second step, make i915_flush_cpu_caches() range-based if possible from a performance point of view.

Sounds like a plan, and I am counting on the second step to really be a second step, because that one will need to actually figure out and sufficiently elaborate all three proposed reverts, which was missing in this posting. So the first step unblocks Arm builds very cheaply and non-controversially, and the second step tries going the range route.

If that would work it would make a small and easy-to-review series. I don't think it would collide with what Linus asked, since it is not propagating undesirable things further - given that if there is no actual need to flush, then there is no need to make it range-based either.

The exception would be the dmabuf get_pages patch, which needs a proper implementation of a new drm flush helper.

I think the dmabuf get_pages (note that that's also only for integrated I915_CACHE_NONE x86-only situations) can be done with

dma_buf_vmap(dma_buf, &map);
drm_clflush_virt_range(map.vaddr, length);
dma_buf_vunmap(dma_buf, &map);

(with map being a struct dma_buf_map)

Looks plausible to me. Downside being it vmaps the whole object at once so may regress, at least on 32-bit (!) builds. Would it work in theory to fall back to page by page but would it be worth it just for 32-bit I am not sure.

Back in the day, IIRC, there was a kmap() api also for dma-buf. But nobody used it, and yes, vmap is not ideal, but a simple fallback to page-based flushing (or even wbinvd on the rare occasion of a vmap failure) might be ok.

/Thomas



Regards,

Tvrtko


