Re: [PATCH 1/2] drm/amdgpu: make duplicated EOP packet for GFX7/8 have real content

Christian König <christian.koenig@xxxxxxx> · Mon, 17 Jun 2024 17:07:14 +0200

    On 17.06.24 at 16:57, Icenowy Zheng wrote:

      On Monday, 2024-06-17 at 16:42 +0200, Christian König wrote:

        On 17.06.24 at 16:30, Icenowy Zheng wrote:

          On Monday, 2024-06-17 at 15:59 +0200, Christian König wrote:

            On 17.06.24 at 15:43, Icenowy Zheng wrote:

              On Monday, 2024-06-17 at 15:09 +0200, Christian König wrote:

              ...

              In this case shouldn't we write seq-1 before any work, and then
              write seq after work, like what is done in Mesa?

            No. This hw workaround requires that two consecutive write
            operations happen directly behind each other on the PCIe bus
            with two different values.

          Well to be honest the workaround code in Mesa seems to not be
          working in this way ...

        Mesa doesn't have any workaround for that hw issue, the code there
        uses a quite different approach.

      Ah? Commit bf26da927a1c ("drm/amdgpu: add cache flush workaround to
      gfx8 emit_fence") says "Both PAL and Mesa use it for gfx8 too, so port
      this commit to gfx_v8_0_ring_emit_fence_gfx", so maybe the workaround
      should just not be necessary here?

    What I meant was that Mesa doesn't have a hack like writing seq - 1
    and then seq.

    I haven't checked the code, but it uses a different approach with
    64bit values as far as I know.

            To make the software logic around that work without any changes
            we use the values seq - 1 and seq because those are guaranteed
            to be different and not trigger any unwanted software behavior.

            Only then we can guarantee that we have a coherent view of system
            memory.
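
As an illustration of the seq - 1 / seq scheme described above, here is a
minimal, self-contained C sketch. It is not the amdgpu code: the names
struct fence_slot, eop_write(), emit_fence_with_workaround() and
fence_signaled() are hypothetical, plain stores stand in for the EOP
packets, and it assumes fence sequence numbers increase by one, so that
seq - 1 is a value waiters have already seen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the fence location in system memory. */
struct fence_slot {
    volatile uint32_t value; /* last value written by the simulated GPU */
    unsigned int writes;     /* number of writes that reached the slot */
};

/* One simulated EOP-style write to the fence location. */
static void eop_write(struct fence_slot *slot, uint32_t value)
{
    assert(slot);
    slot->value = value;
    slot->writes++;
}

/*
 * Emit the fence as two back-to-back writes with different values.
 * Assumption: seq - 1 is the previous fence value, so the first write
 * cannot signal anything that was not already signaled.
 */
static void emit_fence_with_workaround(struct fence_slot *slot, uint32_t seq)
{
    eop_write(slot, seq - 1); /* first write, exists only for the workaround */
    eop_write(slot, seq);     /* second write carries the real fence value */
}

/* Wraparound-safe check whether the slot has reached seq. */
static bool fence_signaled(const struct fence_slot *slot, uint32_t seq)
{
    return (int32_t)(slot->value - seq) >= 0;
}
```

Starting from a slot holding 0 and emitting seq = 1, the slot receives two
writes and ends up holding 1; a waiter on 1 is signaled, a waiter on 2 is
not, and at no point does the slot hold a value newer than seq.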

          Any more details about it?

        No, sorry. All I know is that it's a bug in the cache flush logic
        which can be worked around by issuing two writes behind each other
        to the same location.

      So the issue is that the first EOP write does not properly flush the
      cache? Could EVENT_WRITE be used instead of EVENT_WRITE_EOP in this
      workaround to properly flush it without hurting the fence value?

    No, EVENT_WRITE is executed at a different time in the pipeline.

        ...

        Well to be honest on a platform where even two consecutive writes to
        the same location don't work I would have strong doubts that it is
        stable in general.

      Well I think the current situation is that the IRQ triggered by the
      second EOP packet arrives before the second write is finished, not
      that the second write is totally dropped.
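
The ordering described in this paragraph can be pictured with a toy C
model. This is only a sketch of the hypothesis stated above, not a
description of how any real hardware behaves: struct bus_model,
post_write(), land_write() and irq_sees_fence() are hypothetical names,
and the model tracks a single in-flight write.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one write may be posted but not yet visible to the CPU. */
struct bus_model {
    uint32_t fence_mem;    /* value the CPU currently reads */
    uint32_t posted_value; /* value of the write still in flight */
    bool posted;           /* true while a write is in flight */
};

/* The device issues a write; it is not visible until it lands. */
static void post_write(struct bus_model *bus, uint32_t value)
{
    assert(bus && !bus->posted); /* the model tracks one write at a time */
    bus->posted_value = value;
    bus->posted = true;
}

/* The in-flight write becomes visible in memory. */
static void land_write(struct bus_model *bus)
{
    assert(bus);
    if (bus->posted) {
        bus->fence_mem = bus->posted_value;
        bus->posted = false;
    }
}

/* What an interrupt handler would conclude if it read the fence now. */
static bool irq_sees_fence(const struct bus_model *bus, uint32_t seq)
{
    return (int32_t)(bus->fence_mem - seq) >= 0;
}
```

With seq = 5: after post_write(4), land_write(), post_write(5), a handler
running before the second land_write() still reads 4 and treats the fence
as not signaled; once the write lands, the same check succeeds. The write
is late in this model, not lost.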

    Well that sounds like the usual re-ordering problems we have seen
    patches for on Loongson multiple times now.

    And I can only repeat what I've written before: We don't accept
    workarounds in drivers for problems caused by severe platform
    issues.

    Especially when that is clearly against any PCIe specification.

    Regards,

    Christian.

        Regards,
        Christian.