Re: [PATCH 1/2] drm/amdgpu: make duplicated EOP packet for GFX7/8 have real content

Christian König <christian.koenig@xxxxxxx> · Mon, 17 Jun 2024 16:42:26 +0200

    On 17.06.24 at 16:30, Icenowy Zheng wrote:

      On Monday, 2024-06-17 at 15:59 +0200, Christian König wrote:

        On 17.06.24 at 15:43, Icenowy Zheng wrote:

          On Monday, 2024-06-17 at 15:09 +0200, Christian König wrote:

            On 17.06.24 at 15:03, Icenowy Zheng wrote:

              On Monday, 2024-06-17 at 14:35 +0200, Christian König wrote:

                On 17.06.24 at 12:58, Icenowy Zheng wrote:

                  The duplication of EOP packets for GFX7/8, with the former
                  one having seq-1 written and the latter one having seq
                  written, seems to confuse some hardware platforms (e.g.
                  Loongson 7A series PCIe controllers).

                  Make the content of the duplicated EOP packet the same as
                  the real one, only masking any possible interrupts.

                Well, completely NAK to that; exactly that disables the
                workaround.

                The CPU needs to see two different values written here.

              Why does the CPU need to see two different values here? Only
              the second packet will raise an interrupt before and after
              applying this patch, and the first packet's result should just
              be overridden on ordinary platforms. The CPU won't see the
              first one unless it's polling the address at a very short
              interval, so short that the GPU CP couldn't execute 2 commands.

            Yes, exactly that. We need to make two writes, one with the old
            value (seq - 1) and a second with the real value (seq).

            Otherwise it is possible that a polling CPU would see the
            sequence before the second EOP is issued, which results in an
            incoherent view of memory.

          In this case shouldn't we write seq-1 before any work, and then
          write seq after work, like what is done in Mesa?

        No. This hw workaround requires that two consecutive write operations
        happen directly behind each other on the PCIe bus with two different
        values.

      Well to be honest the workaround code in Mesa seems to not be working
      in this way ...

    Mesa doesn't have any workaround for that hw issue, the code there
    uses a quite different approach.

        To make the software logic around that work without any changes we
        use the values seq - 1 and seq because those are guaranteed to be
        different and not trigger any unwanted software behavior.

        Only then can we guarantee that we have a coherent view of system
        memory.

      Any more details about it?

    No, sorry. All I know is that it's a bug in the cache flush logic
    which can be worked around by issuing two writes behind each other to
    the same location.
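
    The write pattern discussed in this thread can be sketched as follows.
    This is only an illustrative model under stated assumptions, not the
    amdgpu implementation: struct eop_write, emit_fence_pair() and
    fence_signaled() are hypothetical names, each EOP is modelled as a
    plain 32-bit store to one fence location, and the waiter is assumed to
    treat the fence as signaled once the stored value has reached seq.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one EOP packet: a store to the fence location. */
struct eop_write {
    uint32_t value;   /* value stored to the fence location */
    bool raise_irq;   /* whether this packet raises an interrupt */
};

/*
 * Emit the two back-to-back writes described in the thread:
 * first seq - 1 without an interrupt, then seq with an interrupt.
 * Unsigned arithmetic makes seq - 1 wrap cleanly when seq == 0.
 */
static size_t emit_fence_pair(struct eop_write *out, uint32_t seq)
{
    out[0] = (struct eop_write){ .value = seq - 1, .raise_irq = false };
    out[1] = (struct eop_write){ .value = seq,     .raise_irq = true  };
    return 2; /* number of writes emitted */
}

/*
 * Assumed waiter logic: wraparound-tolerant "has the fence reached seq?".
 * The cast to int32_t relies on the usual two's-complement conversion.
 */
static bool fence_signaled(uint32_t fence_value, uint32_t seq)
{
    return (int32_t)(fence_value - seq) >= 0;
}
```

    In this model the two values always differ, and a CPU that happens to
    read the fence between the two writes sees seq - 1, which
    fence_signaled() does not treat as complete for seq.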

      BTW in this case, could I try to write it 3 times instead of 2,
      with seq-1, seq and seq?

    That could potentially work as well, but at some point we would need
    to increase the EOP ring buffer size or could run into performance
    issues.

          As far as I can see, Mesa uses another command buffer to emit an
          EVENT_WRITE_EOP writing 0, and commits this command buffer before
          the real command buffer.

              Or do you mean the GPU needs to see two different values being
              written, or they will be merged into only one write request?

              Please give out more information about this workaround,
              otherwise the GPU hang problem on Loongson platforms will
              persist.

            Well if Loongson can't handle two consecutive write operations to
            the same address with different values then you have a massive
            platform bug.

          I think the issue is triggered when two consecutive write
          operations and one IRQ are present, which is exactly the case of
          this function.

        Well then you have a massive platform bug.

        Two consecutive writes to the same bus address are perfectly legal
        from the PCIe specification and can happen all the time, even without
        this specific hw workaround.

      Yes I know it, and I am not from Loongson, just some user trying to
      mess around with it.

    Well to be honest on a platform where even two consecutive writes to
    the same location don't work I would have strong doubts that it is
    stable in general.

    Regards,

    Christian.