> On Monday, 2024-06-17 at 15:59 +0200, Christian König wrote:
>> On 17.06.24 at 15:43, Icenowy Zheng wrote:
>>> On Monday, 2024-06-17 at 15:09 +0200, Christian König wrote:
>>>> On 17.06.24 at 15:03, Icenowy Zheng wrote:
>>>>> On Monday, 2024-06-17 at 14:35 +0200, Christian König wrote:
>>>>>> On 17.06.24 at 12:58, Icenowy Zheng wrote:
>>>>>>> The duplication of EOP packets for GFX7/8, with the former one writing seq-1 and the latter one writing seq, seems to confuse some hardware platforms (e.g. Loongson 7A series PCIe controllers). Make the content of the duplicated EOP packet the same as the real one, only masking any possible interrupts.
>>>>>> Well, completely NAK to that; exactly that disables the workaround. The CPU needs to see two different values written here.
>>>>> Why does the CPU need to see two different values here? Only the second packet will raise an interrupt before and after applying this patch, and the first packet's result should just be overridden on ordinary platforms. The CPU won't see the first one unless it's polling the address at a very short interval, so short that the GPU CP couldn't execute 2 commands.
>>>> Yes, exactly that. We need to make two writes, one with the old value (seq - 1) and a second with the real value (seq). Otherwise it is possible that a polling CPU would see the sequence before the second EOP is issued, which results in an incoherent view of memory.
>>> In this case shouldn't we write seq-1 before any work, and then write seq after work, like what is done in Mesa?
>> No. This hw workaround requires that two consecutive write operations happen directly behind each other on the PCIe bus with two different values.
> Well, to be honest, the workaround code in Mesa seems to not be working in this way ...
Mesa doesn't have any workaround for that hw issue, the code there uses a quite different approach.
>> To make the software logic around that work without any changes, we use the values seq - 1 and seq because those are guaranteed to be different and do not trigger any unwanted software behavior. Only then can we guarantee that we have a coherent view of system memory.
> Any more details about it?
No, sorry. All I know is that it's a bug in the cache flush logic which can be worked around by issuing two writes directly behind each other to the same location.
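To illustrate why seq - 1 and seq are safe values for the two writes, here is a minimal sketch. It is only a software model under stated assumptions, not the amdgpu code: `fence_slot`, `emit_two_writes` and `fence_signaled` are hypothetical names, and the wrap-safe comparison is an assumed stand-in for whatever check the polling side really uses. It shows that a CPU which happens to observe the first write still treats the fence for seq as not yet signaled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the memory location the GPU writes and the CPU polls. */
struct fence_slot {
    volatile uint64_t value;
};

/*
 * Hypothetical model of the two back-to-back writes described above:
 * first the old value (seq - 1), then the real value (seq).
 * observed[0] and observed[1] record what a polling CPU could read
 * after each write.
 */
static void emit_two_writes(struct fence_slot *slot, uint64_t seq,
                            uint64_t observed[2])
{
    assert(slot != NULL && observed != NULL);

    slot->value = seq - 1;      /* first write: guaranteed to differ from seq */
    observed[0] = slot->value;

    slot->value = seq;          /* second write: the real sequence number */
    observed[1] = slot->value;
}

/* Assumed wrap-safe check: the fence is signaled once seen has reached wanted. */
static bool fence_signaled(uint64_t seen, uint64_t wanted)
{
    return (int64_t)(seen - wanted) >= 0;
}
```

Calling `emit_two_writes(&slot, 42, observed)` leaves `observed[0]` at 41 and `observed[1]` at 42, so `fence_signaled(observed[0], 42)` is false and `fence_signaled(observed[1], 42)` is true. The model says nothing about the cache flush bug itself or about how the writes behave on the PCIe bus.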
> BTW, in this case, could I try to write it 3 times instead of 2, with seq-1, seq and seq?
That could potentially work as well, but at some point we would need to increase the EOP ring buffer size or could run into performance issues.
>>> From what I see, Mesa uses another command buffer to emit an EVENT_WRITE_EOP writing 0, and commits this command buffer before the real command buffer.
>>>>> Or do you mean the GPU needs to see two different values being written, or they will be merged into only one write request? Please give more information about this workaround, otherwise the GPU hang problem on Loongson platforms will persist.
>>>> Well, if Loongson can't handle two consecutive write operations to the same address with different values then you have a massive platform bug.
>>> I think the issue is triggered when two consecutive write operations and one IRQ are present, which is exactly the case of this function.
>> Well then you have a massive platform bug. Two consecutive writes to the same bus address are perfectly legal from the PCIe specification and can happen all the time, even without this specific hw workaround.
> Yes, I know it, and I am not from Loongson, just some user trying to mess around with it.
Well, to be honest, on a platform where even two consecutive writes to the same location don't work I would have strong doubts that it is stable in general.
Regards,
Christian.