On 06/08/18 11:25, Mikulas Patocka wrote:
[...]
>> None of this explains why some transactions fail to make it across
>> entirely. The overlapping writes in question write the same data to
>> the memory locations that are covered by both, and so the ordering in
>> which the transactions are received should not affect the outcome.
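
The ordering argument above can be checked with a minimal sketch in plain C against ordinary RAM. The chunk size, overlap, and function names are assumptions chosen for illustration; they are not taken from the test program discussed in this thread. Two overlapping 16-byte stores from the same source are applied in both orders, and each order leaves the destination identical to the source.

```c
#include <stdint.h>
#include <string.h>

enum { CHUNK = 16, OVERLAP = 8, TOTAL = 2 * CHUNK - OVERLAP };

/* Issue two 16-byte stores covering bytes [0,16) and [8,24) of dst.
 * Both copy from the same src, so the shared bytes [8,16) receive
 * identical data whichever store lands last. */
void overlapping_store(uint8_t *dst, const uint8_t *src, int second_first)
{
    if (second_first) {
        memcpy(dst + CHUNK - OVERLAP, src + CHUNK - OVERLAP, CHUNK);
        memcpy(dst, src, CHUNK);
    } else {
        memcpy(dst, src, CHUNK);
        memcpy(dst + CHUNK - OVERLAP, src + CHUNK - OVERLAP, CHUNK);
    }
}

/* Returns 1 if both orderings leave dst identical to src, else 0. */
int orders_agree(void)
{
    uint8_t src[TOTAL], a[TOTAL] = {0}, b[TOTAL] = {0};

    for (int i = 0; i < TOTAL; i++)
        src[i] = (uint8_t)(i + 1);

    overlapping_store(a, src, 0);
    overlapping_store(b, src, 1);

    return memcmp(a, src, TOTAL) == 0 && memcmp(b, src, TOTAL) == 0;
}
```

This only models the software-visible result; it says nothing about how an interconnect or PCIe controller merges the two transactions.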
> You're right that the corruption couldn't be explained just by reordering
> writes. My hypothesis is that the PCIe controller tries to disambiguate
> the overlapping writes, but the disambiguation logic was not tested and it
> is buggy. If there's a barrier between the overlapping writes, the PCIe
> controller won't see any overlapping writes, so it won't trigger the
> faulty disambiguation logic and it works.
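
A rough sketch of the workaround described above, a barrier placed between the overlapping writes, might look like the following. The function names are hypothetical, and the choice of `dmb sy` is an assumption; the thread does not say which barrier was used. Whether the compiler turns each 16-byte `memcpy` into a single `stp` is also compiler-dependent.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Full barrier. "dmb sy" on arm64 is an assumed choice; other targets
 * fall back to the GCC/Clang fence builtin so the sketch still builds. */
static inline void write_barrier(void)
{
#if defined(__aarch64__)
    __asm__ volatile("dmb sy" ::: "memory");
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

/* Copy len bytes in 16-byte chunks. When len is not a multiple of 16,
 * the tail is covered by one final 16-byte store that overlaps the
 * previous chunk; the barrier is issued just before that store. */
void copy_with_barrier(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t off = 0;

    if (len < 16) {             /* too short for a 16-byte chunk */
        memcpy(dst, src, len);
        return;
    }

    while (off + 16 <= len) {
        memcpy(dst + off, src + off, 16);
        off += 16;
    }

    if (off < len) {
        write_barrier();        /* separate the overlapping stores */
        memcpy(dst + len - 16, src + len - 16, 16);
    }
}
```

Run against ordinary RAM this behaves like a plain copy; any difference would only show up on a mapping where the hardware mishandles the overlap.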
> Could the ARM engineers look if there's some chicken bit in Cortex-A72
> that could insert barriers between non-cached writes automatically?
I don't think there is, and even if there was I imagine it would have a
pretty hideous effect on non-coherent DMA buffers and the various other
places in which we have Normal-NC mappings of actual system RAM.
> I observe these kinds of corruptions:
> - failing to write a few bytes
That could potentially be explained by the reordering/atomicity issues
Matt mentioned, i.e. the load is observing part of the store, before the
store has fully completed.
> - writing a few bytes that were written 16 bytes before
> - writing a few bytes that were written 16 bytes after
Those sound more like the interconnect or root complex ignoring the byte
strobes on an unaligned burst, of which I think the simplistic view
would be "it's broken".
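
To illustrate what ignoring the byte strobes would mean, here is a toy software model. It is not a description of the Armada 8k interconnect. The 16-byte beat width, the function names, and the fill values are all assumptions. A write beat carries 16 data lanes plus a 16-bit strobe mask, and a correct target writes only the lanes whose strobe bit is set.

```c
#include <stdint.h>

enum { BEAT = 16 };

/* Apply one 16-byte write beat to a beat-aligned region of memory.
 * Bit i of strobe enables lane i. With honor_strobe == 0 the model
 * mimics a target that writes every lane regardless of the mask. */
void apply_beat(uint8_t *mem, const uint8_t lanes[BEAT],
                uint16_t strobe, int honor_strobe)
{
    for (int i = 0; i < BEAT; i++)
        if (!honor_strobe || (strobe & (1u << i)))
            mem[i] = lanes[i];
}

/* An unaligned write starting at byte offset 4 of the beat: lanes 0..3
 * are disabled. Returns how many of those four bytes were overwritten. */
int clobbered_bytes(int honor_strobe)
{
    uint8_t mem[BEAT], lanes[BEAT];
    int clobbered = 0;

    for (int i = 0; i < BEAT; i++) {
        mem[i] = 0xAA;      /* existing memory contents */
        lanes[i] = 0x55;    /* whatever is on the data lanes */
    }

    apply_beat(mem, lanes, 0xFFF0, honor_strobe);

    for (int i = 0; i < 4; i++)
        if (mem[i] != 0xAA)
            clobbered++;

    return clobbered;
}
```

In this model `clobbered_bytes(1)` returns 0 and `clobbered_bytes(0)` returns 4. If the disabled lanes happened to carry leftovers from a neighbouring beat, the overwritten bytes would look like data displaced by one beat width, but that part is speculation.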
FWIW I stuck my old Nvidia 7600GT card in my Arm Juno r2 board (2x
Cortex-A72), built your test program natively with GCC 8.1.1 at -O2, and
it's still happily flickering pixels in the corner of the console after
nearly an hour (in parallel with some iperf3 just to ensure plenty of
PCIe traffic). I would strongly suspect this issue is particular to
Armada 8k, so it's probably one for the Marvell folks to take a closer
look at - I believe some previous interconnect issues on those SoCs were
actually fixable in firmware.
Robin.