On 6 August 2018 at 14:42, Robin Murphy <robin.murphy@xxxxxxx> wrote:
> On 06/08/18 11:25, Mikulas Patocka wrote:
> [...]
>>>
>>> None of this explains why some transactions fail to make it across
>>> entirely. The overlapping writes in question write the same data to
>>> the memory locations that are covered by both, and so the ordering in
>>> which the transactions are received should not affect the outcome.
>>
>>
>> You're right that the corruption couldn't be explained just by reordering
>> writes. My hypothesis is that the PCIe controller tries to disambiguate
>> the overlapping writes, but the disambiguation logic was not tested and it
>> is buggy. If there's a barrier between the overlapping writes, the PCIe
>> controller won't see any overlapping writes, so it won't trigger the
>> faulty disambiguation logic and it works.
>>
>> Could the ARM engineers look if there's some chicken bit in Cortex-A72
>> that could insert barriers between non-cached writes automatically?
>
>
> I don't think there is, and even if there was I imagine it would have a
> pretty hideous effect on non-coherent DMA buffers and the various other
> places in which we have Normal-NC mappings of actual system RAM.
>
>> I observe these kinds of corruptions:
>> - failing to write a few bytes
>
>
> That could potentially be explained by the reordering/atomicity issues Matt
> mentioned, i.e. the load is observing part of the store, before the store
> has fully completed.
>

OK, so that means the unaligned transaction gets split, and the
subtransactions are reordered with the aligned transaction so that the
sub-writes contain stale values from the sub-reads?

>> - writing a few bytes that were written 16 bytes before
>> - writing a few bytes that were written 16 bytes after
>
>
> Those sound more like the interconnect or root complex ignoring the byte
> strobes on an unaligned burst, of which I think the simplistic view would be
> "it's broken".
>
> FWIW I stuck my old Nvidia 7600GT card in my Arm Juno r2 board (2x
> Cortex-A72), built your test program natively with GCC 8.1.1 at -O2, and
> it's still happily flickering pixels in the corner of the console after
> nearly an hour (in parallel with some iperf3 just to ensure plenty of PCIe
> traffic). I would strongly suspect this issue is particular to Armada 8k, so
> it's probably one for the Marvell folks to take a closer look at - I believe
> some previous interconnect issues on those SoCs were actually fixable in
> firmware.
>

IIRC that was DVM dropping a few VA bits at the top, and a single MMIO
control bit to put it back into 'non-broken' mode.
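
For reference, the point made earlier in the thread (two overlapping writes
that store the same data to the shared bytes should give the same result in
either order) can be sketched in plain C. This is only an illustration on
ordinary cached memory, not the test program discussed above: the offsets,
the sizes and the names write_pair() and order_independent() are made up for
the example, and the C11 atomic_thread_fence() is merely a stand-in for
whatever barrier a real driver would place between stores to a PCIe mapping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: an aligned 16-byte write at offset 0 and an
 * unaligned 16-byte write at offset 13, so bytes 13..15 are covered by both. */
enum { BUF_LEN = 32, A_OFF = 0, B_OFF = 13, WR_LEN = 16 };

/* Issue the two overlapping writes in either order, optionally with a fence
 * in between. Both writes copy from the same source offsets, so the shared
 * bytes receive identical data whichever write lands last. */
static void write_pair(uint8_t *dst, const uint8_t *src,
                       int b_first, int use_barrier)
{
    size_t first = b_first ? B_OFF : A_OFF;
    size_t second = b_first ? A_OFF : B_OFF;

    memcpy(dst + first, src + first, WR_LEN);
    if (use_barrier)
        atomic_thread_fence(memory_order_seq_cst); /* stand-in barrier only */
    memcpy(dst + second, src + second, WR_LEN);
}

/* Returns 1 if both orderings leave the destination identical and equal to
 * the source over the written range, which is what correct hardware should
 * guarantee for writes of this shape. */
static int order_independent(void)
{
    uint8_t src[BUF_LEN], d1[BUF_LEN], d2[BUF_LEN];

    assert(B_OFF + WR_LEN <= BUF_LEN);
    for (int i = 0; i < BUF_LEN; i++)
        src[i] = (uint8_t)(i * 7 + 1);
    memset(d1, 0, sizeof(d1));
    memset(d2, 0, sizeof(d2));

    write_pair(d1, src, 0, 0); /* aligned first, no barrier */
    write_pair(d2, src, 1, 1); /* unaligned first, with barrier */

    return memcmp(d1, d2, BUF_LEN) == 0 &&
           memcmp(d1, src, B_OFF + WR_LEN) == 0;
}
```

On normal RAM order_independent() returns 1; the corruption reported in this
thread would correspond to the equivalent comparison failing when dst is a
mapping of PCIe device memory.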