Hi Matt,

On Fri, Aug 03, 2018 at 03:44:44PM -0500, Matt Sealey wrote:
> On 3 August 2018 at 13:25, Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> > On Fri, 3 Aug 2018, Ard Biesheuvel wrote:
> >> Are we still talking about overlapping unaligned accesses here? Or do
> >> you see other failures as well?
> >
> > Yes - it is caused by overlapping unaligned accesses inside memcpy.
> > When I put "dmb sy" between the overlapping accesses in
> > glibc/sysdeps/aarch64/memcpy.S, this program doesn't detect any
> > memory corruption.
>
> It is a symptom of generating reorderable accesses inside memcpy. It's nothing
> to do with alignment, per se (see below). A dmb sy just hides the symptoms.
>
> What we're talking about here - yes, Ard, within certain amounts of
> reason - is that you cannot use PCI BAR memory as 'Normal' - certainly
> never cacheable memory, but Normal NC isn't good either. That is that
> your CPU cannot post writes or reads towards PCI memory spaces unless
> it is dealing with it as Device memory or very strictly controlled use
> of Normal Non-Cacheable.

I disagree that it's not possible to use Normal NC on prefetchable BARs.
This particular case looks more like a hardware issue to me, as other
platforms don't exhibit the same behaviour. Note that allowing Normal NC
mapping of prefetchable BARs together with unaligned accesses is also a
requirement for SBSA-compliant platforms ([1]; though I don't find the
text in D.2 very clear).

> >> > I tried to run it on system RAM mapped with the NC attribute and I didn't
> >> > get any corruption - that suggests the the bug may be in the PCIE
> >> > subsystem.
>
> Pure fluke.

Do you mean you don't expect Mikulas' test to run fine on system RAM
with Normal NC mapping? We would have bigger issues if this was the
case.

> I'll give a simple explanation. The Arm Architecture defines
> single-copy and multi-copy atomic transactions.
> You can treat
> 'single-copy' to mean that that transaction cannot be made partial, or
> reordered within itself, i.e. it must modify memory (if it is a store)
> in a single swift effort and any future reads from that memory must
> return the FULL result of that write.
>
> Multi-copy means it can be resized and reordered a bit. Will Deacon is
> going to crucify me for simplifying it, but.. let's proceed with a
> poor example:

Not sure about Will but I think you got them wrong ;). The single/multi
copy atomicity is considered in respect to (multiple) observers, a.k.a.
masters, and nothing to do with reordering a bit (see B2.2 in the ARMv8
ARM).

> STR X0,[X1] on a 32-bit bus cannot ever be single-copy atomic, because
> you cannot write 64-bits of data on a 32-bit bus in a single,
> unbreakable transaction. This is because from one bus cycle to the
> next, one half of the transaction will be in a different place. Your
> interconnect will have latched and buffered 32-bits and the CPU is
> holding the other.

It depends on the implementation, interconnect, buses. Since single-copy
atomicity refers to master accesses, the above transaction could be a
burst of two 32-bit writes and treated atomically by the interconnect
(i.e. not interruptible).

> STP X0, X1, [X2] on a 64-bit bus can be single-copy atomic with
> respect to the element size. But it is on the whole multi-copy atomic
> - that is to say that it can provide a single transaction with
> multiple elements which are transmitted, and those elements could be
> messed with on the way down the pipe.

This has nothing to do with multi-copy atomicity which actually refers
to multiple observers seeing the same write. The ARM architecture is not
exactly multi-copy atomic anyway (rather "other-multi-copy atomic").

Architecturally, STP is treated as two single-copy accesses (as you
mentioned already).

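To make the access pattern under discussion concrete, here is a minimal
sketch, assuming a memcpy-style copy of 8 to 16 bytes done as two 8-byte
accesses that overlap in the middle and may be unaligned. The helper
names copy8to16 and check_overlapping_copies are made up for
illustration; this is not the glibc code, and a compiler is free to
lower the inner memcpy calls however it likes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not glibc's implementation): copy 8..16 bytes
 * using two 8-byte accesses that may overlap and may be unaligned. */
static void copy8to16(void *dst, const void *src, size_t n)
{
    uint64_t head, tail;

    assert(n >= 8 && n <= 16);

    /* Load the first and the last 8 bytes before storing anything. */
    memcpy(&head, src, 8);
    memcpy(&tail, (const unsigned char *)src + n - 8, 8);

    /* The two stores overlap by (16 - n) bytes whenever n < 16. */
    memcpy(dst, &head, 8);
    memcpy((unsigned char *)dst + n - 8, &tail, 8);
}

/* Returns 0 if every length and destination offset copies correctly
 * and the byte just past the copy is left untouched, -1 otherwise. */
static int check_overlapping_copies(void)
{
    unsigned char src[16], dst[32];

    for (size_t i = 0; i < sizeof(src); i++)
        src[i] = (unsigned char)(i + 1);

    for (size_t off = 0; off < 8; off++) {
        for (size_t n = 8; n <= 16; n++) {
            memset(dst, 0xAA, sizeof(dst));
            copy8to16(dst + off, src, n);
            if (memcmp(dst + off, src, n) != 0)
                return -1;
            if (dst[off + n] != 0xAA)
                return -1;
        }
    }
    return 0;
}
```

On ordinary system RAM check_overlapping_copies() should return 0. The
sketch only illustrates the overlapping pattern; it does not reproduce
the PCIe behaviour, which would need the destination to be a mapped BAR.
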
Anyway, the single/multiple copy atomicity is irrelevant for the C test
from Mikulas where you have the same observer (the CPU) writing and
reading the memory. I wonder whether writing a byte and reading a long
back would show similar corruption.

> And the granularity of the hazarding in your system, from the CPU
> store buffer to the bus interface to the interconnect buffering to the
> PCIe bridge to the PCIe EP is.. what? Not the same all the way down,
> I'll bet you.

I think hazarding is what goes wrong here, especially with overlapping
unaligned addresses. However, I disagree that it is impossible to
implement this properly on a platform with PCIe so that Normal NC
mappings can be used.

Thanks.

[1] https://developer.arm.com/docs/den0029/latest/server-base-system-architecture

--
Catalin