On Tue, 26 Feb 2019, Will Deacon wrote:

> > If they are the same device (just different data ports), I'd
> > *definitely* expect them to be ordered.
> >
> > We have tons of code that depends on that. Almost every driver out
> > there, in fact.
> >
> > So we need the mb() on alpha to guarantee the access ordering on the
> > CPU side, and then PCI itself ends up guaranteeing that accesses to
> > the same device will remain ordered outside the CPU.
> >
> > Agreed?
>
> Yup, agreed. I'd consider all those ports to be the same endpoint, so we're
> good.

FAOD, I think this assumption/requirement only applies to the plain
accessors (`inX', `readX', `ioreadX', etc.).

For performance reasons we may decide sometime to opt in for accessors
that do not suffer from the requirement to be strongly ordered WRT each
other, for the benefit of architectures that are not strongly ordered
with MMIO and that suffer a lot from serialising accesses that do not
really care, e.g. where you need to load a bunch of device registers or
maybe even device RAM in any order before making a serialised final
request to accept the values loaded.

I made provisions for that with a driver I recently added with commit
61414f5ec983 ("FDDI: defza: Add support for DEC FDDIcontroller 700
TURBOchannel adapter"), where locally defined accessor macros suffixed
with `_o' and `_u' denote accesses that have to be strongly ordered and
can be weakly ordered respectively WRT each other. Right now they all
expand to the respective `_relaxed' accessors (with a lone `dma_rmb'
inserted appropriately; yes, the device does DMA one way only, and the
other one is PIO with a lot of MMIO traffic to board RAM that would
benefit from omitting barriers); however, they can be replaced with
references to truly unordered accessors if we ever have them.

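
For illustration only, here is a minimal user-space sketch of such a
naming scheme. It is not the driver's actual code: `sketch_readl_relaxed',
`sketch_writel_relaxed', `struct sketch_regs' and `sketch_submit' are all
made-up names, and the two stand-in accessors merely take the place of the
kernel's `readl_relaxed' and `writel_relaxed' so that the example builds
outside the kernel. It assumes a device with a small buffer that may be
filled in any order, followed by one request that has to come last.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-ins for the kernel's `_relaxed' MMIO accessors, so
 * that the sketch builds in user space; a real driver would use
 * readl_relaxed()/writel_relaxed() from <linux/io.h> instead.
 */
static inline uint32_t sketch_readl_relaxed(const volatile uint32_t *addr)
{
	return *addr;
}

static inline void sketch_writel_relaxed(uint32_t val, volatile uint32_t *addr)
{
	*addr = val;
}

/*
 * `_o': accesses that have to be strongly ordered WRT each other.
 * `_u': accesses that can be weakly ordered WRT each other.
 * Both expand to the same relaxed accessor for now; only the `_u' set
 * would need to change if truly unordered accessors became available.
 */
#define readl_o  sketch_readl_relaxed
#define readl_u  sketch_readl_relaxed
#define writel_o sketch_writel_relaxed
#define writel_u sketch_writel_relaxed

#define SKETCH_BUF_WORDS 4

/* Hypothetical device layout: some board RAM plus a command register. */
struct sketch_regs {
	volatile uint32_t buf[SKETCH_BUF_WORDS];
	volatile uint32_t cmd;
};

/* Load the buffer in any order, then make one final request. */
static void sketch_submit(struct sketch_regs *regs, const uint32_t *data,
			  size_t n)
{
	size_t i;

	assert(n <= SKETCH_BUF_WORDS);
	for (i = 0; i < n; i++)
		writel_u(data[i], &regs->buf[i]);	/* order among these is irrelevant */

	/*
	 * Assumption: if `_u' ever mapped to truly unordered accessors,
	 * a barrier would likely be needed here before the request.
	 */
	writel_o((uint32_t)n, &regs->cmd);		/* the request that must come last */
}
```

A caller would fill a `struct sketch_regs' via `sketch_submit(&regs, data,
n)'; the point is only that the call sites already record which accesses
care about ordering, whatever the macros expand to.
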
That piece of hardware is, however, rather peculiar and not an example
of the most common design seen nowadays, and I am not sure if the extra
maintenance burden across all the ports for any additional accessors
would be outweighed by the benefit for the weakly ordered MMIO
architectures (where an execution stall can indeed count in hundreds of
clock cycles per barrier inserted) combined with the appreciation (i.e.
actual use) level from driver writers who do not necessarily grok all
that weak ordering business.

  Maciej