On Thu, Sep 10, 2020 at 04:17:21PM +0100, Lorenzo Pieralisi wrote: > > In other words, the only usage here is only about Write. The CPU > > should never, ever, generate a MemRD TLP. The code never does a read > > explicitly. > > On arm64 pgprot_writecombine() is speculative memory (normal > non-cacheable), which may not do what you expect from it. Can you explain what this actually does on ARM? Can it ever speculate loads across page boundaries, or speculate loads that never exist in the program? ie will we get random unpredicable MemRds? Does it/could it "combine writes"? > > If the CPU fails to generate a 64 byte TLP then the device will still > > operate correctly but does a different, slower, flow. > > Side note: on ARM that TLP is not a native interconnect transaction, > reworded, it depends on what the system-bus->PCI logic does in > this respect. I think the issue is that ARM never defined what the bits set by pgprot_writecombine() do at a system level so we see implementations that do not cause write combining at the PCI-E interface for those bits. (I assume from what I've heard) > That's why I looped you in - that's what worries me about "enabling" > arch_can_pci_mmap_wc() on arm64. If we enable it and we have perf > regressions that's not OK. > > Or we *can* enable arch_can_pci_mmap_wc() but force the mellanox > driver (or more broadly all drivers following this message push > semantics) to use "something else" for WC detection. arch_can_pci_mmap_wc() really only controls the sysfs resource file and it seems very unclear who in userspace uses that these days. vfio is now the right way to do that stuff. I don't see an obvious way to get WC memory in VFIO though... So, I don't think this patch will break anything, however I also don't see a point to it as nobody should be using the sys/resource files these days.. Curios why Clint proposed it? > > Thus, mlx5 switched to doing a runtime WC test to determine if the CPU > > actually supports WC or not. If the arch can reliably tell the driver > > then this test could be avoided. Based on this test the WC mode is > > allowed for userspace. > > Can you elaborate on this runtime test please ? mlx5 is a network device, so the purpose of the 64 byte TLP is to push, say, a network send command to the device fully formed (eg with the DMA list, etc). Otherwise we can only push an indication to read the 64 byte command via DMA from the command ring. Avoiding the extra DMA is a performance win. The runtime test pushes a 64 byte command to the device. If it arrives as a single TLP then the device executes that command, otherwise it triggers DMA and reads the 64 bytes from host memory. The test arranges that the command pushed via the TLP and the command fetched via DMA are different. The test can then determine which path the device took and thus know directly if the implementation can deliver the required TLP or not. > > configured in a way that can't run the test, here we use > > arch_can_pci_mmap_wc() to guess if the CPU has working WC or not. > > Ideally an arch would return 1 only when the CPU has working WC. > > Which means we can guarantee the TLP packet you mentioned above I > guess ? Yes. For a PCI driver I think this is the only thing that matters for "write combining". mlx5 is pretty strict, other drivers might gain advantage from smaller or more fragmented transfers, eg 32 byte chunks or something. > > Depending on workload WC may not be a win. In those cases userspace > > will select NC. Thus the same PCI MMIO BAR region can have a mixture > > of pages with WC and NC mappings to userspace. > > > > For DEVICE_GRE.. For years now, many deployments of ARM & mlx5 devices > > are using an out of tree patch to use DEVICE_GRE for WC on mlx5. This > > seems to be the preferred working configuration on at least some ARM > > SOCs. So far nobody from the ARM world has shown interest in making a > > mainline solution. :( > > > > I can't recall if this is because the relevant ARM SOC's don't support > > pgprot_writecombine(), or it doesn't work properly. > > > > I was told the reason ARM never enabled WC was because unaligned > > When you say "enabled WC" I assume you mean making: > > pgprot_writecombine() == DEVICE_GRE Well, when I say "enabled WC" I mean it seems existing pgprot_writecombine() does not give write combining at PCI-E and the option that does give write combining is DEVICE_GRE, that has the alignment problem and can't be used. At least on some SOCs. > > access to WC memory was not supported, and there were existing drivers > > that did unaligned writes that would malfunction. I thought this meant > > that pgprot_writecombine() was non-working in ARM Linux? > > On arm64 pgprot_writecombine() is normal non-cacheable memory at the > moment - it works but that does not precisely do what you *expect* from > arch_can_pci_mmap_wc(), that's the whole point I am making. Most places use pgprot_writecombine() blindly and just ignore arch_can_pci_mmap_wc() - I suppose with the idea that pgprot_writecombine() will be a harmless NOP if it isn't supported? mlx5 is possibly the only place where someone actually tested and cared about the performance difference between using a WC specific algorithm or not. > > mlx5 is more or less a representative user WC for this kind of > > 'message push' methodology. Several other RDMA devices do this as > > well. The methodology is important enough that recent Intel CPUs have > > a dedicated instruction to push a 128 byte message in a single TLP > > avoiding this whole WC mess. > > > > Frankly, I think the kernel should introduce a well defined pgprot for > > this working mode that all archs can agree upon. It should include the > > alignment requirement, message push function, CPU barrier macros, and > > locking macros that are needed to use this facility correctly. > > > > Defined in a way that is compatible with DEVICE_GRE and can be used by > > these 'message push' drivers. That would switch alway most of the > > users in the kernel today. > > That's probably the way forward - I still have concerns about this > patch as it stands given your clarifications above. I would very much like to see some support for DEVICE_GRE (at least on the SOCs that require it) in ARM going forward. For whatever reason this seems to be necessary to get write PCI combining behavior on enough ARM hardware that it should be supported in the upstream kernel. Jason