On Thu, Sep 10, 2020 at 10:46:00AM +0100, Lorenzo Pieralisi wrote: > [+Jason] > > On Tue, Sep 08, 2020 at 09:33:42AM +1000, Benjamin Herrenschmidt wrote: > > On Thu, 2020-09-03 at 12:08 +0100, Lorenzo Pieralisi wrote: > > > > It's been what other architectures have been doing for mroe than a > > > > decade without significant issues... I don't think you should worry > > > > too > > > > much about this. > > > > > > Minus what I wrote above, I agree with you. I'd still be able to > > > understand what this patch changes in the mellanox driver HW > > > handling though - not sure what they expect from > > > arch_can_pci_mmap_wc() > > > returning 1. > > > > I don't know enough to get into the finer details but looking a bit it > > seems when this is set, they allow extra ioctls to create buffers > > mapped with pgprot_writecombine(). > > > > I suppose this means faster MMIO backet buffers for small packets (ie, > > non-DMA use case). > > > > Also note that mlx5_ib_test_wc() only uses arch_can_pci_mmap_wc() for a > > non-ROCE ethernet port on a PF... For anyting else, it just seems to > > actually try to do it and see what happens :-) > > > > Leon: Can you clarify the use of arch_can_pci_mmap_wc() in mlx5 and > > whether you see an issue with enabling this on arm64 ? > > Hi Jason, > > I was wondering if you could help us with this question, we are trying > to understand what enabling arch_can_pci_mmap_wc() on arm64 would cause > in mellanox drivers wrt mappings and whether there is an expected > behaviour behind them, in particular whether there is an implicit > reliance on x86 write-combine arch/interconnect details. Looking back at this big thread, let me add some perspective Mellanox drivers have a performance optimization where a 64 byte MemWr TLP from the root complex to the MMIO BAR will perform better, often quite a bit better. We run WC in full QA'd production on PPC, ARM and x86. The userspace generates a burst of sequential, aligned 8 byte CPU writes to the MMIO address and triggers an arch specific CPU barrier to flush/fence the CPU WC buffer. At this point the CPU should emit the 64 byte TLP toward the device ASAP. In other words, the only usage here is only about Write. The CPU should never, ever, generate a MemRD TLP. The code never does a read explicitly. If the CPU fails to generate a 64 byte TLP then the device will still operate correctly but does a different, slower, flow. If the CPU consistently fails WC then the overhead of trying the WC flow is a notable net performance loss, and on these CPUs we want to use only 8 byte write to the MMIO BAR, with NC memory. There are many important details about how this works and how this must interact with the CPU barriers and locking. On x86, arch_can_pci_mmap_wc() is basically meaningless. It indicates there is a chance that pgprot_writecombine() could work. It can also be 0 and write combining will work just fine :\. Thus, mlx5 switched to doing a runtime WC test to determine if the CPU actually supports WC or not. If the arch can reliably tell the driver then this test could be avoided. Based on this test the WC mode is allowed for userspace. The one call to arch_can_pci_mmap_wc() is in a case where the HW is configured in a way that can't run the test, here we use arch_can_pci_mmap_wc() to guess if the CPU has working WC or not. Ideally an arch would return 1 only when the CPU has working WC. Depending on workload WC may not be a win. In those cases userspace will select NC. Thus the same PCI MMIO BAR region can have a mixture of pages with WC and NC mappings to userspace. For DEVICE_GRE.. For years now, many deployments of ARM & mlx5 devices are using an out of tree patch to use DEVICE_GRE for WC on mlx5. This seems to be the preferred working configuration on at least some ARM SOCs. So far nobody from the ARM world has shown interest in making a mainline solution. :( I can't recall if this is because the relevant ARM SOC's don't support pgprot_writecombine(), or it doesn't work properly. I was told the reason ARM never enabled WC was because unaligned access to WC memory was not supported, and there were existing drivers that did unaligned writes that would malfunction. I thought this meant that pgprot_writecombine() was non-working in ARM Linux? So, bit surprised to see a patch messing with arch_can_pci_mmap_wc() and not changing the defintion of pgprot_writecombine() ? mlx5 is more or less a representative user WC for this kind of 'message push' methodology. Several other RDMA devices do this as well. The methodology is important enough that recent Intel CPUs have a dedicated instruction to push a 128 byte message in a single TLP avoiding this whole WC mess. Frankly, I think the kernel should introduce a well defined pgprot for this working mode that all archs can agree upon. It should include the alignment requirement, message push function, CPU barrier macros, and locking macros that are needed to use this facility correctly. Defined in a way that is compatible with DEVICE_GRE and can be used by these 'message push' drivers. That would switch alway most of the users in the kernel today. Jason