On Sun, Mar 17, 2024, at 12:54, Niklas Cassel wrote:
> On Fri, Mar 15, 2024 at 06:29:52PM +0100, Arnd Bergmann wrote:
>> On Fri, Mar 15, 2024, at 07:44, Manivannan Sadhasivam wrote:
>>
>> I think there are three separate questions here when talking about
>> a scenario where a PCI master accesses memory behind a PCI endpoint:
>
> I think the question is if the PCI epf-core, which runs on the endpoint
> side, and which calls dma_alloc_coherent() to allocate backing memory
> for a BAR, can set/mark the Prefetchable bit for the BAR (if we also
> set/mark the BAR as a 64-bit BAR).
>
> The PCIe 6.0 spec, 7.5.1.2.1 Base Address Registers (Offset 10h - 24h),
> states:
> "Any device that has a range that behaves like normal memory should mark
> the range as prefetchable. A linear frame buffer in a graphics device is
> an example of a range that should be marked prefetchable."
>
> Doesn't the backing memory allocated for a specific BAR using
> dma_alloc_coherent() on the EP side behave like normal memory from the
> host's point of view?

I'm not sure I follow this logic: if the device wants the buffer to act
like "normal memory", then it can be marked as prefetchable and mapped
into the host as write-combining, but I think in this case you *don't*
want it to be coherent on the endpoint side either, but would instead use
a streaming mapping with explicit cache management.

Conversely, if the endpoint side requires a coherent mapping, then I
think you will want a strictly ordered (non-wc, non-prefetchable)
mapping on the host side as well.

It would be helpful to have actual endpoint function drivers in the
kernel, rather than just the test drivers, to see what type of
serialization you actually want for best performance on both sides.
Can you give a specific example of an endpoint that you are actually
interested in, maybe one that we have a host-side device driver for
in tree?
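(For concreteness, the knobs being discussed above are the BAR flags an
endpoint function driver hands to the EPC. A rough, untested sketch of
what "mark the BAR as 64-bit prefetchable" would look like in an EPF
driver is below; `epf_setup_prefetchable_bar()` is a hypothetical helper,
and the exact pci_epf_alloc_space() signature varies between kernel
versions, so take this only as an illustration of the flow:)

```c
/* Sketch only, not a tested patch: mark a BAR as 64-bit + prefetchable
 * on the endpoint side before registering it with the controller.
 * Assumes the usual pci_epf_alloc_space()/pci_epc_set_bar() flow from
 * drivers/pci/endpoint/; signatures may differ on your kernel.
 */
#include <linux/pci-epc.h>
#include <linux/pci-epf.h>
#include <linux/sizes.h>
#include <uapi/linux/pci_regs.h>

static int epf_setup_prefetchable_bar(struct pci_epf *epf,
				      enum pci_barno barno, size_t size)
{
	struct pci_epf_bar *bar = &epf->bar[barno];
	void *base;

	/* Backing memory comes from dma_alloc_coherent() internally */
	base = pci_epf_alloc_space(epf, size, barno, SZ_4K /* align */,
				   PRIMARY_INTERFACE);
	if (!base)
		return -ENOMEM;

	/* Prefetchable only makes sense together with a 64-bit BAR */
	bar->flags |= PCI_BASE_ADDRESS_MEM_TYPE_64 |
		      PCI_BASE_ADDRESS_MEM_PREFETCH;

	return pci_epc_set_bar(epf->epc, epf->func_no, epf->vfunc_no, bar);
}
```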
> On the host side, this will mean that the host driver sees the
> Prefetchable bit, and according to:
> https://docs.kernel.org/driver-api/device-io.html
> the host might map the BAR using ioremap_wc().
>
> Looking specifically at drivers/misc/pci_endpoint_test.c, it maps the
> BARs using pci_ioremap_bar():
> https://elixir.bootlin.com/linux/v6.8/source/drivers/pci/pci.c#L252
> which will not map it using ioremap_wc().
> (But the code we have in the PCI epf-core must of course work with host
> side drivers other than pci_endpoint_test.c as well.)

This is to some degree architecture specific. On powerpc and i386 with
MTRRs, any prefetchable BAR will be mapped as write-combining IIRC, but
on arm and arm64 it only depends on whether the host side driver uses
ioremap() or ioremap_wc().

>> - The local CPU on the endpoint side may access the same buffer as
>>   the endpoint device. On low-end SoCs the DMA from the PCI
>>   endpoint is not coherent with the CPU caches, so the CPU may
>
> I don't follow. When doing DMA *from* the endpoint, the DMA HW on the
> EP side will read or write data to a buffer allocated on the host side
> (most likely using dma_alloc_coherent()), but what does that have to
> do with how the EP configures the BARs that it exposes?

I meant doing DMA to the memory of the endpoint side, not the host
side. DMA to the host side memory is completely separate from this
question.

     Arnd
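(To make the arm/arm64 point concrete: a host-side driver that wants to
honour the Prefetchable bit has to pick the mapping type itself, since
plain ioremap()/pci_ioremap_bar() is always strictly ordered there. The
helper below, `map_bar_honouring_prefetch()`, is a hypothetical, untested
sketch of that decision; the in-tree pci_ioremap_wc_bar() covers much of
the same ground:)

```c
/* Sketch only: choose a write-combining mapping for a prefetchable BAR,
 * and a strictly ordered device mapping otherwise. Error handling and
 * request_mem_region() trimmed for brevity.
 */
#include <linux/io.h>
#include <linux/pci.h>

static void __iomem *map_bar_honouring_prefetch(struct pci_dev *pdev, int bar)
{
	resource_size_t start = pci_resource_start(pdev, bar);
	resource_size_t len = pci_resource_len(pdev, bar);

	if (!len)
		return NULL;

	/* Prefetchable BAR: behaves like normal memory, WC is safe */
	if (pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH)
		return ioremap_wc(start, len);

	/* Non-prefetchable BAR: keep the strictly ordered mapping */
	return ioremap(start, len);
}
```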