Re: [PATCH v3 9/9] PCI: endpoint: Set prefetch when allocating memory for 64-bit BARs

Manivannan Sadhasivam <manivannan.sadhasivam@xxxxxxxxxx> · Mon, 18 Mar 2024 10:00:58 +0530

On Fri, Mar 15, 2024 at 06:29:52PM +0100, Arnd Bergmann wrote:
> On Fri, Mar 15, 2024, at 07:44, Manivannan Sadhasivam wrote:
> > On Wed, Mar 13, 2024 at 11:58:01AM +0100, Niklas Cassel wrote:
> >> "Generally only 64-bit BARs are good candidates, since only Legacy
> >> Endpoints are permitted to set the Prefetchable bit in 32-bit BARs,
> >> and most scalable platforms map all 32-bit Memory BARs into
> >> non-prefetchable Memory Space regardless of the Prefetchable bit value."
> >> 
> >> "For a PCI Express Endpoint, 64-bit addressing must be supported for all
> >> BARs that have the Prefetchable bit Set. 32-bit addressing is permitted
> >> for all BARs that do not have the Prefetchable bit Set."
> >> 
> >> "Any device that has a range that behaves like normal memory should mark
> >> the range as prefetchable. A linear frame buffer in a graphics device is
> >> an example of a range that should be marked prefetchable."
> >> 
> >> The PCIe spec tells us that we should have the prefetchable bit set for
> >> 64-bit BARs backed by "normal memory". The backing memory that we allocate
> >> for a 64-bit BAR using pci_epf_alloc_space() (which calls
> >> dma_alloc_coherent()) is obviously "normal memory".
> >> 
> >
> > I'm not sure this is correct. Memory returned by 'dma_alloc_coherent' is not the
> > 'normal memory' but rather 'consistent/coherent memory'. Here the question is,
> > can the memory returned by dma_alloc_coherent() be prefetched or write-combined
> > on all architectures.
> >
> > I hope Arnd can answer this question.
> 
> I think there are three separate questions here when talking about
> a scenario where a PCI master accesses memory behind a PCI endpoint:
> 
> - The CPU on the host side ususally uses ioremap() for mapping
>   the PCI BAR of the device. If the BAR is marked as prefetchable,
>   we usually allow mapping it using ioremap_wc() for write-combining
>   or ioremap_wt() for a write-through mappings that allow both
>   write-combining and prefetching. On some architectures, these
>   all fall back to normal register mappings which do none of these.
>   If it uses write-combining or prefetching, the host side driver
>   will need to manually serialize against concurrent access from
>   the endpoint side.
> 
> - The endpoint device accessing a buffer in memory is controlled
>   by the endpoint driver and may decide to prefetch data into a
>   local cache independent of the other two. I don't know if any
>   of the suppored endpoint devices actually do that. A prefetch
>   from the PCI host side would appear as a normal transaction here.
> 
> - The local CPU on the endpoint side may access the same buffer as
>   the endpoint device. On low-end SoCs the DMA from the PCI
>   endpoint is not coherent with the CPU caches, so the CPU may
>   need to map it as uncacheable to allow data consistency with
>   a the CPU on the PCI host side. On higher-end SoCs (e.g. most
>   non-ARM ones) DMA is coherent with the caches, so the CPU
>   on the endpoint side may map the buffer as cached and
>   still be coherent with a CPU on the PCI host side that has
>   mapped it with ioremap().
> 

Thanks Arnd for the reply.

But I'm not sure I got the answer I was looking for. So let me rephrase my
question a bit.

For BAR memory, PCIe spec states that,

'A PCI Express Function requesting Memory Space through a BAR must set the BAR's
Prefetchable bit unless the range contains locations with read side effects or
locations in which the Function does not tolerate write merging'

So here, spec refers the backing memory allocated on the endpoint side as the
'range' i.e, the BAR memory allocated on the host that gets mapped on the
endpoint.

Currently on the endpoint side, we use dma_alloc_coherent() to allocate the
memory for each BAR and map it using iATU.

So I want to know if the memory range allocated in the endpoint through
dma_alloc_coherent() satisfies the above two conditions in PCIe spec on all
architectures:

1. No Read side effects
2. Tolerates write merging

I believe the reason why we are allocating the coherent memory on the endpoint
first up is not all PCIe controllers are DMA coherent as you said above.

- Mani

-- 
மணிவண்ணன் சதாசிவம்