On Wed, Jun 21, 2023 at 03:19:48PM +0200, Köry Maincent wrote:
> On Wed, 21 Jun 2023 12:45:35 +0300
> Serge Semin <fancer.lancer@xxxxxxxxx> wrote:
>
> > > I thought that using a read will solve the issue like the
> > > gpio_nand driver (gpio_nand_dosync)
> >
> > AFAICS the io_sync dummy-read there is a workaround to fix the
> > bus-reordering within the SoC bus. In this case we have a PCIe bus,
> > which is supposed to guarantee strong ordering, with the exceptions I
> > described above, unless there is a bug someplace in the PCIe fabric.
> >
> > > but I didn't think of a cache that could return the value of the
> > > read even if the write hasn't fully happened. In the case of a
> > > cache, how could we know that the write is done without using a
> > > delay?
> >
> > MMIO mapping is platform dependent and low-level driver dependent.
> > That's why I asked many times about the platform you are using and
> > the low-level driver that probes the eDMA engine. It would also be
> > useful to know which PCIe host controller is utilized.
> >
> > Mainly, MMIO spaces are mapped in a way that bypasses caching. But
> > in some cases it might be useful to map an MMIO space with
> > additional optimizations like write-combining. For instance, that
> > could be effective for the eDMA linked-list BAR mapping. Indeed, why
> > would you need to send each linked-list byte/word/dword to the
> > device right away when you can combine them, send them all together,
> > flush the cache, and only after that start the DMA transfer? Another
> > possible reason for the write reordering could be the way the PCIe
> > host outbound memory window (a memory region accesses to which are
> > translated to PCIe bus transfers) is configured. For instance, the
> > DW PCIe host controller outbound MW config CSR has a special flag
> > which enables setting a custom attribute on the PCIe bus TLPs
> > (packets). As I mentioned above, that attribute can affect the TLP
> > ordering: make it relaxed or ID-based.
> >
> > Of course we can't reject the possibility of some delays hidden
> > inside your device which may cause the writes to the internal memory
> > to land after the writes to the CSRs. But that seems too exotic to
> > be considered the real cause until the alternatives have been
> > thoroughly checked.
> >
> > What I was trying to say is that your problem can be caused by a
> > much more frequently met reason. If I were you, I would check those
> > first and only then consider a workaround like the one you suggest.
>
> Thanks for your detailed answer, this was instructive.
> I will come back with more information if TLP flags are set.
> FYI the PCIe board I am currently working with is the one from
> Brainchip. Here is the driver:
> https://github.com/Brainchip-Inc/akida_dw_edma

I've glanced at the driver a bit:

1. I've noticed nothing criminal in the way the BARs are mapped. It's
done as it's normally done. pcim_iomap_regions() is supposed to map
with no additional optimizations, so caching seems irrelevant in this
case.

2. The probe() method performs some device iATU config:
akida_1500_setup_iatu() and akida_1000_setup_iatu(). I would have a
closer look at the way the inbound MWs setup is done.

3. akida_1000_iatu_conf_table contains comments about the APB bus. If
it's an internal device bus and both LPDDR and eDMA are accessible over
the same bus, then the re-ordering may happen there. If APB means the
well-known Advanced Peripheral Bus, then it's a quite slow bus with
respect to the system interconnect and PCIe buses. If the eDMA-regs and
LL-memory buses are different, then the last write to the LL-memory
might indeed still be pending while the doorbell update arrives.
Sending a dummy read to the LL-memory stalls the program execution
until a response arrives (PCIe MRd TLPs are non-posted - "send and wait
for response"), which happens only after the last write to the
LL-memory finishes.
That's probably why your fix with the dummy-read works, and why the
delay you noticed is quite significant (4us). Though it looks quite
strange to put LPDDR on such a slow bus.

4. I would also have a closer look at the way the outbound MW is
configured in your PCIe host controller (whether it enables
optimizations like Relaxed Ordering or ID-based Ordering).

In any case, I would get in touch with the FPGA designers and ask
whether any of my suppositions are correct (especially regarding 3.).

-Serge(y)

>
> Köry