On Wed, Jun 21, 2023 at 03:19:48PM +0200, Köry Maincent wrote:
> On Wed, 21 Jun 2023 12:45:35 +0300
> Serge Semin <fancer.lancer@xxxxxxxxx> wrote:
>
> > > I thought that using a read will solve the issue like the
> > > gpio_nand driver (gpio_nand_dosync)
> >
> > AFAICS the io_sync dummy-read there is a workaround to fix the
> > bus-reordering within the SoC bus. In this case we have a PCIe bus,
> > which is supposed to guarantee strong ordering, with the exceptions I
> > described above, unless there is a bug someplace in the PCIe fabric.
> >
> > > but I didn't think of a cache that could return the value of the
> > > read even if the write hasn't fully happened. In the case of a
> > > cache, how could we know that the write is done without using a
> > > delay?
> >
> > MMIO mapping is platform dependent and low-level driver dependent.
> > That's why I asked many times about the platform you are using and
> > the low-level driver that probes the eDMA engine. It would also be
> > useful to know which PCIe host controller is utilized.
> >
> > Mainly, MMIO spaces are mapped in a way that bypasses caching. But
> > in some cases it might be useful to map an MMIO space with
> > additional optimizations like write-combining. For instance, that
> > could be effective for the eDMA linked-list BAR mapping. Indeed, why
> > would you need to send each linked-list byte/word/dword to the
> > device right away when you can combine them, send them all together,
> > flush the cache, and only after that start the DMA transfer? Another
> > possible reason for the write reordering could be the way the PCIe
> > host outbound memory window (a memory region accesses to which are
> > translated to PCIe bus transfers) is configured. For instance, the
> > DW PCIe host controller outbound MW config CSR has a special flag
> > which enables setting a custom attribute on the PCIe bus TLPs
> > (packets). As I mentioned above, that attribute can affect the TLP
> > ordering: make it relaxed or ID-based.
> >
> > Of course we can't reject the possibility of some delays hidden
> > inside your device which may cause the writes to the internal memory
> > to land after the writes to the CSRs. But that seems too exotic to
> > be considered the real cause until the alternatives have been
> > thoroughly checked.
> >
> > What I was trying to say is that your problem can be caused by a
> > much more frequently met reason. If I were you, I would check those
> > first and only then consider a workaround like the one you suggest.
>
> Thanks for your detailed answer, this was instructive.
> I will come back with more information if TLP flags are set.
> FYI the PCIe board I am currently working with is the one from
> Brainchip. Here is the driver:
> https://github.com/Brainchip-Inc/akida_dw_edma

I've glanced at the driver a bit:

1. I've noticed nothing criminal in the way the BARs are mapped. It's
done as it's normally done. pcim_iomap_regions() is supposed to map
with no additional optimizations, so caching seems irrelevant in this
case.

2. The probe() method performs some device iATU config:
akida_1500_setup_iatu() and akida_1000_setup_iatu(). I would have a
closer look at the way the inbound MWs setup is done.

3. akida_1000_iatu_conf_table contains comments about the APB bus. If
it's an internal device bus and both LPDDR and eDMA are accessible over
the same bus, then the re-ordering may happen there. If APB means the
well-known Advanced Peripheral Bus, then it's a quite slow bus with
respect to the system interconnect and PCIe buses. If the eDMA-regs and
LL-memory buses are different, then the last write to the LL-memory
might indeed still be pending while the doorbell update arrives.
Sending a dummy read to the LL-memory stalls the program execution
until a response arrives (PCIe MRd TLPs are non-posted - "send and wait
for response"), which happens only after the last write to the
LL-memory finishes.
That's probably why your fix with the dummy-read works, and why the
delay you noticed is quite significant (4us). Though it looks quite
strange to put LPDDR on such a slow bus.

4. I would also have a closer look at the way the outbound MW is
configured in your PCIe host controller (whether it enables
optimizations like Relaxed Ordering or ID-based Ordering).

In any case, I would get in touch with the FPGA designers and ask
whether any of my suppositions are correct (especially regarding 3.).

-Serge(y)

>
> Köry