> -----Original Message-----
> From: Kishon Vijay Abraham I <kishon@xxxxxx>
> Sent: Tuesday, October 11, 2022 6:38 AM
> To: Frank Li <frank.li@xxxxxxx>; fancer.lancer@xxxxxxxxx;
> helgaas@xxxxxxxxxx; sergey.semin@xxxxxxxxxxxxxxxxxxxx; kw@xxxxxxxxx;
> linux-pci@xxxxxxxxxxxxxxx; manivannan.sadhasivam@xxxxxxxxxx;
> ntb@xxxxxxxxxxxxxxx; jdmason@xxxxxxxx; haotian.wang@xxxxxxxxxx;
> lznuaa@xxxxxxxxx; imx@xxxxxxxxxxxxxxx
> Subject: [EXT] Re: [RFC] PCI EP/RC network transfer by using eDMA
>
> Hi Frank,
>
> On 29/09/22 3:08 am, Frank Li wrote:
> >
> > ALL:
> >
> > Recently some important PCI EP function patches were merged,
> > especially DWC eDMA support.
> > PCIe eDMA has a nice feature: it can read/write all PCI host
> > memory regardless of the EP-side PCI memory map window size.
> > pci-epf-vntb.c has also been merged into mainline,
> > and part of the vNTB MSI patch series is already merged:
> > https://lore.kernel.org/imx/86mtaj7hdw.wl-maz@kernel.org/T/#m35546867af07735c1070f596d653a2666f453c52
> >
> > Although MSI can improve transfer latency, the transfer speed
> > is still quite slow because DMA is not supported yet.
> >
> > I plan to continue improving transfer speed, but I found some
> > fundamental limitations in the original framework that prevent it
> > from getting the full benefit of eDMA.
>
> By framework, you mean limitations with pci-epf-vntb, right?

[Frank Li] Not pci-epf-vntb; the limitation is in the NTB definition itself.
In NTB, one CPU maps only part of the other CPU's memory,
so at least one memory copy has to happen:
  1. CPU1: copy from the user-space buffer to the mapped memory.
  2. CPU2: copy from the mapped memory to the user-space buffer.

NTB supports memory-to-memory DMA for steps 1 and 2, but the additional
memory copy is still needed whether it is done by DMA or by the CPU.

We can change ntb_transport.c to support the PCIe EP's eDMA for the write
direction. All reads in NTB are local memory to local memory.

> > After researching some old threads:
> > https://lore.kernel.org/linux-pci/20200702082143.25259-1-kishon@ti.com/
> > https://lore.kernel.org/linux-pci/9f8e596f-b601-7f97-a98a-111763f966d1@ti.com/T/
> > some RDMA documents, and https://github.com/ntrdma/ntrdma-ext
> >
> > I think a solution based on Haotian Wang's patch will be the best one.
>
> why?

[Frank Li] See below.
> >
> > ┌─────────────────────────────────┐   ┌──────────────┐
> > │                                 │   │              │
> > │                                 │   │              │
> > │   VirtQueue             RX      │   │  VirtQueue   │
> > │      TX                ┌──┐     │   │     TX       │
> > │  ┌─────────┐           │  │     │   │ ┌─────────┐  │
> > │  │ SRC LEN ├─────┐  ┌──┤  │◄────┼───┼─┤ SRC LEN │  │
> > │  ├─────────┤     │  │  │  │     │   │ ├─────────┤  │
> > │  │         │     │  │  │  │     │   │ │         │  │
> > │  ├─────────┤     │  │  │  │     │   │ ├─────────┤  │
> > │  │         │     │  │  │  │     │   │ │         │  │
> > │  └─────────┘     │  │  └──┘     │   │ └─────────┘  │
> > │                  │  │           │   │              │
> > │      RX      ┌───┼──┘   TX      │   │     RX       │
> > │  ┌─────────┐ │   │     ┌──┐     │   │ ┌─────────┐  │
> > │  │         │◄┘   └────►│  ├─────┼───┼─┤         │  │
> > │  ├─────────┤           │  │     │   │ ├─────────┤  │
> > │  │         │           │  │     │   │ │         │  │
> > │  ├─────────┤           │  │     │   │ ├─────────┤  │
> > │  │         │           │  │     │   │ │         │  │
> > │  └─────────┘           │  │     │   │ └─────────┘  │
> > │  virtio_net            └──┘     │   │  virtio_net  │
> > │ Virtual PCI BUS     EDMA Queue  │   │              │
> > ├─────────────────────────────────┤   │              │
> > │   PCI EP Controller with eDMA   │   │   PCI Host   │
> > └─────────────────────────────────┘   └──────────────┘
> >
> >
> > The basic idea is:
> > 1. Both the EP and the host probe the virtio_net driver.
> > 2. There are two queues: one on the EP side (EQ), the other on the
> >    host side.
> > 3. The EP-side epf driver maps the host-side queue into the EP's
> >    address space; this is called HQ.
> > 4. One worker thread:
> >    a. Picks one TX from EQ and one RX from HQ, combines them into an
> >       eDMA request, and puts it into the DMA TX queue.
> >    b. Picks one RX from EQ and one TX from HQ, combines them into an
> >       eDMA request, and puts it into the DMA RX queue.
> > 5. The eDMA done IRQ marks the related items in EQ and HQ as finished.
> >
> > The whole transfer is zero-copy and uses the DMA queue.
> >
> > RDMA has a similar idea but needs more coding effort.
>
> My suggestion would be to pick a cleaner solution with the right
> abstractions and not based on coding efforts.

[Frank Li] My idea is quite similar to RDMA. I am not sure how many people
are using RDMA. I may need to do more research on InfiniBand RDMA.
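
Steps 4 and 5 of the basic idea above can be modeled in a few lines. This is only a user-space Python sketch of the pairing logic, not kernel code: the names Desc, DmaRequest, pair_descriptors and dma_done_irq are hypothetical, the descriptor layout is invented for illustration, and stopping when the next RX buffer is too small is an assumption the proposal does not cover. The model also treats every queued request as finished when the completion handler runs.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Desc:
    """One queue entry: buffer address and length (hypothetical layout)."""
    addr: int
    length: int
    done: bool = False


@dataclass
class DmaRequest:
    """One eDMA transfer built from a TX entry and an RX entry."""
    src: Desc
    dst: Desc
    nbytes: int


def pair_descriptors(tx_queue, rx_queue, dma_queue):
    """Worker step: pair TX entries with RX entries until one queue is empty.

    Assumption: pairing also stops if the next RX buffer is too small for
    the next TX buffer. Returns the number of requests queued.
    """
    queued = 0
    while tx_queue and rx_queue:
        if tx_queue[0].length > rx_queue[0].length:
            break
        tx = tx_queue.popleft()
        rx = rx_queue.popleft()
        dma_queue.append(DmaRequest(src=tx, dst=rx, nbytes=tx.length))
        queued += 1
    return queued


def dma_done_irq(dma_queue):
    """Completion step: mark both ends of each queued request as done.

    The model assumes every request in dma_queue has finished.
    """
    completed = 0
    while dma_queue:
        req = dma_queue.popleft()
        req.src.done = True
        req.dst.done = True
        completed += 1
    return completed


# EP -> host direction: TX entries come from EQ, RX entries from HQ.
eq_tx = deque([Desc(0x1000, 64), Desc(0x2000, 128)])
hq_rx = deque([Desc(0x9000, 2048)])
dma_tx_queue = deque()

pair_descriptors(eq_tx, hq_rx, dma_tx_queue)  # queues one request; HQ is then empty
dma_done_irq(dma_tx_queue)                    # marks that TX/RX pair as done
```

The second TX entry stays in EQ until the host posts another RX buffer, which is the "until one of the queues is empty" condition. The host -> EP direction would use the same two functions with the TX entries taken from HQ and the RX entries from EQ.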
> >
> > I think Kishon Vijay Abraham I prefers to use vhost, but I don't know
> > how to build a queue on the host side.
>
> Not sure what you mean by host side here. But the queue would be only on
> virtio frontend (virtio-net running on PCIe RC) and PCIe EP would access
> the front-end's queue.

[Frank Li] We have to use two queues to maximize transfer speed:
an EP queue and an RC queue.

With just one queue at the PCIe RC, the EP has to do the following:
  1. DMA map
  2. Submit one transfer to the DMA queue
  3. Wait for the DMA transfer to finish
  4. DMA unmap

The latency from step 1 to step 4 is quite large.

With two queues, an EP queue and an RC queue:

Kernel thread (EP -> RC example):
  1. Dequeue a TX from the EP queue and an RX from the RC queue.
  2. Put the pair into the eDMA hardware transfer queue.
  3. Go to 1, until either the EP queue or the RC queue is empty.

IRQ:
  Loop over the eDMA hardware transfer queue, mark the EP queue's TX
  item done, mark the RC queue's RX item done, and notify both the EP
  and RC sides.

The whole flow needs no additional memory copy and no synchronous wait.
Everything is asynchronous.

> > NTB transfers use eDMA in only one direction (DMA write), because a
> > read is actually local memory to local memory.
> >
> > Any comments about the overall solution?
>
> I would suggest you to go through the comments received on Haotian Wang
> patch and suggest what changes you are proposing.

[Frank Li] Your major concern was the EP side using vhost.
eDMA changes the situation; see the explanation above.

>
> Thanks,
> Kishon