On 05/04/17 11:33 PM, Sagi Grimberg wrote:
>
>>> Note that the nvme completion queues are still on the host memory, so
>>> this means we have lost the ordering between data and completions as
>>> they go to different pcie targets.
>>
>> Hmm, in this simple up/down case with a switch, I think it might
>> actually be OK.
>>
>> Transactions might not complete at the NVMe device before the CPU
>> processes the RDMA completion, however due to the PCI-E ordering rules
>> new TLPs directed to the NVMe will complete after the RDMA TLPs and
>> thus observe the new data. (eg order preserving)
>>
>> It would be very hard to use P2P if fabric ordering is not preserved..
>
> I think it still can race if the p2p device is connected with more than
> a single port to the switch.
>
> Say it's connected via 2 legs, the bar is accessed from leg A and the
> data from the disk comes via leg B. In this case, the data is heading
> towards the p2p device via leg B (might be congested), the completion
> goes directly to the RC, and then the host issues a read from the
> bar via leg A. I don't understand what can guarantee ordering here.
>
> Stephen told me that this still guarantees ordering, but I honestly
> can't understand how, perhaps someone can explain to me in a simple
> way that I can understand.

I'll say I don't have a complete understanding of this myself. However,
my understanding is that the completion coming from the disk won't be
sent toward the RC until all the TLPs have reached leg B. Then, if the
RC sends TLPs to the p2p device via leg B, they will be behind all the
TLPs the disk sent. Or something like that.

Obviously this will only work with a tree topology (which I believe is
the only topology that makes sense for PCI). If you had a mesh topology,
then the data could route around congestion and that would get around
the ordering restrictions.

Logan
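For what it's worth, here is a toy sketch of the two cases discussed
above. It is only a model under stated assumptions: each leg is treated
as a strict FIFO, the congested leg B is assumed to drain after the idle
leg A, and the names Link and host_read_after_completion are made up for
illustration. It says nothing about what real switch hardware does in
the two-leg case.

```python
from collections import deque


class Link:
    """One leg into the p2p device, modelled as a strict FIFO (assumption)."""

    def __init__(self):
        self.fifo = deque()

    def send(self, tlp):
        # TLPs queue up in the order they were sent on this leg.
        self.fifo.append(tlp)

    def deliver_one(self, memory):
        """Deliver the oldest TLP; return the read result, or None for a write."""
        kind, addr, value = self.fifo.popleft()
        if kind == "write":
            memory[addr] = value
            return None
        return memory.get(addr, "stale")


def host_read_after_completion(read_leg):
    """Return what the host reads from the BAR when it reads via read_leg."""
    memory = {}  # p2p device memory; nothing written yet
    legs = {"A": Link(), "B": Link()}

    # The disk's data is queued on leg B, which is congested (not drained yet).
    legs["B"].send(("write", 0, "new"))

    # The completion has reached the host, which now reads the BAR.
    legs[read_leg].send(("read", 0, None))

    # Assumption: the idle leg A drains before the congested leg B.
    result = None
    for name in ("A", "B"):
        while legs[name].fifo:
            r = legs[name].deliver_one(memory)
            if r is not None:
                result = r
    return result


if __name__ == "__main__":
    print("read via leg B:", host_read_after_completion("B"))
    print("read via leg A:", host_read_after_completion("A"))
```

In this model, a read sent down leg B queues behind the disk's write and
sees the new data, while a read sent down leg A can overtake it. That is
the race Sagi describes, and the reason the argument above relies on the
RC's TLPs following the same leg as the data.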