Re: [RFC 6/8] nvmet: Be careful about using iomem accesses when dealing with p2pmem

Logan Gunthorpe <logang@xxxxxxxxxxxx> · Thu, 6 Apr 2017 10:02:04 -0600

On 05/04/17 11:33 PM, Sagi Grimberg wrote:
> 
>>> Note that the nvme completion queues are still on the host memory, so
>>> this means we have lost the ordering between data and completions as
>>> they go to different pcie targets.
>>
>> Hmm, in this simple up/down case with a switch, I think it might
>> actually be OK.
>>
>> Transactions might not complete at the NVMe device before the CPU
>> processes the RDMA completion, however due to the PCI-E ordering rules
>> new TLPs directed to the NVMe will complete after the RMDA TLPs and
>> thus observe the new data. (eg order preserving)
>>
>> It would be very hard to use P2P if fabric ordering is not preserved..
> 
> I think it still can race if the p2p device is connected with more than
> a single port to the switch.
> 
> Say it's connected via 2 legs, the bar is accessed from leg A and the
> data from the disk comes via leg B. In this case, the data is heading
> towards the p2p device via leg B (might be congested), the completion
> goes directly to the RC, and then the host issues a read from the
> bar via leg A. I don't understand what can guarantee ordering here.
> 
> Stephen told me that this still guarantees ordering, but I honestly
> can't understand how, perhaps someone can explain to me in a simple
> way that I can understand.

I'll say I don't have a complete understanding of this myself. However,
my understanding is the completion coming from disk won't be sent toward
the RC until all the all the TLPs reached leg B. Then if the RC sends
TLPs to the p2p device via leg B they will be behind all the TLPs the
disk sent. Or something like that. Obviously this will only work with a
tree topology (which I believe is the only topology that makes sense for
PCI). If you had a mesh topology, then the data could route around
congestion and that would get around the ordering restrictions.

Logan