On Fri, Oct 13, 2017 at 11:22:21AM -0700, Dan Williams wrote:

> > So, who should be responsible for MR coherency? Today we say the MPI
> > is responsible. But we can't really expect the MPI
> > to hook SIGIO and somehow try to reverse engineer what MRs are
> > impacted from a FD that may not even still be open.
>
> Ok, that's good insight that I didn't have. Userspace needs more help
> than just an fd notification.

Glad to help!

> > I think, if you want to build a uAPI for notification of MR lease
> > break, then you need show how it fits into the above software model:
> >  - How it can be hidden in a RDMA specific library
>
> So, here's a strawman can ibv_poll_cq() start returning ibv_wc_status
> == IBV_WC_LOC_PROT_ERR when file coherency is lost. This would make
> the solution generic across DAX and non-DAX. What's your feeling for
> how well applications are prepared to deal with that status return?

Stuffing an entry into the CQ is difficult. The CQ is in user memory
and it is DMA'd from the HCA for several pieces of hardware, so the
kernel can't just stuff something in there. It can be done with HW
support by having the HCA DMA it via an exception path or something,
but even then, you run into questions like CQ overflow and accounting
issues since it is not meant for this.

So, you need a side channel of some kind, either in certain drivers or
generically.

> >  - How lease break can be done hitlessly, so the library user never
> >    needs to know it is happening or see failed/missed transfers
>
> iommu redirect should be hitless and behave like the page cache case
> where RDMA targets pages that are no longer part of the file.

Yes, if the iommu can be fenced properly it sounds doable.

> >  - Whatever fast path checking is needed does not kill performance
>
> What do you consider a fast path?
> I was assuming that memory
> registration is a slow path, and iommu operations are asynchronous so
> should not impact performance of ongoing operations beyond typical
> iommu overhead.

ibv_poll_cq() and ibv_post_send() would be a fast path.

Where this struggled before is that in creating a side channel you
also now have to check that side channel, and checking it at high
performance is quite hard. Even quiescing things to be able to tear
down the MR has performance implications on post send...

Now that I see this whole thing in this light it seems so very similar
to the MPI driven user space mmu notifications ideas and has similar
challenges. FWIW, RDMA banged its head on this issue for 10 years and
it was ODP that emerged as the solution.

One option might be to use an async event notification 'MR
de-coherence' and rely on a main polling loop to catch it.

This is good enough for dax because the lease-requestor would wait
until the async event was processed. It would also be acceptable for
the general MPI case too, but only if this lease concept was wider
than just DAX, eg a MR leases a piece of VMA, and if anything anyhow
changes that VMA (eg munmap, mmap, mremap, etc) then it has to wait
for the MR to release the lease. ie munmap would block until the
async event is processed. ODP-light in userspace, essentially.

IIRC this sort of suggestion was never explored, something like:

 poll(fd)
 ibv_read_async_event(fd)
 if (event == MR_DECOHERENCE) {
     quiesce_network();
     ibv_restore_mr(mr);
     restore_network();
 }

The implementation of ibv_restore_mr would have to make a new MR that
pointed to the same virtual memory addresses, but was backed by the
*new* physical pages. This means it has to unblock the lease, and wait
for the lease requestor to complete executing.

Jason
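
For illustration only, the polling loop sketched above could be modelled
as a small self-contained C simulation. Nothing here is real verbs API:
the 'MR de-coherence' event, sim_restore_mr() and every other sim_* name
are hypothetical, and a pipe stands in for the device's async event fd.
(In today's libibverbs, async events arrive via ibv_get_async_event() on
the context's async_fd; no MR de-coherence event exists there.) The only
point is the ordering: quiesce, restore the MR, then resume.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical event codes; libibverbs defines no such event. */
enum sim_event { SIM_EVENT_OTHER = 0, SIM_EVENT_MR_DECOHERENCE = 1 };

/* Stand-in for an MR: 'generation' counts how often it was re-created. */
struct sim_mr { int generation; };

struct sim_state {
    int network_quiesced;  /* 1 while posting new work is paused */
    int restores;          /* number of MR restores performed */
};

static void quiesce_network(struct sim_state *s) { s->network_quiesced = 1; }
static void restore_network(struct sim_state *s) { s->network_quiesced = 0; }

/* Hypothetical ibv_restore_mr(): a real one would unblock the lease, wait
 * for the lease requestor to finish, and re-register the same virtual
 * range against the new physical pages. Here it only bumps counters. */
static void sim_restore_mr(struct sim_state *s, struct sim_mr *mr)
{
    assert(s->network_quiesced);  /* must not race with in-flight work */
    mr->generation++;
    s->restores++;
}

/* One pass of the main polling loop.
 * Returns 1 if an event was consumed, 0 on timeout, -1 on error. */
int poll_async_once(int fd, struct sim_state *s, struct sim_mr *mr,
                    int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    uint8_t event;
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return -1;
    if (n == 0)
        return 0;
    if (read(fd, &event, 1) != 1)
        return -1;

    if (event == SIM_EVENT_MR_DECOHERENCE) {
        quiesce_network(s);
        sim_restore_mr(s, mr);
        restore_network(s);
    }
    return 1;
}

/* Demo: queue one de-coherence event on a pipe and handle it.
 * Returns the MR generation after handling, or -1 on failure. */
int run_demo(void)
{
    int fds[2];
    struct sim_state s = { 0, 0 };
    struct sim_mr mr = { 0 };
    uint8_t ev = SIM_EVENT_MR_DECOHERENCE;
    int rc = -1;

    if (pipe(fds) != 0)
        return -1;
    if (write(fds[1], &ev, 1) == 1 &&
        poll_async_once(fds[0], &s, &mr, 1000) == 1)
        rc = mr.generation;
    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

Calling run_demo() from a main() handles the single queued event and
returns an MR generation of 1. The fast-path cost discussed above is not
modelled: a real quiesce_network() would have to drain or fence
outstanding work requests before the MR could be swapped.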