On Thu, Oct 12, 2017 at 11:27 AM, Jason Gunthorpe
<jgunthorpe@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Tue, Oct 10, 2017 at 01:17:26PM -0700, Dan Williams wrote:
>
>> Also keep in mind that what triggers the lease break is another
>> application trying to write or punch holes in a file that is mapped
>> for RDMA. So, if the hardware can't handle the iommu mapping getting
>> invalidated asynchronously and the application can't react in the
>> lease break timeout period then the administrator should arrange for
>> the file to not be written or truncated while it is mapped.
>
> That makes sense, but why not return ENOSYS or something to the app
> trying to alter the file if the RDMA hardware can't support this,
> instead of having the RDMA app deal with this lease break weirdness?

That's where I started: an inode flag that said "hands off, this file
is busy". But Christoph pointed out that we should reuse the same
mechanisms that pnfs is using. The pnfs protection scheme uses file
leases, and once the kernel decides that a lease needs to be broken /
a layout needs to be recalled, there is no stopping it, only delaying
it.

>> It's already the case that get_user_pages() does not lock down file
>> associations, so if your application is contending with these types of
>> file changes it likely already has a problem keeping transactions in
>> sync with the file state even without DAX.
>
> Yes, things go weird in non-ODP RDMA cases like this..
>
> Also, just to be clear, I would expect an app using the SIGIO interface
> to basically halt ongoing RDMA, wait for MRs to become unused locally
> and remotely, destroy the MRs, then somehow establish new MRs that
> cover the same logical map (eg what ODP would do transparently) after
> the lease breaker has made their changes, then restart their IO.
>
> Does your SIGIO approach have a race-free way to do those last steps?

After the SIGIO it becomes a userspace / driver problem to quiesce the
I/O...
However, chatting this over with a few more people, I have an alternate
solution that effectively behaves the same as how non-ODP hardware
handles this case of hole punch / truncation today. Today, if this
scenario happens on a page-cache-backed mapping, the file blocks are
unmapped and the RDMA continues into pinned pages that are no longer
part of the file. We can achieve the same thing with the iommu: just
re-target the I/O into memory that isn't part of the file. That way the
hardware does not see I/O errors, and the DAX data consistency model is
no worse than the page-cache case.

>>> So, not being able to support DAX on certain RDMA hardware is not
>>> an unreasonable situation in our space.
>>
>> That makes sense, but it still seems to me that this proposed solution
>> allows more than enough ways to avoid that worst case scenario where
>> hardware reacts badly to iommu invalidation.
>
> Yes, although I am concerned that returning PCI-E errors is such an
> unusual and untested path for some of our RDMA drivers that they may
> malfunction badly...
>
> Again, going back to the question of who would ever use this, I would
> be very reluctant to deploy a production configuration relying on the
> iommu invalidate or SIGIO techniques when ODP HW is available and
> works flawlessly.

I don't think it is reasonable to tell people they need to throw away
their old hardware just because they want to target a DAX mapping.

>> be blacklisted from supporting DAX altogether. In other words this is
>> a starting point to incrementally enhance or disable specific drivers,
>> but with the assurance that the kernel can always do the safe thing
>> when / if the driver is missing a finer grained solution.
>
> Seems reasonable.. I think existing HW will have an easier time adding
> invalidate, while new hardware really should implement ODP.

Yeah, so if we go with 'remap' instead of 'invalidate', does that
address your concerns?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html