On 10/09/2014 06:25 PM, Matthew Wilcox wrote:
> On Thu, Oct 09, 2014 at 12:10:38PM +1100, Dave Chinner wrote:
>> On Wed, Oct 08, 2014 at 03:05:23PM -0400, Matthew Wilcox wrote:
>>>
>>> One of the things on my todo list is making O_DIRECT work to a
>>> memory-mapped direct_access file.
>>
>> I don't understand the motivation or the use case: O_DIRECT is
>> purely for bypassing the page cache, and DAX already bypasses the
>> page cache.  What difference is there between the DAX read/write
>> path and a DAX-based O_DIRECT IO path, and why doesn't just ignoring
>> O_DIRECT for DAX enabled filesystems simply do what you need?
>
> There are two filesystems involved ... if both (or neither!) are DAX,
> everything's fine.  The problem comes when you do things this way around:
>
> int cachefd = open("/dax/cache", O_RDWR);
> int datafd = open("/nfs/bigdata", O_RDWR | O_DIRECT);
> void *cache = mmap(NULL, 1024 * 1024 * 1024, PROT_READ | PROT_WRITE,
>                    MAP_SHARED, cachefd, 0);
> read(datafd, cache, 1024 * 1024);
>

This BTW works today. What happens is that get_user_pages() fails, so
the direct IO on NFS above fails, and the VFS just reverts to buffered
IO, which works just fine with a simple memcpy to/from NFS's page cache.

> The non-DAX filesystem needs to pin pages from the DAX filesystem while
> they're under I/O.
>
>
> Another attempt to solve this problem might be to turn the O_DIRECT
> read into a read into a page of DRAM, followed by a copy from DRAM
> to PMEM.  Conversely, writes could be done as a copy to DRAM followed
> by a page-based write.
>

So that's kind of stupid. Why not let it be @datafd's page cache, like
what actually happens today?

>
> You also elided the paragraphs where I point out that this is an example
> of a more general problem; there really are people who want to do RDMA
> to DAX memory (the HPC crowd, of course),

I do not yet see how, in your proposal, you can ever do RDMA without my
page-structs-for-pmem patch?
This was exactly my motivation to enable this, and to enable direct
block layer access to pmem. And yes, once the page-struct ref is held,
say by RDMA, the block must be left unallocatable until its refcount
drops. This is exactly what we did in our pmem+pages based FS. Today,
RDMA and/or any other subsystem access to pmem is not possible, so this
problem does not arise.

> And we need to not open up
> security holes when enabling that.  Since it's a potentially long-duration
> and bi-directional mapping, the copy solution isn't going to work here

I agree we should be careful not to open any holes. If done right it
should be good. A pmem-aware FS should monitor the reference count of
the pmem page-struct and, if it is still held, must not recycle that
block to the free-store but keep it held until the reference drops. It
is quite simple really.

That said, a sane application should not have this problem. There should
not be a possibility for the RDMA to access loosely coupled pages that
belong to nothing (pages that used to belong to an mmapped file). For
example, taking some kind of flock on the file will make the truncate
wait until the file is closed by the app, and the app does not close it
until the RDMA mapping is closed. Otherwise what is the point of this app?

I agree that exposing pmem to external subsystems, unlike today, might
pose new challenges. But these are doable. On top of Matthew's DAX
patches, there can be a simple API established with the FS where
dax_truncate_page can communicate that a certain block must not yet be
returned to the free-store after the truncate, and will be returned to
the free-store later on.

Thanks
Boaz
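P.S. To make the "keep the block held until the reference drops" rule
concrete, here is a minimal userspace sketch in C. It is only a model of
the idea, not kernel code and not what any existing patch does: the
names (pmem_block, blk_alloc, blk_pin, blk_unpin, blk_truncate) and the
tiny fixed-size block table are all hypothetical, and there is no
locking. A truncate on a pinned block parks it in a pending state, and
the last unpin is what actually returns it to the free-store.

```c
#include <assert.h>
#include <stdbool.h>

#define NBLOCKS 8   /* hypothetical tiny pmem device, for illustration only */

enum blk_state { BLK_FREE, BLK_IN_USE, BLK_PENDING_FREE };

struct pmem_block {
    enum blk_state state;
    int refcount;   /* external pins, e.g. a long-lived RDMA registration */
};

/* Zero-initialized, so every block starts out BLK_FREE with no pins. */
static struct pmem_block blocks[NBLOCKS];

/* Allocate the first free block; returns its index, or -1 if none is free. */
static int blk_alloc(void)
{
    for (int i = 0; i < NBLOCKS; i++) {
        if (blocks[i].state == BLK_FREE) {
            blocks[i].state = BLK_IN_USE;
            blocks[i].refcount = 0;
            return i;
        }
    }
    return -1;
}

/* An external subsystem takes a reference on an allocated block. */
static void blk_pin(int b)
{
    assert(b >= 0 && b < NBLOCKS && blocks[b].state == BLK_IN_USE);
    blocks[b].refcount++;
}

/*
 * Truncate path: release the block right away if nobody holds it,
 * otherwise defer. Returns true if the block went back to the
 * free-store immediately, false if the free was deferred.
 */
static bool blk_truncate(int b)
{
    assert(b >= 0 && b < NBLOCKS && blocks[b].state == BLK_IN_USE);
    if (blocks[b].refcount > 0) {
        blocks[b].state = BLK_PENDING_FREE;
        return false;
    }
    blocks[b].state = BLK_FREE;
    return true;
}

/* Drop a reference; the last unpin of a truncated block completes the free. */
static void blk_unpin(int b)
{
    assert(b >= 0 && b < NBLOCKS && blocks[b].refcount > 0);
    if (--blocks[b].refcount == 0 && blocks[b].state == BLK_PENDING_FREE)
        blocks[b].state = BLK_FREE;
}
```

In this model a block that is truncated while pinned is never handed out
by blk_alloc() until the holder calls blk_unpin(), which is the property
the FS needs to guarantee before pmem pages are exposed to RDMA or the
block layer.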