On Thu, Feb 11, 2016 at 2:46 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > On Thu, Feb 11, 2016 at 12:58:38PM -0800, Dan Williams wrote: >> On Thu, Feb 11, 2016 at 12:46 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: >> [..] >> >> It seems to me we need to modify the >> >> metadata i/o paths to bypass the page cache, >> > >> > XFS doesn't use the block device page cache for it's metadata - it >> > has it's own internal metadata cache structures and uses get_pages >> > or heap memory to back it's metadata. But that doesn't make mixing >> > DAX and pages in the block device mapping tree sane. >> > >> > What you are missing here is that the underlying architecture of >> > journalling filesystems mean they can't use DAX for their metadata. >> > Modifications have to be buffered, because they have to be written >> > to the journal first before they are written back in place. IOWs, we >> > need to buffer changes in volatile memory for some time, and that >> > means we can't use DAX during transactional modifications. >> > >> > And to put the final nail in that coffin, metadata in XFS can be >> > discontiguous multi-block objects - in those situations we vmap the >> > underlying pages so they appear to the code to be a contiguous >> > buffer, and that's something we can't do with DAX.... >> >> Sorry, I wasn't clear when I said "bypass page cache" I meant a >> solution similar to commit d1a5f2b4d8a1 "block: use DAX for partition >> table reads". > > So there's already bandaids to prevent bad shit from happening in > the block layer, let alone when we consider all the ways that > userspace can screw this all up. > >> However, I suspect that is broken if the filesystem is not ready >> to see a new page allocated for every I/O. I assume one >> thread will want to insert a page in the radix for another thread >> to find/manipulate before metadata gets written back to storage. > > Right, you can't do that, especially as the struct page has a 1-1 > relationship with the bufferhead that is attached to it as the > bufferhead carries the filesystem state for the given cached page. > >> >> or teach the fsync code how to flush populated data pages out >> >> of the radix. >> > >> > That doesn't solve the problem. Filesystems free and reallocate >> > filesystem blocks without intermediate block device mapping >> > invalidation calls, so what is one minute a data block accessed >> > by DAX may become a metadata block that accessed via buffered >> > IO. It all goes to crap very quickly.... >> > >> > However, I'd say fsync is not the place to address this. This >> > block device cache aliasing issue is supposed to be what >> > unmap_underlying_metadata() solves, right? >> >> I'll take a look at this. Right now I'm trying to implement the >> "clear block-device-inode S_DAX on fs mount" approach. My concern >> though is that we need to disable block device mmap while a >> filesystem is mounted... > > /me chokes on his coffee. > > When did mmaping the block device behind the back of a mounted > fileystem become a valid use case? It's not supported for normal > block devices and for the same reasons it won't be supported for DAX > enabled block devices, either. i.e. I'm going to tell anyone who has > an application that does this to go and take a hike when (not if!) > they report filesystem corruption problems. Right, but we need to not confuse the fsync code regardless of how bad of an idea this is ::-). >> Maybe I don't need to worry because it's already the case that a >> mmap of the raw device may not see the most up to date data for a >> file that has dirty fs-page-cache data. > > It goes both ways. What happens if mkfs or fsck modifies the > block device via mmap+DAX and then the filesystem mounts the block > device and tries to read that metadata via the block device page > cache? > > Quite frankly, DAX on the block device is a can of worms we really > don't need to deal with right now. IMO it's a solution looking for a > problem to solve, Virtualization use cases want to give large ranges to guest-VMs, and it is currently the only way to reliably get 1GiB mappings. > the "default to on" policy is wrong (DAX is > opt-in, not opt-out) and given this we should turn it off until > we've solved the more important problems we need to solve. i.e. We > need to concentrate on getting data integrity working correctly > first, then address the cache aliasing issues, then address the > "safe access" issues, and then we can re-introduce block device DAX > access... Agreed. Note that the "default-on policy" came from commit bbab37ddc20b "block: Add support for DAX reads/writes to block devices" way back in 4.2. We're just now noticing. Credit Ross for good sanity checking. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html