On Fri, Oct 20, 2017 at 6:05 AM, Jeff Layton <jlayton@xxxxxxxxxx> wrote: > On Thu, 2017-10-19 at 19:40 -0700, Dan Williams wrote: >> get_user_pages() pins file backed memory pages for access by dma >> devices. However, it only pins the memory pages not the page-to-file >> offset association. If a file is truncated the pages are mapped out of >> the file and dma may continue indefinitely into a page that is owned by >> a device driver. This breaks coherency of the file vs dma, but the >> assumption is that if userspace wants the file-space truncated it does >> not matter what data is inbound from the device, it is not relevant >> anymore. >> >> The assumptions of the truncate-page-cache model are broken by DAX where >> the target DMA page *is* the filesystem block. Leaving the page pinned >> for DMA, but truncating the file block out of the file, means that the >> filesytem is free to reallocate a block under active DMA to another >> file! >> >> Here are some possible options for fixing this situation ('truncate' and >> 'fallocate(punch hole)' are synonymous below): >> >> 1/ Fail truncate while any file blocks might be under dma >> >> 2/ Block (sleep-wait) truncate while any file blocks might be under >> dma >> >> 3/ Remap file blocks to a "lost+found"-like file-inode where >> dma can continue and we might see what inbound data from DMA was >> mapped out of the original file. Blocks in this file could be >> freed back to the filesystem when dma eventually ends. >> >> 4/ Disable dax until option 3 or another long term solution has been >> implemented. However, filesystem-dax is still marked experimental >> for concerns like this. >> >> Option 1 will throw failures where userspace has never expected them >> before, option 2 might hang the truncating process indefinitely, and >> option 3 requires per filesystem enabling to remap blocks from one inode >> to another. Option 2 is implemented in this patch for the DAX path with >> the expectation that non-transient users of get_user_pages() (RDMA) are >> disallowed from setting up dax mappings and that the potential delay >> introduced to the truncate path is acceptable compared to the response >> time of the page cache case. This can only be seen as a stop-gap until >> we can solve the problem of safely sequestering unallocated filesystem >> blocks under active dma. >> > > FWIW, I like #3 a lot more than #2 here. I get that it's quite a bit > more work though, so no objection to this as a stop-gap fix. I agree, but it needs quite a bit more thought and restructuring of the truncate path. I also wonder how we reclaim those stranded filesystem blocks, but a first approximation is wait for the administrator to delete them or auto-delete them at the next mount. XFS seems well prepared to reflink-swap these DMA blocks around, but I'm not sure about EXT4. > > >> The solution introduces a new FL_ALLOCATED lease to pin the allocated >> blocks in a dax file while dma might be accessing them. It behaves >> identically to an FL_LAYOUT lease save for the fact that it is >> immediately sheduled to be reaped, and that the only path that waits for >> its removal is the truncate path. We can not reuse FL_LAYOUT directly >> since that would deadlock in the case where userspace did a direct-I/O >> operation with a target buffer backed by an mmap range of the same file. >> >> Credit / inspiration for option 3 goes to Dave Hansen, who proposed >> something similar as an alternative way to solve the problem that >> MAP_DIRECT was trying to solve. >> >> Cc: Jan Kara <jack@xxxxxxx> >> Cc: Jeff Moyer <jmoyer@xxxxxxxxxx> >> Cc: Dave Chinner <david@xxxxxxxxxxxxx> >> Cc: Matthew Wilcox <mawilcox@xxxxxxxxxxxxx> >> Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx> >> Cc: "Darrick J. Wong" <darrick.wong@xxxxxxxxxx> >> Cc: Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx> >> Cc: Jeff Layton <jlayton@xxxxxxxxxxxxxxx> >> Cc: "J. Bruce Fields" <bfields@xxxxxxxxxxxx> >> Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx> >> Reported-by: Christoph Hellwig <hch@xxxxxx> >> Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx> >> --- >> fs/Kconfig | 1 >> fs/dax.c | 188 +++++++++++++++++++++++++++++++++++++++++++++++++++ >> fs/locks.c | 17 ++++- >> include/linux/dax.h | 23 ++++++ >> include/linux/fs.h | 22 +++++- >> mm/gup.c | 27 ++++++- >> 6 files changed, 268 insertions(+), 10 deletions(-) >> >> diff --git a/fs/Kconfig b/fs/Kconfig >> index 7aee6d699fd6..a7b31a96a753 100644 >> --- a/fs/Kconfig >> +++ b/fs/Kconfig >> @@ -37,6 +37,7 @@ source "fs/f2fs/Kconfig" >> config FS_DAX >> bool "Direct Access (DAX) support" >> depends on MMU >> + depends on FILE_LOCKING >> depends on !(ARM || MIPS || SPARC) >> select FS_IOMAP >> select DAX >> diff --git a/fs/dax.c b/fs/dax.c >> index b03f547b36e7..e0a3958fc5f2 100644 >> --- a/fs/dax.c >> +++ b/fs/dax.c >> @@ -22,6 +22,7 @@ >> #include <linux/genhd.h> >> #include <linux/highmem.h> >> #include <linux/memcontrol.h> >> +#include <linux/file.h> >> #include <linux/mm.h> >> #include <linux/mutex.h> >> #include <linux/pagevec.h> >> @@ -1481,3 +1482,190 @@ int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size, >> } >> } >> EXPORT_SYMBOL_GPL(dax_iomap_fault); >> + >> +enum dax_lease_flags { >> + DAX_LEASE_PAGES, >> + DAX_LEASE_BREAK, >> +}; >> + >> +struct dax_lease { >> + struct page **dl_pages; >> + unsigned long dl_nr_pages; >> + unsigned long dl_state; >> + struct file *dl_file; >> + atomic_t dl_count; >> + /* >> + * Once the lease is taken and the pages have references we >> + * start the reap_work to poll for lease release while acquiring >> + * fs locks that synchronize with truncate. So, either reap_work >> + * cleans up the dax_lease instances or truncate itself. >> + * >> + * The break_work sleepily polls for DMA completion and then >> + * unlocks/removes the lease. >> + */ >> + struct delayed_work dl_reap_work; >> + struct delayed_work dl_break_work; >> +}; >> + >> +static void put_dax_lease(struct dax_lease *dl) >> +{ >> + if (atomic_dec_and_test(&dl->dl_count)) { >> + fput(dl->dl_file); >> + kfree(dl); >> + } >> +} > > Any reason not to use the new refcount_t type for dl_count? Seems like a > good place for it. I'll take a look. >> + >> +static void dax_lease_unlock_one(struct work_struct *work) >> +{ >> + struct dax_lease *dl = container_of(work, typeof(*dl), >> + dl_break_work.work); >> + unsigned long i; >> + >> + /* wait for the gup path to finish recording pages in the lease */ >> + if (!test_bit(DAX_LEASE_PAGES, &dl->dl_state)) { >> + schedule_delayed_work(&dl->dl_break_work, HZ); >> + return; >> + } >> + >> + /* barrier pairs with dax_lease_set_pages() */ >> + smp_mb__after_atomic(); >> + >> + /* >> + * If we see all pages idle at least once we can remove the >> + * lease. If we happen to race with someone else taking a >> + * reference on a page they will have their own lease to protect >> + * against truncate. >> + */ >> + for (i = 0; i < dl->dl_nr_pages; i++) >> + if (page_ref_count(dl->dl_pages[i]) > 1) { >> + schedule_delayed_work(&dl->dl_break_work, HZ); >> + return; >> + } >> + vfs_setlease(dl->dl_file, F_UNLCK, NULL, (void **) &dl); >> + put_dax_lease(dl); >> +} >> + >> +static void dax_lease_reap_all(struct work_struct *work) >> +{ >> + struct dax_lease *dl = container_of(work, typeof(*dl), >> + dl_reap_work.work); >> + struct file *file = dl->dl_file; >> + struct inode *inode = file_inode(file); >> + struct address_space *mapping = inode->i_mapping; >> + >> + if (mapping->a_ops->dax_flush_dma) { >> + mapping->a_ops->dax_flush_dma(inode); >> + } else { >> + /* FIXME: dax-filesystem needs to add dax-dma support */ >> + break_allocated(inode, true); >> + } >> + put_dax_lease(dl); >> +} >> + >> +static bool dax_lease_lm_break(struct file_lock *fl) >> +{ >> + struct dax_lease *dl = fl->fl_owner; >> + >> + if (!test_and_set_bit(DAX_LEASE_BREAK, &dl->dl_state)) { >> + atomic_inc(&dl->dl_count); >> + schedule_delayed_work(&dl->dl_break_work, HZ); >> + } >> + > > I haven't gone over this completely, but what prevents you from doing a > 0->1 transition on the dl_count here, and possibly doing a use-after > free? > > Ahh ok...I guess we know that we hold a reference since this is on the > flc_lease list? Fair enough. Still, might be worth a comment there as to > why that's safe. Right, we hold a reference count at the beginning of time that is only dropped when the lease is unlocked. If the break happens before unlock we take this reference while the break_work is running. I'll add this as a comment.