On Mon, Oct 30, 2017 at 4:20 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > On Mon, Oct 30, 2017 at 09:38:07AM +0100, Jan Kara wrote: >> Hi, >> >> On Mon 30-10-17 13:00:23, Dave Chinner wrote: >> > On Sun, Oct 29, 2017 at 04:46:44PM -0700, Dan Williams wrote: >> > > Coming back to this since Dave has made clear that new locking to >> > > coordinate get_user_pages() is a no-go. >> > > >> > > We can unmap to force new get_user_pages() attempts to block on the >> > > per-fs mmap lock, but if punch-hole finds any elevated pages it needs >> > > to drop the mmap lock and wait. We need this lock dropped to get >> > > around the problem that the driver will not start to drop page >> > > references until it has elevated the page references on all the pages >> > > in the I/O. If we need to drop the mmap lock that makes it impossible >> > > to coordinate this unlock/retry loop within truncate_inode_pages_range >> > > which would otherwise be the natural place to land this code. >> > > >> > > Would it be palatable to unmap and drain dma in any path that needs to >> > > detach blocks from an inode? Something like the following that builds >> > > on dax_wait_dma() tried to achieve, but does not introduce a new lock >> > > for the fs to manage: >> > > >> > > retry: >> > > per_fs_mmap_lock(inode); >> > > unmap_mapping_range(mapping, start, end); /* new page references >> > > cannot be established */ >> > > if ((dax_page = dax_dma_busy_page(mapping, start, end)) != NULL) { >> > > per_fs_mmap_unlock(inode); /* new page references can happen, >> > > so we need to start over */ >> > > wait_for_page_idle(dax_page); >> > > goto retry; >> > > } >> > > truncate_inode_pages_range(mapping, start, end); >> > > per_fs_mmap_unlock(inode); >> > >> > These retry loops you keep proposing are just bloody horrible. They >> > are basically just a method for blocking an operation until whatever >> > condition is preventing the invalidation goes away. IMO, that's an >> > ugly solution no matter how much lipstick you dress it up with. >> > >> > i.e. the blocking loops mean the user process is going to be blocked >> > for arbitrary lengths of time. That's not a solution, it's just >> > passing the buck - now the userspace developers need to work around >> > truncate/hole punch being randomly blocked for arbitrary lengths of >> > time. >> >> So I see substantial difference between how you and Christoph think this >> should be handled. Christoph writes in [1]: >> >> The point is that we need to prohibit long term elevated page counts >> with DAX anyway - we can't just let people grab allocated blocks forever >> while ignoring file system operations. For stage 1 we'll just need to >> fail those, and in the long run they will have to use a mechanism >> similar to FL_LAYOUT locks to deal with file system allocation changes. >> >> So Christoph wants to block truncate until references are released, forbid >> long term references until userspace acquiring them supports some kind of >> lease-breaking. OTOH you suggest truncate should just proceed leaving >> blocks allocated until references are released. > > I don't see what I'm suggesting is a solution to long term elevated > page counts. Just something that can park extents until layout > leases are broken and references released. That's a few tens of > seconds at most. > >> We cannot have both... I'm leaning more towards the approach >> Christoph suggests as it puts the burned to the place which is >> causing it - the application having long term references - and >> applications needing this should be sufficiently rare that we >> don't have to devise a general mechanism in the kernel for this. > > I have no problems with blocking truncate forever if that's the > desired solution for an elevated page count due to a DMA reference > to a page. But that has absolutely nothing to do with the filesystem > though - it's a page reference vs mapping invalidation problem, not > a filesystem/inode problem. > > Perhaps pages with active DAX DMA mapping references need a page > flag to indicate that invalidation must block on the page similar to > the writeback flag... We effectively already have this flag since pages where is_zone_device_page() == true can only have their reference count elevated by get_user_pages(). More importantly we can not block invalidation on an elevated page count because that page count may never drop until all references have been acquired. I.e. iov_iter_get_pages() grabs a range of pages potentially across multiple vmas and does not drop any references in the range until all pages have had their count elevated. >> If the solution Christoph suggests is acceptable to you, I think >> we should first write a patch to forbid acquiring long term >> references to DAX blocks. On top of that we can implement >> mechanism to block truncate while there are short term references >> pending (and for that retry loops would be IMHO acceptable). > > The problem with retry loops is that they are making a mess of an > already complex set of locking contraints on the indoe IO path. It's > rapidly descending into an unmaintainable mess - falling off the > locking cliff only make sthe code harder to maintain - please look > for solutions that don't require new locks or lock retry loops. I was hoping to make the retry loop no worse than the one we already perform for xfs_break_layouts(), and then the approach can be easily shared between ext4 and xfs. However before we get there, we need quite a bit of reworks (require struct page for dax, use pfns in the dax radix, disable long held page reference counts for DAX i.e. RDMA / V4L2...). I'll submit those preparation steps first and then we can circle back to the "how to wait for DAX-DMA to end" problem.