Re: [PATCH v4 18/18] xfs, dax: wire up dax_flush_dma support via a new xfs_sync_dma helper

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 3 Jan 2018 10:00:27 +1100

On Sat, Dec 23, 2017 at 04:57:37PM -0800, Dan Williams wrote:
> xfs_break_layouts() scans for active pNFS layouts, drops locks and
> rescans for those layouts to be broken. xfs_sync_dma performs
> xfs_break_layouts and also scans for active dax-dma pages, drops locks
> and rescans for those pages to go idle.
> 
> dax_flush_dma handles synchronizing against new page-busy events
> (get_user_pages). iIt invalidates all mappings to trigger the
> get_user_pages slow path which will eventually block on the
> XFS_MMAPLOCK. If it finds a dma-busy page it waits for a page-idle
> callback that will fire when the page reference count reaches 1 (recall
> ZONE_DEVICE pages are idle at count 1). While it is waiting, it drops
> locks so we do not deadlock the process that might be trying to elevate
> the page count of more pages before arranging for any of them to go idle
> as is typically the case of iov_iter_get_pages.
> 
> dax_flush_dma relies on the fs-provided wait_atomic_t_action_f
> (xfs_wait_dax_page) to handle evaluating the page reference count and
> dropping locks when waiting.

I don't see a problem with supporting this functionality, but I
see lots of problems with the code being presented. First of all,
I think the "sync dma" abstraction here is all wrong.

In the case of the filesystem, we don't care about whether DMA has
completed or not, and we *shouldn't have to care* about deep, dark
secrets of other subsystems.

If I read the current code, I see this in all the "truncate" paths:

	start op
	break layout leases
	change layout

and in the IO path:

	start IO
	break layout leases
	map IO
	issue IO

What this change does is make the truncate paths read:

	start op
	sync DMA
	change layout

but the IO path is unchanged. (This is not explained in comments or
commit messages).

And I look at that "sync DMA" step and wonder why the hell we need
to "sync DMA" because DMA has nothing to do with high level
filesystem code. It doesn't tell me anything obvious about why we
need to do this, nor does it tell me what we're actually
synchronising against.

What we care about in the filesystem code is whether there are
existing external references to file layout. If there's an external
reference, then it has to be broken before we can proceed and modify
the file layout. We don't care what owns that reference, just that
it has to broken before we continue.

AFAIC, these DMA references are just another external layout
reference that needs to be broken.  IOWs, this "sync DMA" complexity
needs to go inside xfs_break_layouts() as it is part of breaking the
external reference to the file layout - it does not replace the
layout breaking abstraction and so the implementation needs to
reflect that.

> + * Synchronize [R]DMA before changing the file's block map. For pNFS,
> + * recall all layouts. For DAX, wait for transient DMA to complete. All
> + * other DMA is handled by pinning page cache pages.
> + *
> + * iolock must held XFS_IOLOCK_SHARED or XFS_IOLOCK_EXCL on entry and
> + * will be XFS_IOLOCK_EXCL and XFS_MMAPLOCK_EXCL on exit.
> + */
> +int xfs_sync_dma(
> +	struct inode		*inode,
> +	uint			*iolock)
> +{
> +	struct xfs_inode	*ip = XFS_I(inode);
> +	int			error;
> +
> +	while (true) {
> +		error = xfs_break_layouts(inode, iolock);
> +		if (error)
> +			break;
> +
> +		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> +		*iolock |= XFS_MMAPLOCK_EXCL;
> +
> +		error = dax_flush_dma(inode->i_mapping, xfs_wait_dax_page);
> +		if (error <= 0)
> +			break;
> +		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> +		*iolock &= ~XFS_MMAPLOCK_EXCL;
> +	}

At this level the loop seems sub-optimal. If we don't drop the
IOLOCK, then we have no reason to call xfs_break_layouts() a second
time.  Hence in isolation this loop doesn' make sense. Yes, I
realise that dax_flush_dma() can result in all locks on the inode
being dropped, but that's hidden in another function whose calling
scope is not at all obvious from this code.

Also, xfs_wait_dax_page() assumes we have IOLOCK_EXCL held when it
is called. Nothing enforces the requirement that xfs_sync_dma() is
passed XFS_IOLOCK_EXCL, and so such assumptions cannot be made.
Even if it was, I really dislike the idea of a function that
/assumes/ lock state - that's a landmine that will bite us in the
rear end at some unexpected point in the future. If you need to
cycle held locks on an inode, you need to pass the held lock state
to the function.

Another gripe I have is that calling xfs_sync_dma() implies the mmap
lock is held exclusively on return. Hiding this locking inside
xfs_sync_dma removes the code documentation that large tracts of
code are protected against page faults by the XFS_MMAPLOCK_EXCL lock
call.  Instead of knowing at a glance that the truncate path
(xfs_vn_setattr) is protected against page faults, I have to
remember that xfs_sync_dma() now does this.

This goes back to my initial comments of "what the hell does "sync
dma" mean in a filesystem context?" - it certainly doesn't make me
think about inode locking. I don't like hidden/implicitly locking
like this, because it breaks both the least-surprise and the
self-documenting code principles....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html