Re: [PATCH 3/3] xfs: only iget the file once when doing vectored scrub-by-handle

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Thu, 11 Apr 2024 17:21:33 -0700

On Thu, Apr 11, 2024 at 07:02:07AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 10, 2024 at 10:21:07PM -0700, Darrick J. Wong wrote:
> > On Wed, Apr 10, 2024 at 10:02:23PM -0700, Christoph Hellwig wrote:
> > > On Wed, Apr 10, 2024 at 09:56:45PM -0700, Darrick J. Wong wrote:
> > > > > Well, someone needs to own it, it's just not just ext4 but could us.
> > > > 
> > > > Er... I don't understand this?        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > 
> > > If we set current->journal and take a page faul we could not just
> > > recurse into ext4 but into any fs including XFS.  Any everyone
> > > blindly dereferences is as only one fs can own it.
> > 
> > Well back before we ripped it out I had said that XFS should just set
> > current->journal to 1 to prevent memory corruption but then Jan Kara
> > noted that ext4 changes its behavior wrt jbd2 if it sees nonzero
> > current->journal.  That's why Dave dropped it entirely.
> 
> If you are in a fs context you own current->journal_info.  But you
> also must make sure to not copy from and especially to user to not
> recurse into another file system.  A per-thread field can't work any
> other way.  So what ext4 is doing here is perfectly fine.  What XFS
> did was to set current->journal_info and then cause page faults, which
> is not ok.  I'm glad we fixed it.
> 
> > > That seems a bit dangerous to me.  I guess we rely on the code inside
> > > the transaction context to never race with unmount as lack of SB_ACTIVE
> > > will make the VFS ignore the dontcache flag.
> > 
> > That and we have an open fd to call the ioctl so any unmount will fail,
> > and we can't enter scrub if unmount already starte.
> 
> Indeed.
> 
> So I'm still confused on why this new code keeps the inode around if an
> error happend, but xchk_irele does not.  What is the benefit of keeping
> the inode around here?  Why des it not apply to xchk_irele?

OH!  Crap, I forgot that some years ago (after the creation of the
vectorized scrub patch) I cleaned up that behavior -- previously scrub
actually did play games with clearing dontcache if the inode was sick.

Then Dave pointed out that we could just change reclaim not to purge the
incore inode (and hence preserve the health state) until unmount or the
fs goes down, and clear I_DONTCACHE any time we notice bad metadata.
Hopefully the incore inode then survives long enough that anyone
scanning for filesystem health status will still see the badness state.

Therefore, we don't need the set_dontcache variable in this patch:

	/*
	 * If we're holding the only reference to an inode opened via
	 * handle, mark it dontcache so that we don't pollute the cache.
	 */
	if (handle_ip) {
		if (atomic_read(&VFS_I(handle_ip)->i_count) == 1)
			d_mark_dontcache(VFS_I(handle_ip));
		xfs_irele(handle_ip);
	}

> I also don't understand how d_mark_dontcache in
> xfs_ioctl_setattr_prepare_dax is supposed to work.  It'll make the inode
> go away quicker than without, but it can't force the inode by itself.

That's correct.  You can set the ondisk fsdax iflag and then wait
centuries for the incore fsdax to catch up.  I think this is a very
marginal design, but thankfully the intended design is that you set
daxinherit on the parent dir or mount with dax=always and all new files
just come up with both fsdax flags set.

> I'm also lot on the interaction of that with the scrub inodes due to
> both above.  I'd still expect any scrub iget to set uncached for
> a cache miss.  If we then need to keep the inode around in transaction
> context we just keep it.  What is the benefit of playing racing
> games with i_count to delay setting the dontcache flag until irele?

One thing I don't like about XFS_IGET_DONTCACHE is that a concurrent
iget call without DONTCACHE won't clear the state from the inode, which
can lead to unnecessary evictions.  This racy thing means we only set it
if we think the inode isn't in use anywhere else.

At this point, though, I think we could add XFS_IGET_DONTCACHE to all
the iget calls and drop the irele dontcache thing.

> And why does the DAX mess matter for that?

fsdax doesn't matter, it merely slurped up the functionality and now
we're tangled up in it.

> Maybe I'm just thick and this is all obvious, but then it needs to
> be documented in detailed comments.

No, it's ... very twisty and weird.  I wish dontcache had remained our
private xfs thing.

--D