Re: [PATCH 08/12] xfs: remove xfs_ifork_ops

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Mon, 18 May 2020 09:07:50 -0700

On Mon, May 18, 2020 at 09:35:46AM -0400, Brian Foster wrote:
> On Sat, May 16, 2020 at 10:48:02AM -0700, Darrick J. Wong wrote:
> > On Sat, May 09, 2020 at 07:13:44AM -0400, Brian Foster wrote:
> > > On Sat, May 09, 2020 at 10:17:15AM +0200, Christoph Hellwig wrote:
> > > > On Fri, May 08, 2020 at 11:05:43AM -0400, Brian Foster wrote:
> > > > > On Fri, May 08, 2020 at 08:34:19AM +0200, Christoph Hellwig wrote:
> > > > > > xfs_ifork_ops add up to two indirect calls per inode read and flush,
> > > > > > despite just having a single instance in the kernel.  In xfsprogs
> > > > > > phase6 in xfs_repair overrides the verify_dir method to deal with inodes
> > > > > > that do not have a valid parent, but that can be fixed pretty easily
> > > > > > by ensuring they always have a valid looking parent.
> > > > > > 
> > > > > > Signed-off-by: Christoph Hellwig <hch@xxxxxx>
> > > > > > ---
> > > > > 
> > > > > Code looks fine, but I assume we'll want a repair fix completed and
> > > > > merged before wiping this out:
> > > > 
> > > > With the xfsprogs merge delays I'm not sure merged will work, but I'll
> > > > happily take your patch and get it in shape for submission.
> > > > 
> > > 
> > > The critical bit is that repair is fixed before this lands in xfsprogs,
> > > otherwise we just reintroduce the regression the callback mechanism was
> > > designed to fix. The repair change is not huge, but it's not necessarily
> > > trivial so it's probably worth making sure the repair change is at least
> > > reviewed before putting this into the kernel pipeline.
> > > 
> > > BTW, I played with this a bit more yesterday and made some tweaks that I
> > > think make it a little cleaner. Namely instead of processing the parent
> > > bits in phases 3 and 4 and setting the parent in the internal structures
> > > in phase 4, to do everything in phase 3 and skip the repeat checks in
> > > phase 4. This has the side effect of eliminating some duplicate error
> > > messages where repair complains about the original bogus value in phase
> > > 3, sets it to zero, and then complains about the zero value again in
> > > phase 4. This still needs some auditing to assess whether we're losing
> > > any extra verification by setting the parent in phase 3, however. It
> > > also might be worth looking at giving the other dir formats the same
> > > treatment. Squashed diff of my local tree below...
> > > 
> > > Brian
> > > 
> > > diff --git a/repair/dino_chunks.c b/repair/dino_chunks.c
> > > index 6685a4d2..96ed6a5b 100644
> > > --- a/repair/dino_chunks.c
> > > +++ b/repair/dino_chunks.c
> > > @@ -859,14 +859,7 @@ next_readbuf:
> > >  		 */
> > >  		if (isa_dir)  {
> > >  			set_inode_isadir(ino_rec, irec_offset);
> > > -			/*
> > > -			 * we always set the parent but
> > > -			 * we may as well wait until
> > > -			 * phase 4 (no inode discovery)
> > > -			 * because the parent info will
> > > -			 * be solid then.
> > > -			 */
> > > -			if (!ino_discovery)  {
> > > +			if (ino_discovery)  {
> > >  				ASSERT(parent != 0);
> > >  				set_inode_parent(ino_rec, irec_offset, parent);
> > >  				ASSERT(parent ==
> > > diff --git a/repair/dir2.c b/repair/dir2.c
> > > index cbbce601..9c789b4a 100644
> > > --- a/repair/dir2.c
> > > +++ b/repair/dir2.c
> > > @@ -165,7 +165,6 @@ process_sf_dir2(
> > >  	int			tmp_elen;
> > >  	int			tmp_len;
> > >  	xfs_dir2_sf_entry_t	*tmp_sfep;
> > > -	xfs_ino_t		zero = 0;
> > >  
> > >  	sfp = (struct xfs_dir2_sf_hdr *)XFS_DFORK_DPTR(dip);
> > >  	max_size = XFS_DFORK_DSIZE(dip, mp);
> > > @@ -480,6 +479,9 @@ _("corrected entry offsets in directory %" PRIu64 "\n"),
> > >  	 * check parent (..) entry
> > >  	 */
> > >  	*parent = libxfs_dir2_sf_get_parent_ino(sfp);
> > > +	if (!ino_discovery)
> > > +		return 0;
> > > +
> > >  
> > >  	/*
> > >  	 * if parent entry is bogus, null it out.  we'll fix it later .
> > > @@ -494,7 +496,7 @@ _("bogus .. inode number (%" PRIu64 ") in directory inode %" PRIu64 ", "),
> > >  		if (!no_modify)  {
> > >  			do_warn(_("clearing inode number\n"));
> > >  
> > > -			libxfs_dir2_sf_put_parent_ino(sfp, zero);
> > > +			libxfs_dir2_sf_put_parent_ino(sfp, mp->m_sb.sb_rootino);
> > >  			*dino_dirty = 1;
> > >  			*repair = 1;
> > >  		} else  {
> > > @@ -529,7 +531,7 @@ _("bad .. entry in directory inode %" PRIu64 ", points to self, "),
> > >  		if (!no_modify)  {
> > >  			do_warn(_("clearing inode number\n"));
> > >  
> > > -			libxfs_dir2_sf_put_parent_ino(sfp, zero);
> > > +			libxfs_dir2_sf_put_parent_ino(sfp, mp->m_sb.sb_rootino);
> > >  			*dino_dirty = 1;
> > >  			*repair = 1;
> > >  		} else  {
> > > diff --git a/repair/phase6.c b/repair/phase6.c
> > > index beceea9a..43bcea50 100644
> > > --- a/repair/phase6.c
> > > +++ b/repair/phase6.c
> > > @@ -26,58 +26,6 @@ static struct xfs_name		xfs_name_dot = {(unsigned char *)".",
> > >  						1,
> > >  						XFS_DIR3_FT_DIR};
> > >  
> > > -/*
> > > - * When we're checking directory inodes, we're allowed to set a directory's
> > > - * dotdot entry to zero to signal that the parent needs to be reconnected
> > > - * during phase 6.  If we're handling a shortform directory the ifork
> > > - * verifiers will fail, so temporarily patch out this canary so that we can
> > > - * verify the rest of the fork and move on to fixing the dir.
> > > - */
> > > -static xfs_failaddr_t
> > > -phase6_verify_dir(
> > > -	struct xfs_inode		*ip)
> > > -{
> > > -	struct xfs_mount		*mp = ip->i_mount;
> > > -	struct xfs_ifork		*ifp;
> > > -	struct xfs_dir2_sf_hdr		*sfp;
> > > -	xfs_failaddr_t			fa;
> > > -	xfs_ino_t			old_parent;
> > > -	bool				parent_bypass = false;
> > > -	int				size;
> > > -
> > > -	ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
> > > -	sfp = (struct xfs_dir2_sf_hdr *)ifp->if_u1.if_data;
> > > -	size = ifp->if_bytes;
> > > -
> > > -	/*
> > > -	 * If this is a shortform directory, phase4 may have set the parent
> > > -	 * inode to zero to indicate that it must be fixed.  Temporarily
> > > -	 * set a valid parent so that the directory verifier will pass.
> > > -	 */
> > > -	if (size > offsetof(struct xfs_dir2_sf_hdr, parent) &&
> > > -	    size >= xfs_dir2_sf_hdr_size(sfp->i8count)) {
> > > -		old_parent = libxfs_dir2_sf_get_parent_ino(sfp);
> > > -		if (old_parent == 0) {
> > > -			libxfs_dir2_sf_put_parent_ino(sfp, mp->m_sb.sb_rootino);
> > > -			parent_bypass = true;
> > > -		}
> > > -	}
> > > -
> > > -	fa = libxfs_default_ifork_ops.verify_dir(ip);
> > > -
> > > -	/* Put it back. */
> > > -	if (parent_bypass)
> > > -		libxfs_dir2_sf_put_parent_ino(sfp, old_parent);
> > > -
> > > -	return fa;
> > > -}
> > > -
> > > -static struct xfs_ifork_ops phase6_ifork_ops = {
> > > -	.verify_attr	= xfs_attr_shortform_verify,
> > > -	.verify_dir	= phase6_verify_dir,
> > > -	.verify_symlink	= xfs_symlink_shortform_verify,
> > > -};
> > > -
> > >  /*
> > >   * Data structures used to keep track of directories where the ".."
> > >   * entries are updated. These must be rebuilt after the initial pass
> > > @@ -1104,7 +1052,7 @@ mv_orphanage(
> > >  					(unsigned long long)ino, ++incr);
> > >  
> > >  	/* Orphans may not have a proper parent, so use custom ops here */
> > > -	err = -libxfs_iget(mp, NULL, ino, 0, &ino_p, &phase6_ifork_ops);
> > > +	err = -libxfs_iget(mp, NULL, ino, 0, &ino_p, &xfs_default_ifork_ops);
> > 
> > Hmm.  I'll have to look at this more thoroughly on a non-weekend, but I
> > think I like this approach, since it removes the weird quirk that if
> > repair fails after writing out a sf directory with parent==0, we'll have
> > transformed an fs with bad directory parent pointers to an fs with a
> > directory that totally fails the verifier.
> > 
> 
> I don't _think_ xfs_repair should ever write back the parent == 0
> condition it creates. IIUC it's intended to be a transient state and the
> directory should eventually be fixed up or junked (or maybe lost+found?)

I've found those zeroes on disk in the past.  I think that happens if
you tweak the buffer cache in just the right way that a subsequent
cache_node_lookup causes a cache shake that starts writing things back
to disk to shove buffers onto the MRU.  In that particular instance, the
cache behavior was due to some other stupid bug that I'd accidentally
put in my dev tree, but at the time I recall noticing that it is
theoretically possible for buffers to get written out before the
explicit flush at the end.

> in phase 6. However there is a similar quirk within xfs_repair itself in
> that phase 3 sets the parent from whatever bad value it is to zero, then
> phase 4 repeats the same general scan and complains about parent == 0
> because that is also invalid. ;P

Yeah, that's also superfluous behavior. :)

--D

> Brian
> 
> > So for the kernel patch, provisionally:
> > Reviewed-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > 
> > --D
> > 
> > >  	if (err)
> > >  		do_error(_("%d - couldn't iget disconnected inode\n"), err);
> > >  
> > > @@ -2875,7 +2823,7 @@ process_dir_inode(
> > >  
> > >  	ASSERT(!is_inode_refchecked(irec, ino_offset) || dotdot_update);
> > >  
> > > -	error = -libxfs_iget(mp, NULL, ino, 0, &ip, &phase6_ifork_ops);
> > > +	error = -libxfs_iget(mp, NULL, ino, 0, &ip, &xfs_default_ifork_ops);
> > >  	if (error) {
> > >  		if (!no_modify)
> > >  			do_error(
> > > 
> > 
>