Re: [PATCH 3/7] xfs: repair inode records

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Tue, 28 Nov 2023 15:08:48 -0800

On Tue, Nov 28, 2023 at 09:08:35AM -0800, Christoph Hellwig wrote:
> > @@ -1012,7 +1012,8 @@ enum xfs_dinode_fmt {
> >  #define XFS_DFORK_APTR(dip)	\
> >  	(XFS_DFORK_DPTR(dip) + XFS_DFORK_BOFF(dip))
> >  #define XFS_DFORK_PTR(dip,w)	\
> > -	((w) == XFS_DATA_FORK ? XFS_DFORK_DPTR(dip) : XFS_DFORK_APTR(dip))
> > +	((void *)((w) == XFS_DATA_FORK ? XFS_DFORK_DPTR(dip) : \
> > +					 XFS_DFORK_APTR(dip)))
> 
> Not requiring a cast when using XFS_DFORK_PTR is a good thing, but I
> think this is the wrong way to do it.  Instead of adding another cast
> here we can just change the char * cast in XFS_DFORK_DPTR to a void *
> one and rely on the widely used void pointer arithmetics extension in
> gcc (and clang).

Ok.

> That'll also need a fixup to use a void instead of
> char * cast in xchk_dinode.

I'll change the conditional to:

	if (XFS_DFORK_BOFF(dip) >= mp->m_sb.sb_inodesize)

> And in the long run many of these helpers relly should become inline
> functions..
> 
> > +	/* no large extent counts without the filesystem feature */
> > +	if ((flags2 & XFS_DIFLAG2_NREXT64) && !xfs_has_large_extent_counts(mp))
> > +		goto bad;
> 
> This is just a missing check and not really related to repair, is it?

Yep.  I guess I'll pull that out into a separate patch.

> > +	/*
> > +	 * The only information that needs to be passed between inode scrub and
> > +	 * repair is the location of the ondisk metadata if iget fails.  The
> > +	 * rest of struct xrep_inode is context data that we need to massage
> > +	 * the ondisk inode to the point that iget will work, which means that
> > +	 * we don't allocate anything at all if the incore inode is loaded.
> > +	 */
> > +	if (!imap)
> > +		return 0;
> 
> I don't really understand why this comment is here, and how it relates
> to the imap NULL check.  But as the only caller passes the address of an
> on-stack imap I also don't understand why the check is here to start
> with.

Hmm.  I think I've been through too many iterations of this code -- at
one point I remember the null check was actually useful for something.
But now it's not, so it can go.

> 
> > +	for (i = 0; i < ni; i++) {
> > +		ioff = i << mp->m_sb.sb_inodelog;
> > +		dip = xfs_buf_offset(bp, ioff);
> > +		agino = be32_to_cpu(dip->di_next_unlinked);
> > +
> > +		unlinked_ok = magic_ok = crc_ok = false;
> 
> I'd split the body of this loop into a separate helper and keep a lot of
> the variables local to it.

Ok.

> > +/* Reinitialize things that never change in an inode. */
> > +STATIC void
> > +xrep_dinode_header(
> > +	struct xfs_scrub	*sc,
> > +	struct xfs_dinode	*dip)
> > +{
> > +	trace_xrep_dinode_header(sc, dip);
> > +
> > +	dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
> > +	if (!xfs_dinode_good_version(sc->mp, dip->di_version))
> > +		dip->di_version = 3;
> 
> Can we ever end up here for v4 file systems? Because in that case
> the sane default inode version would be 2.

No.  xchk_validate_inputs will reject IFLAG_REPAIR on a V4 fs.  Those
are deprecated, there's no point in going back.

> > +
> > +/* Turn di_mode into /something/ recognizable. */
> > +STATIC void
> > +xrep_dinode_mode(
> > +	struct xfs_scrub	*sc,
> > +	struct xfs_dinode	*dip)
> > +{
> > +	uint16_t		mode;
> > +
> > +	trace_xrep_dinode_mode(sc, dip);
> > +
> > +	mode = be16_to_cpu(dip->di_mode);
> > +	if (mode == 0 || xfs_mode_to_ftype(mode) != XFS_DIR3_FT_UNKNOWN)
> 
> This is a somewhat odd way to check for a valid mode, but it works, so..

:)

> > +	if (xfs_has_reflink(mp) && S_ISREG(mode))
> > +		flags2 |= XFS_DIFLAG2_REFLINK;
> 
> We set the reflink flag by default, because a later stage will clear
> it if there aren't any shared blocks, right?  Maybe add a comment to
> avoid any future confusion.

	/*
	 * For regular files on a reflink filesystem, set the REFLINK flag to
	 * protect shared extents.  A later stage will actually check those
	 * extents and clear the flag if possible.
	 */

> 
> > +STATIC void
> > +xrep_dinode_zap_symlink(
> > +	struct xfs_scrub	*sc,
> > +	struct xfs_dinode	*dip)
> > +{
> > +	char			*p;
> > +
> > +	trace_xrep_dinode_zap_symlink(sc, dip);
> > +
> > +	dip->di_format = XFS_DINODE_FMT_LOCAL;
> > +	dip->di_size = cpu_to_be64(1);
> > +	p = XFS_DFORK_PTR(dip, XFS_DATA_FORK);
> > +	*p = '.';
> 
> Hmm, changing a symlink to actually point somewhere seems very
> surprising, but making it point to the current directory almost begs
> for userspace code to run in loops.

How about '🤷'?  That's only four bytes.

Or maybe a question mark.

> > +}
> > +
> > +/*
> > + * Blow out dir, make it point to the root.  In the future repair will
> > + * reconstruct this directory for us.  Note that there's no in-core directory
> > + * inode because the sf verifier tripped, so we don't have to worry about the
> > + * dentry cache.
> > + */
> 
> "make it point to root" isn't what I read in the code below.  I parents
> it in root I think.

Yes.  Changed to "Blow out dir, make the parent point to the root."

> > +/* Make sure we don't have a garbage file size. */
> > +STATIC void
> > +xrep_dinode_size(
> > +	struct xfs_scrub	*sc,
> > +	struct xfs_dinode	*dip)
> > +{
> > +	uint64_t		size;
> > +	uint16_t		mode;
> > +
> > +	trace_xrep_dinode_size(sc, dip);
> > +
> > +	mode = be16_to_cpu(dip->di_mode);
> > +	size = be64_to_cpu(dip->di_size);
> 
> Any reason to not simplify initialize the variables at declaration
> time?  (Same for a while bunch of other functions / variables)

No, not really.  Will fix.

> > +	if (xfs_has_reflink(sc->mp)) {
> > +		; /* data fork blockcount can exceed physical storage */
> 
> ... because we would be reflinking the same blocks into the same inode
> at different offsets over and over again ... ?

Yes.  That's not a terribly functional file, but users can do such
things if they want to pay for the cpu/metadata.

> Still, shouldn't we limit the condition to xfs_is_reflink_inode?

Yep.

> > +/* Check for invalid uid/gid/prid. */
> > +STATIC void
> > +xrep_inode_ids(
> > +	struct xfs_scrub	*sc)
> > +{
> > +	bool			dirty = false;
> > +
> > +	trace_xrep_inode_ids(sc);
> > +
> > +	if (i_uid_read(VFS_I(sc->ip)) == -1U) {
> 
> What is invalid about all-F uid/gid/projid?

I thought those were invalid, though apparently they're not now?
uidgid.h says:

static inline bool uid_valid(kuid_t uid)
{
	return __kuid_val(uid) != (uid_t) -1;
}

Which is why I thought that it's not possible to have a uid of -1 on a
file.  Trying to set that uid on a file causes the kernel to reject the
value, but OTOH I can apparently create inodes with a -1 UID via
idmapping shenanigans.

<shrug>

> > +	tstamp = inode_get_atime(inode);
> > +	xrep_clamp_timestamp(ip, &tstamp);
> > +	inode_set_atime_to_ts(inode, tstamp);
> 
> Meh, I hate these new VFS timestamp access helper..

They're very clunky.

> > +	/* Find the last block before 32G; this is the dir size. */
> > +	error = xfs_iread_extents(sc->tp, sc->ip, XFS_DATA_FORK);
> 
> I think that comments needs to go down to the off asignment and
> xfs_iext_lookup_extent_before call.

Done.

> > +/*
> > + * Fix any irregularities in an inode's size now that we can iterate extent
> > + * maps and access other regular inode data.
> > + */
> > +STATIC void
> > +xrep_inode_size(
> > +	struct xfs_scrub	*sc)
> > +{
> > +	trace_xrep_inode_size(sc);
> > +
> > +	/*
> > +	 * Currently we only support fixing size on extents or btree format
> > +	 * directories.  Files can be any size and sizes for the other inode
> > +	 * special types are fixed by xrep_dinode_size.
> > +	 */
> > +	if (!S_ISDIR(VFS_I(sc->ip)->i_mode))
> > +		return;
> 
> I think moving this check to the caller and renaming the function would
> be a bit nicer, especially if we grow more file type specific checks
> in the future.

That's the only one, so I'll rename it to xrep_inode_dir_size and hoist
this check to the caller.

> Otherwise this looks reasonable to me.

Woo, thanks for reading through all this. :)

--D