Re: [PATCH 05/28] xfs: iget for metadata inodes

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Tue, 15 Oct 2024 12:04:04 -0700

On Tue, Oct 15, 2024 at 05:55:23PM +1100, Dave Chinner wrote:
> On Thu, Oct 10, 2024 at 05:49:40PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> > 
> > Create a xfs_metafile_iget function for metadata inodes to ensure that
> > when we try to iget a metadata file, the inobt thinks a metadata inode
> > is in use and that the metadata type matches what we are expecting.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > Reviewed-by: Christoph Hellwig <hch@xxxxxx>
> > ---
> .....
> 
> > +/*
> > + * Get a metadata inode.  The metafile @type must match the inode exactly.
> > + * Caller must supply a transaction (even if empty) to avoid livelocking if the
> > + * inobt has a cycle.
> 
> Is this something we have to be concerned with for normal operation?

I think we should be more concerned about this everywhere.  Remember
that guy who fuzzed a btree so that the lower levels pointed "down" to
an upper level btree block?  Detection and recovery from that is what
running with xfs_trans_alloc_empty() buys us.  It's not a general cycle
detector as you point out below...

> We don't care about needing to detect inobt cycles this when doing
> lookups from the VFS during normal operation, so why is this an
> issue we have to be explicitly concerned about during metadir
> traversals?
> 
> Additionally, I can see how a corrupt btree ptr loop could cause a
> *deadlock* without a transaction context (i.e. end up waiting on a
> lock we already hold) but I don't see what livelock the transaction
> context prevents. It appears to me that it would only turn the
> deadlock into a livelock because the second buffer lookup will find
> the locked buffer already linked to the transaction and simply take
> another reference to it.  Hence it will just run around the loop of
> buffers taking references forever (i.e. a livelock) instead of
> deadlocking.

...because we don't detect loops within a particular level of a btree.
In theory this is possible if you are willing to enhance (for example)
a XFS_BB_RIGHTSIB move by comparing the rightmost record in the old
block against the leftmost record in the new block, but right now the
codebase (except for scrub) doesn't do that.

Maybe someone'll do that some day; it would close a pretty big hole in
runtime problem detection.

> Another question: why are we only concerned cycles in the inobt? If
> we've got a cycle in any other btree the metadir code might interact
> with (e.g. directories, free space, etc), we're going to have the
> same problems with deadlocks and/or livelocks on tree traversal. 

We /do/ hold empty transactions when calling xfs_metadir_load so that we
can error out on wrong-level types of dabtree cycles instead of hanging
the kernel.

> > + */
> > +int
> > +xfs_trans_metafile_iget(
> > +	struct xfs_trans	*tp,
> > +	xfs_ino_t		ino,
> > +	enum xfs_metafile_type	metafile_type,
> > +	struct xfs_inode	**ipp)
> > +{
> > +	struct xfs_mount	*mp = tp->t_mountp;
> > +	struct xfs_inode	*ip;
> > +	umode_t			mode;
> > +	int			error;
> > +
> > +	error = xfs_iget(mp, tp, ino, XFS_IGET_UNTRUSTED, 0, &ip);
> 
> Along a similar line: why don't we trust our own internal inode
> references to be correct and valid? These aren't structures that are
> externally visible or accessible, so the only thing that should be
> interacting with metadir inodes is trusted kernel code.
> 
> At minimum, this needs a big comment explaining why we can't trust
> our metadir structure and why this is different to normal inode
> lookup done for inodes accessed by path traversal from the root
> dir.
> 
> Also, doing a trusted lookup means we don't have to walk the inobt
> during the inode lookup, and so the deadlock/livelock problem goes
> away....
> 
> > @@ -1133,45 +1130,54 @@ xfs_rtmount_iread_extents(
> >   * Get the bitmap and summary inodes and the summary cache into the mount
> >   * structure at mount time.
> >   */
> > -int					/* error */
> > +int
> >  xfs_rtmount_inodes(
> > -	xfs_mount_t	*mp)		/* file system mount structure */
> > +	struct xfs_mount	*mp)
> >  {
> > -	int		error;		/* error return value */
> > -	xfs_sb_t	*sbp;
> > +	struct xfs_trans	*tp;
> > +	struct xfs_sb		*sbp = &mp->m_sb;
> > +	int			error;
> >  
> > -	sbp = &mp->m_sb;
> > -	error = xfs_iget(mp, NULL, sbp->sb_rbmino, 0, 0, &mp->m_rbmip);
> 
> .... and it's clear that we currently treat the superblock inodes as
> trusted inode numbers. We don't validate that they are allocated
> when we look them up, we trust that they have been correctly
> allocated and properly referenced on disk.
> 
> It's not clear to me why there's an undocumented change of
> trust for internal, trusted access only on-disk metadata being made
> here...

I feel it prudent to check for inconsistencies between the inobt state
and the inode records themselves, because bad space metadata can cause
disproportionately more problems in a filesystem than a regular file's
metadata.  Unlike regular files that can be unloaded and reloaded at
will, we only do this once, so the overhead of the inobt walk is
amortized over the entire mount-lifetime of the filesystem because we
don't irele metadata inodes.

IOWs it's not strictly necessary, but the cost is low to have this
additional safety.

But you're right, that should've been in the comment for this function.

--D

> -Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
>