Re: [PATCH 5/5] xfs: fix agfl wrapping

On Wed, Feb 28, 2018 at 05:43:51PM -0500, Brian Foster wrote:
> On Tue, Feb 27, 2018 at 01:03:13PM -0800, Darrick J. Wong wrote:
> > On Tue, Feb 27, 2018 at 02:35:49PM -0500, Brian Foster wrote:
> > > On Thu, Feb 22, 2018 at 06:00:15PM -0800, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > ---
> > > >  fs/xfs/Makefile     |    1 
> > > >  fs/xfs/xfs_fixups.c |  310 +++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/xfs_fixups.h |   26 ++++
> > > >  fs/xfs/xfs_mount.c  |   21 +++
> > > >  fs/xfs/xfs_super.c  |   10 ++
> > > >  5 files changed, 367 insertions(+), 1 deletion(-)
> > > >  create mode 100644 fs/xfs/xfs_fixups.c
> > > >  create mode 100644 fs/xfs/xfs_fixups.h
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> > > > index b03c77e..f88368a 100644
> > > > --- a/fs/xfs/Makefile
> > > > +++ b/fs/xfs/Makefile
> > > > @@ -86,6 +86,7 @@ xfs-y				+= xfs_aops.o \
> > > >  				   xfs_extent_busy.o \
> > > >  				   xfs_file.o \
> > > >  				   xfs_filestream.o \
> > > > +				   xfs_fixups.o \
> > > >  				   xfs_fsmap.o \
> > > >  				   xfs_fsops.o \
> > > >  				   xfs_globals.o \
> > > > diff --git a/fs/xfs/xfs_fixups.c b/fs/xfs/xfs_fixups.c
> > > > new file mode 100644
> > > > index 0000000..0cad7bb
> > > > --- /dev/null
> > > > +++ b/fs/xfs/xfs_fixups.c
> > > > @@ -0,0 +1,310 @@
> > > > +/*
> > > > + * Copyright (C) 2018 Oracle.  All Rights Reserved.
> > > > + *
> > > > + * Author: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > + *
> > > > + * This program is free software; you can redistribute it and/or
> > > > + * modify it under the terms of the GNU General Public License
> > > > + * as published by the Free Software Foundation; either version 2
> > > > + * of the License, or (at your option) any later version.
> > > > + *
> > > > + * This program is distributed in the hope that it would be useful,
> > > > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > > + * GNU General Public License for more details.
> > > > + *
> > > > + * You should have received a copy of the GNU General Public License
> > > > + * along with this program; if not, write the Free Software Foundation,
> > > > + * Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301, USA.
> > > > + */
> > > > +#include "xfs.h"
> > > > +#include "xfs_fs.h"
> > > > +#include "xfs_shared.h"
> > > > +#include "xfs_format.h"
> > > > +#include "xfs_log_format.h"
> > > > +#include "xfs_trans_resv.h"
> > > > +#include "xfs_sb.h"
> > > > +#include "xfs_mount.h"
> > > > +#include "xfs_alloc.h"
> > > > +#include "xfs_trans.h"
> > > > +#include "xfs_fixups.h"
> > > > +
> > > > +/*
> > > > + * v5 AGFL padding defects
> > > > + *
> > > > + * When the v5 format was first introduced, there was a defect in the struct
> > > > + * xfs_agfl definition that resulted in XFS_AGFL_SIZE returning different
> > > 
> > > XFS_AGFL_SIZE() no longer exists as of the previous patch. It might be
> > > better to refer to "the agfl size" generically.
> > 
> > How about:
> > 
> > "...the agfl length calculation formula (at the time known as
> > XFS_AGFL_SIZE)..."
> > 
> 
> The formula formerly known as..? ;) Heh, sounds fine.

lol. :)

> > > > + * values depending on the compiler padding.  On a fs with 512-byte sectors,
> > > > + * this meant that XFS_AGFL_SIZE was 119 on i386, but 118 on x64.  Commit
> > > > + * 96f859d52bcb1 ("libxfs: pack the agfl header structure so XFS_AGFL_SIZE is
> > > > + * correct") changed the definition to disable padding the end of the
> > > > + * structure, and was accepted into Linux 4.5.  Since then, the AGFL has
> > > > + * always used the larger size (e.g. 119 entries on a 512b sector fs).
> > > > + *
> > > > + * Unfortunately, pre-4.5 kernels can produce filesystems with AGFLs that wrap
> > > > + * at the smaller size, and those kernels are not prepared to handle the
> > > > + * longer size.  This typically manifests itself as an AGF verifier corruption
> > > > + * error followed by a filesystem shutdown.  While we encourage admins to stay
> > > > + * current with software, we would like to avoid this intermittent breakage.
> > > > + *
> > > > + * Any v5 filesystem which has a feature bit set for a feature that was
> > > > + * introduced after Linux 4.5 will not have this problem, as such kernels
> > > > + * cannot be mounted on older kernels.  v4 filesystems are also unaffected.
> > > > + *
> > > > + * Therefore, we add two fixup functions -- the first runs at mount time to
> > > > + * detect a short-wrapped AGFL and fix it; the second runs at unmount, freeze,
> > > > + * or remount-ro time to move a wrapped AGFL to the beginning of the list.
> > > > + * This reduces the likelihood of a screwup to the scenario where you have (a)
> > > > + * a filesystem with no post-4.5 features (reflink, rmap), (b) the AGFL wraps,
> > > > + * (c) the filesystem goes down leaving a dirty log, and (d) the dirty
> > > > + * filesystem is mounted on an old kernel.
> > > > + */
> > > > +
> > > 
> > > While the mount vs. unmount time fixups serve different purposes and
> > > have different conditions, do we really need two separate fixup
> > > implementations? E.g., if we have to unwrap the AGFL at unmount time,
> > > why not just reuse that fixup at mount time (when needed) as well? I
> > > suspect the difference would only be the length of what we can consider
> > > valid in the agfl at the time.
> > 
> > They're doing two different things.  I will use ASCII art to demonstrate
> > the mount-time function:
> > 
> 
> Yes, the question was whether we need to do two different things. For
> example...
> 
> > Let's say the AGFL block has space for 10 items.  It looks like this:
> > 
> >          0   1   2   3   4   5   6   7   8   9
> > Header |   |   |   |   |   |   |   |   |   |   |
> > 
> > 
> > When the fs is freshly formatted or repaired, the AGFL will look like:
> > 
> >          0   1   2   3   4   5   6   7   8   9
> > Header | A | B | C | D | E | F |   |   |   |   |
> >          ^-- flfirst         ^-- fllast
> > 
> > Due to the padding problems prior to 4.5 ("4.5 agfl padding fix"),
> > XFS_AGFL_SIZE would return 10 on 32-bit systems and 9 on 64-bit systems.
> > Therefore, if the AGFL wrapped on a 64-bit kernel we would end up with:
> > 
> >          0   1   2   3   4   5   6   7   8   9
> > Header | D | E | F |   |   |   | A | B | C | ? |
> >                  ^-- fllast      ^-- flfirst
> > 
> > Note that block 9's contents are undefined because we didn't write
> > anything there.  Now the "4.5 agfl padding fix" has corrected
> > XFS_AGFL_SIZE to return 10 in all cases, this looks like an AGF
> > corruption because flcount is 6 but the distance between flfirst and
> > fllast is 7.  So long as the list wraps and the mismatch is exactly one
> > block we can fix this.  The mount time fixer in this patch does this by
> > moving the records between flfirst and xfs_agfl_size (A, B, and C) to
> > the end of the block to close the gap:
> > 
> >          0   1   2   3   4   5   6   7   8   9
> > Header | D | E | F |   |   |   |   | A | B | C |
> >                  ^-- fllast          ^-- flfirst
> > 
> > If the count is off by more than 1 then the AGF is truly corrupt and we
> > bail out.
> > 
> > Now let's say that we run for a while and want to unmount, but our AGFL
> > wraps like this:
> > 
> >          0   1   2   3   4   5   6   7   8   9
> > Header | T | U | V |   |   |   |   | Q | R | S |
> >                  ^-- fllast          ^-- flfirst
> > 
> > We don't know that the next kernel to mount this filesystem will have
> > the "4.5 agfl padding fix" applied to it; if it does not, it will flag
> > the AGF as corrupted because flcount is 6 but in its view the distance
> > between flfirst and fllast (which omits bno[9]) is 5.  We don't want
> > it to choke on that, so the unmount fixer moves all the records between
> > flfirst and the end (Q, R, and S) towards the start of the block and
> > resets flfirst/fllast:
> > 
> >          0   1   2   3   4   5   6   7   8   9
> > Header | T | U | V | Q | R | S |   |   |   |   |
> >          ^-- flfirst         ^-- fllast
> > 
> 
> Could we use this basic algorithm at mount time as well? I know it
> wouldn't be _exactly_ the same operation at mount time as it is for
> unmount since slot 9 is a gap in the former case, but afaict the only
> difference from an algorithmic perspective is the length of the shift.
> 
> IOW, if we were to parameterize the length of the shift and have the
> mount fixup call the unmount algorithm, would it not also address the
> problem?

Yes, I believe that would work.  I think it would be more efficient in
the patch below to memmove the entries instead of put_freelist'ing them
individually.

If the agfl is completely full then fix_freelist ought to trim it down
by at least one element.  The algorithm then becomes:

if mounting and flfirst < fllast:
	return 0
if flcount == agfl_size:
	assert !mounting
	fix_freelist()
	assert flcount < agfl_size
if flfirst < fllast:
	return 0
active = fllast - flfirst + 1 (mod agfl_size)
movelen = agfl_size - flfirst
if active == flcount + 1:
	movelen--
memmove(&agflbno[fllast + 1], &agflbno[flfirst], movelen)
flfirst = 0
fllast = fllast + movelen

> > Since the AGFL no longer wraps at all, it doesn't matter if the next
> > kernel to mount this filesystem has the "4.5 agfl padding fix" applied;
> > all kernels can handle this correctly.
> > 
> > > > +/*
> > > > + * Decide if we need to have the agfl wrapping fixes applied.  This only
> > > > + * affects v5 filesystems that do not have any features enabled that did not
> > > > + * exist when the agfl padding fix went in.
> > > > + *
> > > > + * Features already present when the fix went in were finobt, ftype, spinodes.
> > > > + * If we see something new (e.g. reflink) then don't bother.
> > > > + */
> > > > +#define XFS_SB_FEAT_RO_COMPAT_AGFL_WRAP_ALREADY_FIXED \
> > > > +		(~(XFS_SB_FEAT_RO_COMPAT_FINOBT))
> > > > +#define XFS_SB_FEAT_INCOMPAT_AGFL_WRAP_ALREADY_FIXED \
> > > > +		(~(XFS_SB_FEAT_INCOMPAT_FTYPE | \
> > > > +		   XFS_SB_FEAT_INCOMPAT_SPINODES))
> > > > +#define XFS_SB_FEAT_INCOMPAT_LOG_AGFL_WRAP_ALREADY_FIXED \
> > > > +		(~0)
> > > > +static inline bool xfs_sb_version_needs_agfl_wrap_fixes(struct xfs_sb *sbp)
> > > > +{
> > > > +	return xfs_sb_version_hascrc(sbp) &&
> > > > +		!xfs_sb_has_incompat_feature(sbp,
> > > > +			XFS_SB_FEAT_INCOMPAT_AGFL_WRAP_ALREADY_FIXED) &&
> > > > +		!xfs_sb_has_ro_compat_feature(sbp,
> > > > +			XFS_SB_FEAT_RO_COMPAT_AGFL_WRAP_ALREADY_FIXED) &&
> > > > +		!xfs_sb_has_incompat_log_feature(sbp,
> > > > +			XFS_SB_FEAT_INCOMPAT_LOG_AGFL_WRAP_ALREADY_FIXED);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Fix an AGFL wrapping that falls short of the end of the block by filling the
> > > > + * gap at the end of the block.
> > > > + */
> > > > +STATIC int
> > > > +xfs_fixup_freelist_wrap_mount(
> > > > +	struct xfs_trans	*tp,
> > > > +	struct xfs_buf		*agfbp,
> > > > +	struct xfs_perag	*pag)
> > > > +{
> > > > +	struct xfs_mount	*mp = tp->t_mountp;
> > > > +	struct xfs_agf		*agf;
> > > > +	struct xfs_buf		*agflbp;
> > > > +	__be32			*agfl_bno;
> > > > +	xfs_agnumber_t		agno;
> > > > +	uint32_t		agfl_size;
> > > > +	uint32_t		flfirst;
> > > > +	uint32_t		fllast;
> > > > +	int32_t			active;
> > > > +	int			offset;
> > > > +	int			len;
> > > > +	int			error;
> > > > +
> > > > +	if (pag->pagf_flcount == 0)
> > > > +		return 0;
> > > > +
> > > > +	agfl_size = xfs_agfl_size(mp);
> > > > +	agf = XFS_BUF_TO_AGF(agfbp);
> > > > +	agno = be32_to_cpu(agf->agf_seqno);
> > > > +	flfirst = be32_to_cpu(agf->agf_flfirst);
> > > > +	fllast = be32_to_cpu(agf->agf_fllast);
> > > > +
> > > > +	/* Make sure we're either spot on or off by 1. */
> > > > +	active = fllast - flfirst + 1;
> > > > +	if (active <= 0)
> > > > +		active += agfl_size;
> > > > +	if (active == pag->pagf_flcount)
> > > > +		return 0;
> > > > +	else if (active != pag->pagf_flcount + 1)
> > > > +		return -EFSCORRUPTED;
> > > > +
> > > 
> > > So we're not attempting to cover the case where the agfl has 1 more
> > > block than the agfl size (i.e., the case where an fs goes back to a
> > > kernel with an unpacked header)?
> > 
> > We don't know how the next kernel to touch this filesystem will define
> > XFS_AGFL_SIZE -- it could be a 4.5+ kernel (same agfl size), a 32-bit
> > pre-4.5 kernel (same agfl size), or a 64-bit pre-4.5 kernel (small agfl
> > size).
> > 
> 
> I don't think I was clear.. I'm envisioning whether we could come up
> with a patch that would generically fix up the agfl on disk to be sane
> relative to the current kernel. This patch covers the case of a packed
> agfl kernel mounting an unpacked on-disk agfl. It would be nice if we
> could implement something that also handled a packed on-disk agfl to an
> unpacked agfl kernel (for easier backport to unpacked kernels, for
> e.g.).

If we're going to touch an old kernel's source at all I'd rather we
backport both the packing fix and this fixer-upper.

> > > I'm wondering whether using the unmount algorithm in both places (as
> > > noted above) would also facilitate this mechanism to work both ways
> > > (unpacked <-> packed), provided particular conditions are met. I suppose
> > > that could be considered regardless, but the less variance the better
> > > IMO.
> > 
> > In principle I suppose we could have a single function to handle all the
> > rearranging, but I'd rather have two functions to handle the two cases.
> > I'm open to refactoring the common parts out of the
> > xfs_fixup_freelist_wrap_*mount functions.
> > 
> 
> As noted above, I'm not really commenting on the function factoring here
> as much as the use of multiple algorithms (using one would certainly
> facilitate some mechanical refactoring).

TBH the patch you sent is pretty close to what I was imagining.

> > > > +	/* Would this have even passed muster on an old system? */
> > > 
> > > Comment doesn't really explain what is going on here..?
> > 
> > /*
> >  * If the distance between flfirst and fllast mismatches flcount by more
> >  * than 1, then there's more wrong with this agfl than just the padding
> >  * problem.  Bail out completely, which will force the admin to run
> >  * xfs_repair.
> >  */
> > 
> 
> *nod*
> 
> > > 
> > > > +	if (flfirst >= agfl_size - 1 || fllast >= agfl_size - 1 ||
> > > > +	    pag->pagf_flcount > agfl_size - 1)
> > > > +		return -EFSCORRUPTED;
> > > > +
> > > 
> > > I take it these are checks for whether the agfl was previously corrupted
> > > based on the unpacked size? FWIW, it might be a bit more clear to use an
> > > old/unpacked_agfl_size variable to declare intent (here and in some of
> > > the comments and whatnot that follow).
> > 
> > Ok.
> > 
> > > > +	/*
> > > > +	 * Convert a 40-byte-padded agfl into a 36-byte-padded AGFL.
> > > > +	 * Therefore, we need to move the AGFL blocks
> > > > +	 * bno[flfirst..agfl_size - 2] to bno[flfirst + 1...agfl_size - 1].
> > > > +	 *
> > > > +	 * Reusing the example above, if we had flfirst == 116, we need
> > > > +	 * to move bno[116] and bno[117] into bno[117] and bno[118],
> > > > +	 * respectively, and then increment flfirst.
> > > > +	 */
> > > 
> > > Kind of a strange example to use given that it doesn't mention
> > > wrapping.. this only triggers if the agfl was previously wrapped, right?
> > 
> > Right.  I used to have a check that if the list didn't wrap then we
> > would just exit because no fixing is required... but it must've fallen
> > out.  I agree that we don't want to rely on a subtlety here to stay out
> > of here if the agfl wasn't initially wrapped.
> > 
> > > > +	error = xfs_alloc_read_agfl(mp, tp, agno, &agflbp);
> > > > +	if (error)
> > > > +		return error;
> > > > +	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agflbp);
> > > > +
> > > > +	len = (agfl_size - flfirst - 1) * sizeof(xfs_agblock_t);
> > > > +	memmove(&agfl_bno[flfirst + 1], &agfl_bno[flfirst], len);
> > > > +	offset = (char *)&agfl_bno[flfirst + 1] - (char *)agflbp->b_addr;
> > > > +	be32_add_cpu(&agf->agf_flfirst, 1);
> > > > +
> > > > +	xfs_trans_buf_set_type(tp, agflbp, XFS_BLFT_AGFL_BUF);
> > > > +	xfs_trans_log_buf(tp, agflbp, offset, offset + len - 1);
> > > > +	xfs_trans_brelse(tp, agflbp);
> > > > +	agflbp = NULL;
> > > > +	xfs_alloc_log_agf(tp, agfbp, XFS_AGF_FLFIRST);
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Fix an AGFL that touches the end of the block by moving the first or last
> > > > + * part of the list elsewhere in the AGFL so that old kernels don't trip over
> > > > + * wrapping issues.
> > > > + */
> > > > +STATIC int
> > > > +xfs_fixup_freelist_wrap_unmount(
> > > > +	struct xfs_trans	*tp,
> > > > +	struct xfs_buf		*agfbp,
> > > > +	struct xfs_perag	*pag)
> > > > +{
> > > > +	struct xfs_mount	*mp = tp->t_mountp;
> > > > +	struct xfs_agf		*agf;
> > > > +	struct xfs_buf		*agflbp;
> > > > +	__be32			*agfl_bno;
> > > > +	xfs_agnumber_t		agno;
> > > > +	uint32_t		agfl_size;
> > > > +	uint32_t		flfirst;
> > > > +	uint32_t		fllast;
> > > > +	int			offset;
> > > > +	int			len;
> > > > +	int			error;
> > > > +
> > > > +	agfl_size = xfs_agfl_size(mp);
> > > > +	agf = XFS_BUF_TO_AGF(agfbp);
> > > > +	agno = be32_to_cpu(agf->agf_seqno);
> > > > +	flfirst = be32_to_cpu(agf->agf_flfirst);
> > > > +	fllast = be32_to_cpu(agf->agf_fllast);
> > > > +
> > > > +	/* Empty AGFL?  Make sure we aren't pointing at the end. */
> > > > +	if (pag->pagf_flcount == 0) {
> > > > +		if (flfirst >= agfl_size || fllast >= agfl_size) {
> > > 
> > > When would either field be >= agfl_size? Isn't that where they wrap?
> > 
> > Oops, you're right, that should be flfirst >= agfl_size - 1.
> > 
> > Or really,
> > 
> > old_agfl_size = agfl_size - 1
> > if flfirst >= old_agfl_size or fllast >= old_agfl_size:
> > 	blahblahblah
> > 
> > > 
> > > > +			agf->agf_flfirst = cpu_to_be32(1);
> > > > +			agf->agf_fllast = 0;
> > > > +			xfs_alloc_log_agf(tp, agfbp,
> > > > +					XFS_AGF_FLFIRST | XFS_AGF_FLLAST);
> > > > +		}
> > > > +		return 0;
> > > > +	}
> > > > +
> > > > +	/* If we don't hit the end, we're done. */
> > > > +	if (flfirst < fllast && fllast != agfl_size - 1)
> > > > +		return 0;
> > > > +
> > > > +	/*
> > > > +	 * Move a start of a wrapped list towards the start of the agfl block.
> > > 
> > > FWIW, this and the subsequent comments kind of gave me the impression
> > > that the agfl would be "shifted" towards the start of the block. The
> > > code doesn't do that, but rather rotates the start/head of the agfl to
> > > the tail and adjusts the pointers (changing the effective order). That
> > > seems technically Ok, but I had to grok the code and read back to
> > > understand the comment rather than the other way around.
> > 
> > Yeah.  Sorry about the confusion, maybe it would be better if I took out
> > these confusing comments and replaced it all with the ascii art above?
> > 
> 
> I'm still hoping we can do something a bit more simple and flexible in
> general, but that aside, the ascii art describes the associated
> algorithms much better.
> 
> > > > +	 * Therefore, we need to move the AGFL blocks
> > > > +	 * bno[flfirst..agfl_size - 1] to bno[fllast + 1...agfl_size - flfirst].
> > > > +	 * Then we reset flfirst and fllast appropriately.
> > > > +	 *
> > > > +	 * Reusing the example above, if we had flfirst == 117 and fllast == 4,
> > > > +	 * we need to move bno[117] and bno[118] into bno[5] and bno[6],
> > > > +	 * respectively, and then reset flfirst and fllast.
> > > > +	 *
> > > > +	 * If it's just the last block that touches the end, only move that.
> > > > +	 */
> > > > +	error = xfs_alloc_read_agfl(mp, tp, agno, &agflbp);
> > > > +	if (error)
> > > > +		return error;
> > > > +	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agflbp);
> > > > +
> > > > +	if (fllast == agfl_size - 1) {
> > > > +		/* Back the AGFL off from the end of the block. */
> > > > +		len = sizeof(xfs_agblock_t);
> > > > +		agfl_bno[flfirst - 1] = agfl_bno[agfl_size - 1];
> > > 
> > > What if the agfl is full (flfirst == 0)?
> > 
> > I don't think we ever unmount with a full agfl -- we're guaranteed at
> > least 118 items in the AGFL, and the maximum size the agfl will ever
> > need is 8 blocks for each of bnobt, cntbt, rmapbt.
> > 
> > However, we could just detect a full agfl and free the end block.
> > 
> 
> I suppose it's unlikely we'll ever see corner cases such as a completely
> full or empty agfl. That said, I think the code needs to handle those
> cases sanely. That could mean explicit checks and skipping the fixup for
> all I care, just as long as we don't have memory corruption vectors and
> whatnot if (when?) assumptions prove wrong.
> 
> > > > +		offset = (char *)&agfl_bno[flfirst - 1] - (char *)agflbp->b_addr;
> > > > +		be32_add_cpu(&agf->agf_fllast, -1);
> > > > +		be32_add_cpu(&agf->agf_flfirst, -1);
> > > > +	} else {
> > 
> > This needs to check for flfirst > fllast.
> > 
> > > > +		/* Move the first part of the AGFL towards the front. */
> > > > +		len = (agfl_size - flfirst) * sizeof(xfs_agblock_t);
> > > > +		memcpy(&agfl_bno[fllast + 1], &agfl_bno[flfirst], len);
> > > > +		offset = (char *)&agfl_bno[fllast + 1] - (char *)agflbp->b_addr;
> > > > +		agf->agf_flfirst = 0;
> > > > +		agf->agf_fllast = cpu_to_be32(pag->pagf_flcount - 1);
> > > 
> > > Similar question here as above.. it looks like the copy may be a no-op,
> > > but we still reset flfirst/fllast (which may also be fine, but
> > > unnecessary).
> > > 
> > > The more interesting case here is flcount == 1 and flfirst == fllast,
> > > where it looks like we'd wrongly reset first/last (and perhaps copy more
> > > than we should..?).
> > > 
> > > Ugh. Remind me again why we can't just detect a mismatch, fail the mount
> > > and tell the user to repair? :P
> > 
> > Not all the distros included xfs_repair in the initrd or the logic to
> > call it (as opposed to "fsck -y") if the root fs fails to mount, which
> > means that if this wrapping problem hits a root filesystem then the
> > administrator will have to reboot with a recovery iso and run xfs_repair
> > from there.  That's going to generate a /lot/ of support work compared
> > to us figuring out how to repair the filesystems automatically.
> > 
> 
> Yeah, sorry. Good argument for the unpacked -> packed fixup.
> 
> > > Random thought... have we considered anything less invasive/more
> > > simple for this issue than low level buffer manipulation? For example,
> > > could we address the unpacked -> packed case by just allocating a
> > > block to fill the gap at the end of the wrapped agfl?
> > 
> > The allocation could fail, in which case we'd be forced to resort to
> > list manipulation.
> > 
> 
> Yeah.. obvious tradeoff, but how likely is that? It seems a reasonable
> tradeoff to me given there's still the repair path. Also note that we
> don't have to necessarily allocate from the btrees.
> 
> (FWIW, I hacked around on this a bit and ended up with something
> slightly different[1].)
> 
> > > Perhaps that could even be done more discreetly in
> > > xfs_alloc_fix_freelist() when we already have a transaction set up,
> > > etc.
> > 
> > Possible, though this will add more logic to every hot allocation call
> > for something that only needs fixing at mount and at unmount.  I think
> > there's a stronger argument for doing the unmount cleanup on every
> > fix_freelist since that would eliminate the small chance that a new
> > kernel crashes and gets recovered on an old kernel.
> > 
> 
> Eh, I don't think that's anything we couldn't address if need be.  E.g.,
> amortize the padding fixup against actual allocations, set/test a pag
> bit to track the padding fixup, or check it on agf read and set a "needs
> fixup" bit, etc.
> 
> > > That wouldn't help going the other direction, but maybe we could
> > > consider similar logic to free an extraneous block and then worry
> > > about protecting users who won't upgrade broken kernels separately if
> > > that really continues to be a problem.
> > 
> > It is already a problem here.
> > 
> 
> Then it should be able to stand alone as an independent patch (with
> independent commit log that explains/justifies it, etc.).
> 
> Brian
> 
> [1] This is hacky and brute force, but the idea was to try and do
> something that could work in either direction and reuse more existing
> code. It basically just reads the on-disk entries "manually" and
> reinserts them into a reinitialized (empty) agfl. Perhaps it could be
> refactored into something that only rotates a given number of entries or
> something of that nature. Anyways, just an experiment and probably
> broken..

I'll give this a whirl and see where we end up.

--D

> int
> xfs_agfl_fixup(
> 	struct xfs_trans	*tp,
> 	struct xfs_buf		*agbp,
> 	struct xfs_perag	*pag)
> {
> 	struct xfs_mount	*mp = tp->t_mountp;
> 	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agbp);
> 	int			agfl_size = XFS_AGFL_SIZE(mp);
> 	int			ofirst, olast, osize;
> 	int			nfirst, nlast;
> 	struct xfs_buf		*agflbp;
> 	__be32			*agfl_bno;
> 	int			active;
> 	xfs_agblock_t		bno;
> 	bool			wasempty = false;
> 
> 	if (pag->pagf_flcount == 0)
> 		wasempty = true;
> 
> 	ofirst = be32_to_cpu(agf->agf_flfirst);
> 	olast = be32_to_cpu(agf->agf_fllast);
> 	if (olast >= ofirst)
> 		active = olast - ofirst + 1;
> 	else
> 		active = agfl_size - ofirst + olast + 1;
> 
> 	if (active == pag->pagf_flcount + 1)
> 		osize = agfl_size - 1;
> 	else if ((active == pag->pagf_flcount - 1) ||
> 		 ofirst == agfl_size || olast == agfl_size)
> 		osize = agfl_size + 1;
> 	else if (active == pag->pagf_flcount)
> 		osize = agfl_size;
> 	else
> 		return -EFSCORRUPTED;
> 
> 	xfs_alloc_read_agfl(mp, tp, be32_to_cpu(agf->agf_seqno), &agflbp);
> 	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agflbp);
> 
> 	nlast = do_mod(olast, agfl_size);
> 	nfirst = nlast + 1;
> 	nfirst = do_mod(nfirst, agfl_size);
> 	ASSERT(nfirst != ofirst);
> 
> 	agf->agf_flfirst = cpu_to_be32(nfirst);
> 	agf->agf_fllast = cpu_to_be32(nlast);
> 	agf->agf_flcount = 0;
> 	xfs_alloc_log_agf(tp, agbp, XFS_AGF_FLFIRST | XFS_AGF_FLLAST |
> 				    XFS_AGF_FLCOUNT);
> 	pag->pagf_flcount = 0;
> 
> 	if (wasempty)
> 		goto out;
> 
> 	while (true) {
> 		ofirst = do_mod(ofirst, osize);
> 		bno = be32_to_cpu(agfl_bno[ofirst]);
> 		xfs_alloc_put_freelist(tp, agbp, agflbp, bno, 0);
> 		if (ofirst == olast)
> 			break;
> 		ofirst++;
> 	}
> 
> out:
> 	xfs_trans_brelse(tp, agflbp);
> 	return 0;
> }
> 