On Thu, Mar 01, 2018 at 12:28:33PM -0500, Brian Foster wrote:
> On Wed, Feb 28, 2018 at 03:20:32PM -0800, Darrick J. Wong wrote:
> > On Wed, Feb 28, 2018 at 05:43:51PM -0500, Brian Foster wrote:
> > > On Tue, Feb 27, 2018 at 01:03:13PM -0800, Darrick J. Wong wrote:
> > > > On Tue, Feb 27, 2018 at 02:35:49PM -0500, Brian Foster wrote:
> > > > > On Thu, Feb 22, 2018 at 06:00:15PM -0800, Darrick J. Wong wrote:
> > > > > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> ...
> > > > > > diff --git a/fs/xfs/xfs_fixups.c b/fs/xfs/xfs_fixups.c
> > > > > > new file mode 100644
> > > > > > index 0000000..0cad7bb
> > > > > > --- /dev/null
> > > > > > +++ b/fs/xfs/xfs_fixups.c
> > > > > > @@ -0,0 +1,310 @@
> ...
> > > > Now let's say that we run for a while and want to unmount, but our AGFL
> > > > wraps like this:
> > > > 
> > > >            0   1   2   3   4   5   6   7   8   9
> > > > Header   | T | U | V |   |   |   |   | Q | R | S |
> > > >                    ^-- fllast               ^-- flfirst
> > > > 
> > > > We don't know that the next kernel to mount this filesystem will have
> > > > the "4.5 agfl padding fix" applied to it; if it does not, it will flag
> > > > the AGF as corrupted because flcount is 6 but in its view the distance
> > > > between flfirst and fllast (which omits bno[9]) is 5.  We don't want
> > > > it to choke on that, so the unmount fixer moves all the records between
> > > > flfirst and the end (Q, R, and S) towards the start of the block and
> > > > resets flfirst/fllast:
> > > > 
> > > >            0   1   2   3   4   5   6   7   8   9
> > > > Header   | T | U | V | Q | R | S |   |   |   |   |
> > > >            ^-- flfirst            ^-- fllast
> > > > 
> > > 
> > > Could we use this basic algorithm at mount time as well? I know it
> > > wouldn't be _exactly_ the same operation at mount time as it is for
> > > unmount since slot 9 is a gap in the former case, but afaict the only
> > > difference from an algorithmic perspective is the length of the shift.
> > > 
> > > IOW, if we were to parameterize the length of the shift and have the
> > > mount fixup call the unmount algorithm, would it not also address the
> > > problem?
> > 
> > Yes, I believe that would work.  I think it would be more efficient in
> > the patch below to memmove the entries instead of put_freelist'ing them
> > individually.
> > 
> > If the agfl is completely full then fix_freelist ought to trim it down
> > by at least one element.  The algorithm then becomes:
> > 
> > if mounting and flfirst < fllast:
> >         return 0
> > if flcount == agfl_size:
> >         assert !mounting
> >         fix_freelist()
> 
> Not sure I follow where the fix_freelist() came in, but I think we need
> to be careful here. What if flfirst == 118 and that slot is garbage?
> Don't we need to fix up the agfl before we can allow any traditional
> operations to proceed?

This got confusing, so you might notice in the v2 patch that I separated
this out...

> > assert flcount < agfl_size
> > if flfirst < fllast:
> >         return 0
> > movelen = agfl_size - flfirst
> > if active == flcount - 1:
> >         movelen--
> > memmove(&agflbno[fllast + 1], &agflbno[flfirst], movelen)
> > flfirst = 0
> > fllast = fllast + movelen
> > 
> ...
> > > > > 
> > > > > So we're not attempting to cover the case where the agfl has 1 more
> > > > > block than the agfl size (i.e., the case where an fs goes back to a
> > > > > kernel with an unpacked header)?
> > > > 
> > > > We don't know how the next kernel to touch this filesystem will define
> > > > XFS_AGFL_SIZE -- it could be a 4.5+ kernel (same agfl size), a 32-bit
> > > > pre-4.5 kernel (same agfl size), or a 64-bit pre-4.5 kernel (small agfl
> > > > size).
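
To make the memmove flattening quoted above easier to poke at, here's a
quick userspace toy that runs it against the 10-slot example from the
diagrams.  Every name in it is invented for illustration -- this is not
the v2 patch, and it skips the empty and completely-full agfl cases
that the pseudocode hands off to fix_freelist:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define AGFL_SIZE       10      /* toy slot count; the real sizes are 118/119 */

static void
agfl_flatten(
        uint32_t        *bno,
        uint32_t        *flfirst,
        uint32_t        *fllast,
        uint32_t        flcount)
{
        uint32_t        active;
        uint32_t        movelen;

        /* this sketch ignores the empty and completely-full cases */
        if (flcount == 0 || flcount >= AGFL_SIZE)
                return;

        /* list doesn't wrap, so it's already contiguous */
        if (*flfirst <= *fllast)
                return;

        /* distance-based count; one less than flcount means a gap slot */
        active = AGFL_SIZE - *flfirst + *fllast + 1;

        movelen = AGFL_SIZE - *flfirst;
        if (active == flcount - 1)
                movelen--;      /* don't copy the unpacked writer's garbage slot */

        /*
         * Slide the wrapped tail to just after the unwrapped head.
         * memmove, not memcpy, because the two ranges can overlap.
         */
        memmove(&bno[*fllast + 1], &bno[*flfirst],
                movelen * sizeof(bno[0]));
        *flfirst = 0;
        *fllast += movelen;

        /* clear the vacated tail slots, purely so the demo prints cleanly */
        memset(&bno[*fllast + 1], 0,
               (AGFL_SIZE - *fllast - 1) * sizeof(bno[0]));
}

int
main(void)
{
        /* Header   | T | U | V |   |   |   |   | Q | R | S | */
        uint32_t        bno[AGFL_SIZE] =
                        { 'T', 'U', 'V', 0, 0, 0, 0, 'Q', 'R', 'S' };
        uint32_t        flfirst = 7, fllast = 2, flcount = 6;
        int             i;

        agfl_flatten(bno, &flfirst, &fllast, flcount);

        for (i = 0; i < AGFL_SIZE; i++)
                printf("%c ", bno[i] ? (char)bno[i] : '.');
        printf(" flfirst=%u fllast=%u\n", flfirst, fllast);
        return 0;
}

Running that prints "T U V Q R S . . . .  flfirst=0 fllast=5", which is
exactly the second diagram above.
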
> > > > 
> > > 
> > > I don't think I was clear.. I'm envisioning whether we could come up
> > > with a patch that would generically fix up the agfl on disk to be sane
> > > relative to the current kernel. This patch covers the case of a packed
> > > agfl kernel mounting an unpacked on-disk agfl. It would be nice if we
> > > could implement something that also handled a packed on-disk agfl to an
> > > unpacked agfl kernel (for easier backport to unpacked kernels, for
> > > e.g.).
> > 
> > If we're going to touch an old kernel's source at all I'd rather we
> > backport both the packing fix and this fixer-upper.
> > 
> 
> Not sure I parse... the "old kernel" is essentially the rhel example
> where we apparently have deliberately maintained the unpacked format to
> avoid this incompatibility problem. If we had a patch that generically
> converted on-disk format (packed or unpacked) to the current kernel
> (packed or unpacked), we could merge that patch to upstream, stable as
> well as distro kernels that might not include the agfl packing fix and
> eliminate compatibility problems between them (even if the packing fix
> comes in out of order).

...I wonder if we've gotten to the 'talking past each other' point, but
let me try one more time.

Going forward, I want the number of unpacked kernels to decrease as
quickly as possible. I understand that distro kernel maintainers are
not willing to apply the packing patch to their kernels until we come
up with a smooth transition path. I don't want to support fixing agfls
to be 118 units long on 64-bit unpacked kernels and 119 units long on
32-bit unpacked kernels; I only want to support the packed kernels with
their 119-unit-long agfls.

An AGFL that starts at 0 and ends at flcount-1 is compatible with both
packed and unpacked kernels, so the v2 patch I sent last night removes
the delicate per-case surgery in favor of a new strategy where the
mount time and unmount time helpers both look for agfl configurations
that are known to cause problems, and solve them all the same way: by
moving the agfl list towards the start of the block.

The fixup patch requires that the kernel already include the packing
fix. Its goals are:

a) At mount time, ensure that all AGFLs which are not compatible with a
   packed kernel are made compatible with a packed kernel.

b) At unmount time, ensure that all AGFLs are compatible with both
   packed kernels and unpacked kernels.

Therefore, the first step for a kernel maintainer is to ensure that the
packing fix is applied (which it will be for any 4.5+ kernel). The
second step is to apply this fixer patch.

--D

> Otherwise, we need a separate unpacked -> packed fixup for packed
> kernels (i.e., this patch) and a packed -> unpacked fixup for unpacked
> kernels, and to make sure they are used in the right places. Trying to
> see if we could avoid this kind of dependency matrix was one of the
> objectives around the hack I posted previously. I'm not married to
> that particular implementation, but I'm much less concerned about
> inefficiency (even if it dictates a mount time fixup over a runtime
> one) in comparison to simplicity and flexibility. Perhaps we can
> accomplish something similarly flexible via direct buffer
> manipulation..?
> 
> FWIW, I've appended another variant of the previous hack that is less
> brute force, but I think is still able to convert back and forth. The
> tradeoff is essentially that it no longer uses the same generic
> algorithm.
> 
> Brian
> 
> --- 8< ---
> 
> static int
> xfs_agfl_ondisk_size(
>         struct xfs_mount        *mp,
>         int                     first,
>         int                     last,
>         int                     flcount)
> {
>         int                     active;
>         int                     size;
>         int                     agfl_size = XFS_AGFL_SIZE(mp);
> 
>         if (last >= first)
>                 active = last - first + 1;
>         else
>                 active = agfl_size - first + last + 1;
> 
>         if (active == flcount + 1)
>                 size = agfl_size - 1;
>         else if ((active == flcount - 1) ||
>                  first == agfl_size || last == agfl_size)
>                 size = agfl_size + 1;
>         else if (active == flcount)
>                 size = agfl_size;
>         else
>                 return -EFSCORRUPTED;
> 
>         return size;
> }
> 
> int
> xfs_agfl_fixup(
>         struct xfs_trans        *tp,
>         struct xfs_buf          *agbp,
>         struct xfs_perag        *pag)
> {
>         struct xfs_mount        *mp = tp->t_mountp;
>         struct xfs_agf          *agf = XFS_BUF_TO_AGF(agbp);
>         int                     agfl_size = XFS_AGFL_SIZE(mp);
>         int                     ofirst, olast, osize;
>         int                     nfirst, nlast;
>         struct xfs_buf          *agflbp;
>         __be32                  *agfl_bno;
>         xfs_agblock_t           bno = -1;
>         int                     tidx = -1;
>         bool                    empty = false;
>         int                     logflags = 0;
>         int                     error;
> 
>         ofirst = nfirst = be32_to_cpu(agf->agf_flfirst);
>         olast = nlast = be32_to_cpu(agf->agf_fllast);
>         osize = xfs_agfl_ondisk_size(mp, ofirst, olast, pag->pagf_flcount);
>         if (osize < 0)
>                 return osize;
>         if (pag->pagf_flcount == 0)
>                 empty = true;
> 
>         /* sizes match, nothing to do */
>         if (osize == agfl_size)
>                 return 0;
> 
>         /* size mismatch, read the agfl.. */
>         error = xfs_alloc_read_agfl(mp, tp, be32_to_cpu(agf->agf_seqno),
>                                     &agflbp);
>         if (error)
>                 return error;
>         agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agflbp);
> 
>         /*
>          * If the on-disk agfl is smaller than what the kernel expects, the
>          * last slot of the on-disk agfl is a gap with bogus data. Allocate
>          * the first valid block from the agfl, manually place it in the
>          * gap and fix up the count.
>          */
>         if (osize < agfl_size) {
>                 ASSERT(!empty);
>                 error = xfs_alloc_get_freelist(tp, agbp, &bno, 0);
>                 if (error)
>                         goto out_relse;
> 
>                 pag->pagf_flcount++;
>                 be32_add_cpu(&agf->agf_flcount, 1);
>                 logflags |= XFS_AGF_FLCOUNT;
>                 tidx = agfl_size - 1;
>                 goto done;
>         }
> 
>         /*
>          * Otherwise, the on-disk agfl is larger than what the current
>          * kernel can manage. If the agfl was empty, we just fix up the
>          * first and last pointers. If not, move the inaccessible block in
>          * the last slot to the next valid, open slot.
>          */
>         nfirst = do_mod(nfirst, agfl_size);
>         if (empty) {
>                 nlast = (nfirst == 0 ? agfl_size - 1 : nfirst - 1);
>                 goto done;
>         }
>         if (nlast != agfl_size)
>                 nlast++;
>         nlast = do_mod(nlast, agfl_size);
>         tidx = nlast;
>         bno = be32_to_cpu(agfl_bno[osize - 1]);
> 
> done:
>         if (nfirst != ofirst) {
>                 agf->agf_flfirst = cpu_to_be32(nfirst);
>                 logflags |= XFS_AGF_FLFIRST;
>         }
>         if (nlast != olast) {
>                 agf->agf_fllast = cpu_to_be32(nlast);
>                 logflags |= XFS_AGF_FLLAST;
>         }
>         if (bno != -1) {
>                 int     startoff;
> 
>                 agfl_bno[tidx] = cpu_to_be32(bno);
>                 xfs_trans_buf_set_type(tp, agflbp, XFS_BLFT_AGFL_BUF);
>                 startoff = (char *) &agfl_bno[tidx] - (char *) agflbp->b_addr;
>                 xfs_trans_log_buf(tp, agflbp, startoff,
>                                   startoff + sizeof(xfs_agblock_t) - 1);
>         }
>         if (logflags)
>                 xfs_alloc_log_agf(tp, agbp, logflags);
> 
> out_relse:
>         xfs_trans_brelse(tp, agflbp);
>         return error;
> }
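
As a cross-check on the size-detection arithmetic in
xfs_agfl_ondisk_size() above, here's a standalone userspace
re-derivation run against the 10-slot example from earlier in the
thread.  Toy code with invented names, not kernel code:

#include <assert.h>
#include <stdio.h>

static int
ondisk_size(int first, int last, int flcount, int agfl_size)
{
        int     active;

        /* distance between first and last, accounting for wraparound */
        if (last >= first)
                active = last - first + 1;
        else
                active = agfl_size - first + last + 1;

        if (active == flcount + 1)
                return agfl_size - 1;   /* writer had one slot fewer than us */
        if (active == flcount - 1 || first == agfl_size || last == agfl_size)
                return agfl_size + 1;   /* writer had one slot more than us */
        if (active == flcount)
                return agfl_size;       /* writer agrees with us */
        return -1;                      /* anything else looks corrupt */
}

int
main(void)
{
        /*
         * The wrapped example from the diagrams: flfirst=7, fllast=2,
         * flcount=6.  A 10-slot kernel and a 9-slot kernel both conclude
         * that the on-disk agfl has 10 slots.
         */
        assert(ondisk_size(7, 2, 6, 10) == 10);
        assert(ondisk_size(7, 2, 6, 9) == 10);

        /*
         * After flattening to flfirst=0/fllast=5, each kernel computes
         * active == flcount and sees the agfl as its own native size.
         */
        assert(ondisk_size(0, 5, 6, 10) == 10);
        assert(ondisk_size(0, 5, 6, 9) == 9);

        printf("all checks passed\n");
        return 0;
}

That last pair is the property the v2 patch leans on: a list that
starts at slot 0 and ends at flcount-1 satisfies the sanity checks of
packed and unpacked kernels alike.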