Re: [RFC-PATCH] xfs: do not update sunit/swidth in the superblock to match those provided during mount

Alex Lyakas <alex@xxxxxxxxxx> · Tue, 26 Nov 2019 10:49:22 +0200

Hi Brian,

On Mon, Nov 25, 2019 at 3:07 PM Brian Foster <bfoster@xxxxxxxxxx> wrote:
>
> On Sun, Nov 24, 2019 at 11:13:09AM +0200, Alex Lyakas wrote:
> > Hi Brian,
> >
> > Thank you for your response.
> >
> > On Fri, Nov 22, 2019 at 5:43 PM Brian Foster <bfoster@xxxxxxxxxx> wrote:
> > >
> > > On Thu, Nov 21, 2019 at 08:08:19PM +0200, Alex Lyakas wrote:
> > > > We are hitting the following issue: if XFS is mounted with sunit/swidth different from those
> > > > specified during mkfs, then xfs_repair reports false corruption and eventually segfaults.
> > > >
> > > > Example:
> > > >
> > > > # mkfs
> > > > mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d sunit=64,swidth=64 -l sunit=32 /dev/vda
> > > >
> > > > #mount with a different sunit/swidth:
> > > > mount -onoatime,sync,nouuid,sunit=32,swidth=32 /dev/vda /mnt/xfs
> > > >
> > >
> > > FYI, I couldn't reproduce this at first because sparse inodes is enabled
> > > by default and that introduces more strict inode alignment requirements.
> > > I'm assuming that sparse inodes is disabled in your example, but it
> > > would be more helpful if you included the exact configuration and mkfs
> > > output in such reports.
> > Providing more details about configuration:
> >
> > kernel: 4.14.99
> > xfsprogs: 4.9.0+nmu1ubuntu2
> > content of /etc/zadara/xfs.protofile is captured in [1]
> > mkfs output is captured in [2]
> >
> > >
> > > > #umount
> > > > umount /mnt/xfs
> > > >
> > > ...
> > > >
> > > > Looking at the kernel code of XFS, there seems to be no need to update the superblock sunit/swidth if the mount-provided sunit/swidth are different.
> > > > The superblock values are not used during runtime.
> > > >
> > >
> > > I'm not really sure what the right answer is here. On one hand, this
> > > patch seems fundamentally reasonable to me. I find it kind of odd for
> > > mount options to override and persist configuration set in the
> > > superblock like this. OTOH, this changes a historical behavior which may
> > > (or may not) cause disruption for users. I also think it's somewhat
> > > unfortunate to change kernel mount option behavior to accommodate
> > > repair,
> > This is very critical for us to have a working repair in production. I
> > presume the same is true for most users.
> >
>
> Of course.
>
> > > but I think the mount option behavior being odd argument stands
> > > on its own regardless.
> > >
> > > What is your actual use case for changing the stripe unit/width at mount
> > > time like this?
> > Our use case is like this: during mkfs the overall system does not
> > know yet the exact sunit/swidth to be used. Also, the underlying
> > storage can change its sunit/swidth alignment as part of some storage
> > migration scenarios. During mount we already know the proper
> > sunit/swidth. But the problem is that in order to specify sunit/swidth
> > during mount, XFS superblock must be marked as "supporting data
> > alignment", i.e., XFS_SB_VERSION_DALIGNBIT has to be set. Otherwise,
> > XFS refuses to mount and says:
> >
> >             xfs_warn(mp,
> >     "cannot change alignment: superblock does not support data alignment");
> >             return -EINVAL;
> >
> > In order for the superblock to be marked like this, *some*
> > sunit/swidth need to be specified during mkfs.
> >
>
> What do you mean by "the overall system doesn't know the stripe unit
> values" at mkfs time? Shouldn't these values originate from the
> underlying storage? Is there some additional configuration at play in
> this particular use case?
I checked the code, and it turns out that the only reason for
specifying sunit/swidth during mkfs is to mark the superblock as
"supporting data alignment". The real sunit/swidth is always
determined during mount, based on the underlying storage. The
significant property of the storage is the deduplication granularity;
we saw that we get best deduplication results when XFS sunit/swidth
are aligned to deduplication granularity. During storage migrations,
an XFS share can be migrated to a storage with different deduplication
parameters; hence we need to be able to update the runtime
sunit/swidth.

Thanks,
Alex.

>
> The migration use case certainly sounds valid, though my understanding
> is that if an existing stripe configuration is in place, it's usually
> best for the new storage configuration to take that into consideration
> (and use some multiple of stripe unit or something, for example).
>
> > >
> > > > With the suggested patch, xfs repair is working properly also when mount-provided sunit/swidth are different.
> > > >
> > > > However, I am not sure whether this is the proper approach. Otherwise, should we not allow specifying different sunit/swidth during mount?
> > > >
> > > ...
> > > >
> > > > Signed-off-by: Alex Lyakas <alex@xxxxxxxxxx>
> > > > ---
> > > >  fs/xfs/xfs_mount.c | 18 ++++++------------
> > > >  1 file changed, 6 insertions(+), 12 deletions(-)
> > > >
> > > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > > index ba5b6f3..e8263b4 100644
> > > > --- a/fs/xfs/xfs_mount.c
> > > > +++ b/fs/xfs/xfs_mount.c
> > > > @@ -399,19 +399,13 @@
> > > >               }
> > > >
> > > >               /*
> > > > -              * Update superblock with new values
> > > > -              * and log changes
> > > > +              * If sunit/swidth specified during mount do not match
> > > > +              * those in the superblock, use the mount-specified values,
> > > > +              * but do not update the superblock.
> > > > +              * Otherwise, xfs_repair reports false corruption.
> > > > +              * Here, only verify that superblock supports data alignment.
> > > >                */
> > > > -             if (xfs_sb_version_hasdalign(sbp)) {
> > > > -                     if (sbp->sb_unit != mp->m_dalign) {
> > > > -                             sbp->sb_unit = mp->m_dalign;
> > > > -                             mp->m_update_sb = true;
> > > > -                     }
> > > > -                     if (sbp->sb_width != mp->m_swidth) {
> > > > -                             sbp->sb_width = mp->m_swidth;
> > > > -                             mp->m_update_sb = true;
> > > > -                     }
> > > > -             } else {
> > > > +             if (!xfs_sb_version_hasdalign(sbp)) {
> > >
> > > Would this change xfs_info behavior on a filesystem mounted with
> > > different runtime fields from the superblock? I haven't tested it, but
> > > it looks like we pull the fields from the superblock.
> > xfs_info uses XFS_IOC_FSGEOMETRY to get the values, and this pulls the
> > values from the run-time copy of the superblock:
> >
> > int
> > xfs_fs_geometry(
> >     xfs_mount_t        *mp,
> >     xfs_fsop_geom_t        *geo,
> >     int            new_version)
> > ...
> >     if (new_version >= 2) {
> >         geo->sunit = mp->m_sb.sb_unit;
> >         geo->swidth = mp->m_sb.sb_width;
> >     }
> >
> > So if during mount we have updated the superblock, we will pull the
> > updated values. If we do not update (as the proposed patch), we will
> > report the values stored in the superblock. Perhaps we need to update
> > the geomtery ioctl to report runtime values?
> >
>
> Ok, thanks. I'd argue that we should return the runtime results if we
> took the proposed approach, but we should probably settle on a solution
> first..
>
> Brian
>
> > Thanks,
> > Alex.
> >
> > >
> > > Brian
> > >
> > > >                       xfs_warn(mp,
> > > >       "cannot change alignment: superblock does not support data alignment");
> > > >                       return -EINVAL;
> > > > --
> > > > 1.9.1
> > > >
> > >
> >
> > [1]
> > root@vc-20-01-48-dev:~# cat /etc/zadara/xfs.protofile
> > dummy                   : bootfilename, not used, backward compatibility
> > 0 0                             : numbers of blocks and inodes, not
> > used, backward compatibility
> > d--777 0 0              : set 777 perms for the root dir
> > $
> > $
> >
> > [2]
> > root@vc-20-01-48-dev:~# mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d
> > sunit=64,swidth=64 -l sunit=32 /dev/vda
> > meta-data=/dev/vda               isize=512    agcount=16, agsize=163832 blks
> >          =                       sectsz=512   attr=2, projid32bit=1
> >          =                       crc=1        finobt=1, sparse=0,
> > rmapbt=0, reflink=0
> > data     =                       bsize=4096   blocks=2621312, imaxpct=25
> >          =                       sunit=8      swidth=8 blks
> > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > log      =internal log           bsize=4096   blocks=2560, version=2
> >          =                       sectsz=512   sunit=4 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> >
>