Hi Brian, Thank you for your response. On Tue, Nov 26, 2019 at 1:54 PM Brian Foster <bfoster@xxxxxxxxxx> wrote: > > On Tue, Nov 26, 2019 at 10:49:22AM +0200, Alex Lyakas wrote: > > Hi Brian, > > > > On Mon, Nov 25, 2019 at 3:07 PM Brian Foster <bfoster@xxxxxxxxxx> wrote: > > > > > > On Sun, Nov 24, 2019 at 11:13:09AM +0200, Alex Lyakas wrote: > > > > Hi Brian, > > > > > > > > Thank you for your response. > > > > > > > > On Fri, Nov 22, 2019 at 5:43 PM Brian Foster <bfoster@xxxxxxxxxx> wrote: > > > > > > > > > > On Thu, Nov 21, 2019 at 08:08:19PM +0200, Alex Lyakas wrote: > > > > > > We are hitting the following issue: if XFS is mounted with sunit/swidth different from those > > > > > > specified during mkfs, then xfs_repair reports false corruption and eventually segfaults. > > > > > > > > > > > > Example: > > > > > > > > > > > > # mkfs > > > > > > mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d sunit=64,swidth=64 -l sunit=32 /dev/vda > > > > > > > > > > > > #mount with a different sunit/swidth: > > > > > > mount -onoatime,sync,nouuid,sunit=32,swidth=32 /dev/vda /mnt/xfs > > > > > > > > > > > > > > > > FYI, I couldn't reproduce this at first because sparse inodes is enabled > > > > > by default and that introduces more strict inode alignment requirements. > > > > > I'm assuming that sparse inodes is disabled in your example, but it > > > > > would be more helpful if you included the exact configuration and mkfs > > > > > output in such reports. > > > > Providing more details about configuration: > > > > > > > > kernel: 4.14.99 > > > > xfsprogs: 4.9.0+nmu1ubuntu2 > > > > content of /etc/zadara/xfs.protofile is captured in [1] > > > > mkfs output is captured in [2] > > > > > > > > > > > > > > > #umount > > > > > > umount /mnt/xfs > > > > > > > > > > > ... > > > > > > > > > > > > Looking at the kernel code of XFS, there seems to be no need to update the superblock sunit/swidth if the mount-provided sunit/swidth are different. > > > > > > The superblock values are not used during runtime. > > > > > > > > > > > > > > > > I'm not really sure what the right answer is here. On one hand, this > > > > > patch seems fundamentally reasonable to me. I find it kind of odd for > > > > > mount options to override and persist configuration set in the > > > > > superblock like this. OTOH, this changes a historical behavior which may > > > > > (or may not) cause disruption for users. I also think it's somewhat > > > > > unfortunate to change kernel mount option behavior to accommodate > > > > > repair, > > > > This is very critical for us to have a working repair in production. I > > > > presume the same is true for most users. > > > > > > > > > > Of course. > > > > > > > > but I think the mount option behavior being odd argument stands > > > > > on its own regardless. > > > > > > > > > > What is your actual use case for changing the stripe unit/width at mount > > > > > time like this? > > > > Our use case is like this: during mkfs the overall system does not > > > > know yet the exact sunit/swidth to be used. Also, the underlying > > > > storage can change its sunit/swidth alignment as part of some storage > > > > migration scenarios. During mount we already know the proper > > > > sunit/swidth. But the problem is that in order to specify sunit/swidth > > > > during mount, XFS superblock must be marked as "supporting data > > > > alignment", i.e., XFS_SB_VERSION_DALIGNBIT has to be set. Otherwise, > > > > XFS refuses to mount and says: > > > > > > > > xfs_warn(mp, > > > > "cannot change alignment: superblock does not support data alignment"); > > > > return -EINVAL; > > > > > > > > In order for the superblock to be marked like this, *some* > > > > sunit/swidth need to be specified during mkfs. > > > > > > > > > > What do you mean by "the overall system doesn't know the stripe unit > > > values" at mkfs time? Shouldn't these values originate from the > > > underlying storage? Is there some additional configuration at play in > > > this particular use case? > > I checked the code, and it turns out that the only reason for > > specifying sunit/swidth during mkfs is to mark the superblock as > > "supporting data alignment". The real sunit/swidth is always > > determined during mount, based on the underlying storage. The > > significant property of the storage is the deduplication granularity; > > we saw that we get best deduplication results when XFS sunit/swidth > > are aligned to deduplication granularity. During storage migrations, > > an XFS share can be migrated to a storage with different deduplication > > parameters; hence we need to be able to update the runtime > > sunit/swidth. > > > > Ok, but that doesn't really answer my question of why can't you set it > at mkfs time? Presumably you have the same access to the storage at mkfs > time as at mount time? Yes, we have the access and we can set the proper sunit/swidth during mkfs. But our thinking was that during mkfs we want only to mark the superblock as "supporting data alignment", which is what specifying sunit/swidth during mkfs does. During mount, we look at the underlying storage config, and pass the proper sunit/swidth. The underlying storage could be different from one that existed at mkfs time. I hope this explains our situation better. >I can't really speak to how your dedup stuff > accounts for dynamic changes in relation to existing data that was > placed based an old stripe unit, etc., so I'll just assume that works > for your purposes in the migration case. > > Note that Dave chimed in with some historical context on IRC: > > ... > 14:31 <dchinner_> keep in mind that the sunit/swidth mount options were only added to work around a CXFS client mount bug that caused the superblock values on disk to get zeroed > 14:31 <sandeen> hah, that rings a bell > 14:32 <sandeen> still, it's been there for /years/ now > 14:32 <dchinner_> it was never intended as a mechanism to actually change the layout > 14:32 <djwong> aha, there's the context i was missing :) > 14:32 <dchinner_> it turned out to be kinda useful for the crazy md raid restripers > 14:32 <dchinner_> but, really, the use it's being put to here is just .... not supportable > 14:34 <sandeen> usecase aside, it seems that if the kernel code allows you to rewrite it, xfs_repair shouldn't choke on the results > 14:34 <dchinner_> yes, you can do it, but you get to keep all the broken bits when your crazy storage changes alignments and layouts dynamically. > 14:34 <dchinner_> the kernel doesn't validate it properly > 14:34 <sandeen> ok so something else to fix > 14:35 <dchinner_> so if we are going to say "you can change it" then the kernel must validate that it is compatible with the existing on-disk layout > 14:35 <dchinner_> i.e. root inode location, ag header location, inode cluster size and alignment, etc > ... > Fascinating! >From our perspective, when the underlying storage alignment changes, we are ok for all already-allocated extents to keep their existing alignment. We only want the newly-allocated extents to use the newly-specified sunit/swidth. My understanding is that such filesystem is not "broken bits", as everything is consistent, except different extents may have different alignments. > My takeaway from that is the behavior of updating the superblock from > the mount options could probably be nuked off (as your patch does), but > I'd suggest to get feedback from Eric and Darrick to see whether they > agree or would prefer to maintain existing behavior with proper > validation... Let's wait for their additional feedback, then. Thanks, Alex. > > Brian > > > Thanks, > > Alex. > > > > > > > > > > The migration use case certainly sounds valid, though my understanding > > > is that if an existing stripe configuration is in place, it's usually > > > best for the new storage configuration to take that into consideration > > > (and use some multiple of stripe unit or something, for example). > > > > > > > > > > > > > > With the suggested patch, xfs repair is working properly also when mount-provided sunit/swidth are different. > > > > > > > > > > > > However, I am not sure whether this is the proper approach. Otherwise, should we not allow specifying different sunit/swidth during mount? > > > > > > > > > > > ... > > > > > > > > > > > > Signed-off-by: Alex Lyakas <alex@xxxxxxxxxx> > > > > > > --- > > > > > > fs/xfs/xfs_mount.c | 18 ++++++------------ > > > > > > 1 file changed, 6 insertions(+), 12 deletions(-) > > > > > > > > > > > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c > > > > > > index ba5b6f3..e8263b4 100644 > > > > > > --- a/fs/xfs/xfs_mount.c > > > > > > +++ b/fs/xfs/xfs_mount.c > > > > > > @@ -399,19 +399,13 @@ > > > > > > } > > > > > > > > > > > > /* > > > > > > - * Update superblock with new values > > > > > > - * and log changes > > > > > > + * If sunit/swidth specified during mount do not match > > > > > > + * those in the superblock, use the mount-specified values, > > > > > > + * but do not update the superblock. > > > > > > + * Otherwise, xfs_repair reports false corruption. > > > > > > + * Here, only verify that superblock supports data alignment. > > > > > > */ > > > > > > - if (xfs_sb_version_hasdalign(sbp)) { > > > > > > - if (sbp->sb_unit != mp->m_dalign) { > > > > > > - sbp->sb_unit = mp->m_dalign; > > > > > > - mp->m_update_sb = true; > > > > > > - } > > > > > > - if (sbp->sb_width != mp->m_swidth) { > > > > > > - sbp->sb_width = mp->m_swidth; > > > > > > - mp->m_update_sb = true; > > > > > > - } > > > > > > - } else { > > > > > > + if (!xfs_sb_version_hasdalign(sbp)) { > > > > > > > > > > Would this change xfs_info behavior on a filesystem mounted with > > > > > different runtime fields from the superblock? I haven't tested it, but > > > > > it looks like we pull the fields from the superblock. > > > > xfs_info uses XFS_IOC_FSGEOMETRY to get the values, and this pulls the > > > > values from the run-time copy of the superblock: > > > > > > > > int > > > > xfs_fs_geometry( > > > > xfs_mount_t *mp, > > > > xfs_fsop_geom_t *geo, > > > > int new_version) > > > > ... > > > > if (new_version >= 2) { > > > > geo->sunit = mp->m_sb.sb_unit; > > > > geo->swidth = mp->m_sb.sb_width; > > > > } > > > > > > > > So if during mount we have updated the superblock, we will pull the > > > > updated values. If we do not update (as the proposed patch), we will > > > > report the values stored in the superblock. Perhaps we need to update > > > > the geomtery ioctl to report runtime values? > > > > > > > > > > Ok, thanks. I'd argue that we should return the runtime results if we > > > took the proposed approach, but we should probably settle on a solution > > > first.. > > > > > > Brian > > > > > > > Thanks, > > > > Alex. > > > > > > > > > > > > > > Brian > > > > > > > > > > > xfs_warn(mp, > > > > > > "cannot change alignment: superblock does not support data alignment"); > > > > > > return -EINVAL; > > > > > > -- > > > > > > 1.9.1 > > > > > > > > > > > > > > > > > > > [1] > > > > root@vc-20-01-48-dev:~# cat /etc/zadara/xfs.protofile > > > > dummy : bootfilename, not used, backward compatibility > > > > 0 0 : numbers of blocks and inodes, not > > > > used, backward compatibility > > > > d--777 0 0 : set 777 perms for the root dir > > > > $ > > > > $ > > > > > > > > [2] > > > > root@vc-20-01-48-dev:~# mkfs.xfs -f -K -p /etc/zadara/xfs.protofile -d > > > > sunit=64,swidth=64 -l sunit=32 /dev/vda > > > > meta-data=/dev/vda isize=512 agcount=16, agsize=163832 blks > > > > = sectsz=512 attr=2, projid32bit=1 > > > > = crc=1 finobt=1, sparse=0, > > > > rmapbt=0, reflink=0 > > > > data = bsize=4096 blocks=2621312, imaxpct=25 > > > > = sunit=8 swidth=8 blks > > > > naming =version 2 bsize=4096 ascii-ci=0 ftype=1 > > > > log =internal log bsize=4096 blocks=2560, version=2 > > > > = sectsz=512 sunit=4 blks, lazy-count=1 > > > > realtime =none extsz=4096 blocks=0, rtextents=0 > > > > > > > > > >