On Sun, Dec 16, 2012 at 11:39:53PM +0100, Andras Korn wrote: > Hi, > > I'm a user of xfs and linux-vserver (http://linux-vserver.org/). > > I'd like vserver to work better with xfs (or vice versa) and am trying to > proxy between the two development communities (which in the case of vserver > is not very large). > > vservers are basically chroots on steroids: the host runs a single kernel, > but it isolates processes running in the guests from each other. There is a > "copy on write hardlink breaking" feature that allows you to hardlink files > (such as libc6) of different guests together (so that they only get mapped > into memory once), and have the kernel break the link if the inode is opened > for writing (by creating a copy and returning a FD to the copy). > > This feature relies on inode flags (like the 'immutable' bit). vserver adds > two fields to the inode (the other is used to tag inodes with a context ID). > > The kernel parts work, but xfs_repair breaks such filesystems because it > thinks the flags are invalid. > > I approached David Chinner and Eric Sandeen about this on IRC, and they said > that the first step towards any improvement would be to share with this list > the parts of the vserver patch that affect xfs, so that's what I'm doing > now. > > Please find attached the output of > > filterdiff -i '*xfs*' patch-3.7-vs2.3.5.1.diff Best to include patches in line so they are easy to quote and comment on. The main issue here is that it includes lots of bits that aren't in the mainline kernel, so there are bits that we cannot push into the mainline kernel. What we really need to do is get the bits that change the on-disk format formalised, and then we can ensure that the userspace tools know about these bits and can query/validate them properly. So, what htat means is that we need to sort out bits like: > --- linux-3.7/fs/xfs/xfs_dinode.h 2012-10-04 13:27:44.000000000 +0000 > +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_dinode.h 2012-12-11 15:56:32.000000000 +0000 > @@ -51,7 +51,9 @@ typedef struct xfs_dinode { > __be32 di_nlink; /* number of links to file */ > __be16 di_projid_lo; /* lower part of owner's project id */ > __be16 di_projid_hi; /* higher part owner's project id */ > - __u8 di_pad[6]; /* unused, zeroed space */ > + __u8 di_pad[2]; /* unused, zeroed space */ > + __be16 di_tag; /* context tagging */ > + __be16 di_vflags; /* vserver specific flags */ > __be16 di_flushiter; /* incremented on flush */ > xfs_timestamp_t di_atime; /* time last accessed */ > xfs_timestamp_t di_mtime; /* time last modified */ > @@ -184,6 +186,8 @@ static inline void xfs_dinode_put_rdev(s > #define XFS_DIFLAG_EXTSZINHERIT_BIT 12 /* inherit inode extent size */ > #define XFS_DIFLAG_NODEFRAG_BIT 13 /* do not reorganize/defragment */ > #define XFS_DIFLAG_FILESTREAM_BIT 14 /* use filestream allocator */ > +#define XFS_DIFLAG_IXUNLINK_BIT 15 /* Immutable inver on unlink */ > + > #define XFS_DIFLAG_REALTIME (1 << XFS_DIFLAG_REALTIME_BIT) > #define XFS_DIFLAG_PREALLOC (1 << XFS_DIFLAG_PREALLOC_BIT) > #define XFS_DIFLAG_NEWRTBM (1 << XFS_DIFLAG_NEWRTBM_BIT) > @@ -199,6 +203,7 @@ static inline void xfs_dinode_put_rdev(s > #define XFS_DIFLAG_EXTSZINHERIT (1 << XFS_DIFLAG_EXTSZINHERIT_BIT) > #define XFS_DIFLAG_NODEFRAG (1 << XFS_DIFLAG_NODEFRAG_BIT) > #define XFS_DIFLAG_FILESTREAM (1 << XFS_DIFLAG_FILESTREAM_BIT) > +#define XFS_DIFLAG_IXUNLINK (1 << XFS_DIFLAG_IXUNLINK_BIT) > > #ifdef CONFIG_XFS_RT > #define XFS_IS_REALTIME_INODE(ip) ((ip)->i_d.di_flags & XFS_DIFLAG_REALTIME) > @@ -211,6 +216,10 @@ static inline void xfs_dinode_put_rdev(s > XFS_DIFLAG_IMMUTABLE | XFS_DIFLAG_APPEND | XFS_DIFLAG_SYNC | \ > XFS_DIFLAG_NOATIME | XFS_DIFLAG_NODUMP | XFS_DIFLAG_RTINHERIT | \ > XFS_DIFLAG_PROJINHERIT | XFS_DIFLAG_NOSYMLINKS | XFS_DIFLAG_EXTSIZE | \ > - XFS_DIFLAG_EXTSZINHERIT | XFS_DIFLAG_NODEFRAG | XFS_DIFLAG_FILESTREAM) > + XFS_DIFLAG_EXTSZINHERIT | XFS_DIFLAG_NODEFRAG | XFS_DIFLAG_FILESTREAM | \ > + XFS_DIFLAG_IXUNLINK) > + > +#define XFS_DIVFLAG_BARRIER 0x01 > +#define XFS_DIVFLAG_COW 0x02 > These definitions and on disk format changes, and wrap them in a manner that allows us to support them without issue. The main problem I see is that these bits are specific to vserver so if they are set on a mainline kernel/filesytsem, they should never be set. That means we really need a superblock feature bit to indicate whether these changes are valid or not. A bit issue is this bit: #define XFS_DIFLAG_IXUNLINK_BIT 15 /* Immutable inver on unlink */ Which takes the last bit of the di_flags field. We've been reserving that bit to be used as a "more bits bit" to indicate the presence of an extra flags field in the inode code (similar to the superblock XFS_SB_VERSION_MOREBITSBIT feature bit). That would allow us not to have to use a superblock feature bit to indicate the precense of the new field in the inode. Unfortunately, this bit is taken, so AFAICT the only way we can merge this into the upstream code is to add a superblock feature bit at the same time. But that then makes the upstream code still incompatible with the vserver code, because none of the vserver filesystems will have the feature bit set. That can be worked around (e.g. the vserver patchset includes a new piece of code that sets the feature bit on mount), but it does make it somewhat difficult to cleanly support the extensions that vserver has made... To complicate things further, then new inode format for CRC changes already has a new flags field added to it. Basically, I was not intending to add a new flags field to the existing inode format *ever* as new feature bits woul donly be supported on new format filesystems. I'm not sure what the best way to reconcile this is - I really don't want to have to support 3 separate flags fields, 2 of which are optional... > +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_inode.c 2012-12-11 22:20:23.000000000 +0000 > @@ -16,6 +16,7 @@ > * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA > */ > #include <linux/log2.h> > +#include <linux/vs_tag.h> > > #include "xfs.h" > #include "xfs_fs.h" > @@ -563,15 +564,25 @@ xfs_iformat_btree( > STATIC void > xfs_dinode_from_disk( > xfs_icdinode_t *to, > - xfs_dinode_t *from) > + xfs_dinode_t *from, > + int tagged) > { > + uint32_t uid, gid, tag; > + > to->di_magic = be16_to_cpu(from->di_magic); > to->di_mode = be16_to_cpu(from->di_mode); > to->di_version = from ->di_version; > to->di_format = from->di_format; > to->di_onlink = be16_to_cpu(from->di_onlink); > - to->di_uid = be32_to_cpu(from->di_uid); > - to->di_gid = be32_to_cpu(from->di_gid); > + > + uid = be32_to_cpu(from->di_uid); > + gid = be32_to_cpu(from->di_gid); > + tag = be16_to_cpu(from->di_tag); > + > + to->di_uid = INOTAG_UID(tagged, uid, gid); > + to->di_gid = INOTAG_GID(tagged, uid, gid); > + to->di_tag = INOTAG_TAG(tagged, uid, gid, tag); Changes like this will still have to exist solely in the vserver tree as they rely on core changes in the vserver tree. What is important is that on-disk formats are compatible between the two trees.... > @@ -669,6 +689,10 @@ _xfs_dic2xflags( > flags |= XFS_XFLAG_FILESTREAM; > } > > + if (di_vflags & XFS_DIVFLAG_BARRIER) > + flags |= FS_BARRIER_FL; > + if (di_vflags & XFS_DIVFLAG_COW) > + flags |= FS_COW_FL; > return flags; Similarly for bits like this. > --- linux-3.7/fs/xfs/xfs_ioctl.c 2012-12-11 15:47:37.000000000 +0000 > +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_ioctl.c 2012-12-11 15:56:32.000000000 +0000 > @@ -26,7 +26,7 @@ > #include "xfs_bmap_btree.h" > #include "xfs_dinode.h" > #include "xfs_inode.h" > -#include "xfs_ioctl.h" > +// #include "xfs_ioctl.h" > #include "xfs_rtalloc.h" > #include "xfs_itable.h" > #include "xfs_error.h" That doesn't look right.... > @@ -892,6 +900,10 @@ xfs_diflags_to_linux( > inode->i_flags |= S_IMMUTABLE; > else > inode->i_flags &= ~S_IMMUTABLE; > + if (xflags & XFS_XFLAG_IXUNLINK) > + inode->i_flags |= S_IXUNLINK; > + else > + inode->i_flags &= ~S_IXUNLINK; > if (xflags & XFS_XFLAG_APPEND) > inode->i_flags |= S_APPEND; > else > @@ -1396,10 +1408,18 @@ xfs_file_ioctl( > case XFS_IOC_FSGETXATTRA: > return xfs_ioc_fsgetxattr(ip, 1, arg); > case XFS_IOC_FSSETXATTR: > + if (IS_BARRIER(inode)) { > + vxwprintk_task(1, "messing with the barrier."); > + return -XFS_ERROR(EACCES); > + } > return xfs_ioc_fssetxattr(ip, filp, arg); > case XFS_IOC_GETXFLAGS: > return xfs_ioc_getxflags(ip, arg); > case XFS_IOC_SETXFLAGS: > + if (IS_BARRIER(inode)) { > + vxwprintk_task(1, "messing with the barrier."); > + return -XFS_ERROR(EACCES); > + } > return xfs_ioc_setxflags(ip, filp, arg); And these all rely on vserver infrastructure, so would have to remain in the vserver tree.... > > case XFS_IOC_FSSETDM: { > --- linux-3.7/fs/xfs/xfs_ioctl.h 2011-10-24 16:45:31.000000000 +0000 > +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_ioctl.h 2012-12-11 15:56:32.000000000 +0000 > @@ -70,6 +70,12 @@ xfs_handle_to_dentry( > void __user *uhandle, > u32 hlen); > > +extern int > +xfs_sync_flags( > + struct inode *inode, > + int flags, > + int vflags); > + > extern long > xfs_file_ioctl( > struct file *filp, > --- linux-3.7/fs/xfs/xfs_iops.c 2012-10-04 13:27:44.000000000 +0000 > +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_iops.c 2012-12-11 15:56:32.000000000 +0000 > @@ -28,6 +28,7 @@ > #include "xfs_bmap_btree.h" > #include "xfs_dinode.h" > #include "xfs_inode.h" > +#include "xfs_ioctl.h" > #include "xfs_bmap.h" > #include "xfs_rtalloc.h" > #include "xfs_error.h" > @@ -46,6 +47,7 @@ > #include <linux/security.h> > #include <linux/fiemap.h> > #include <linux/slab.h> > +#include <linux/vs_tag.h> > > static int > xfs_initxattrs( > @@ -421,6 +423,7 @@ xfs_vn_getattr( > stat->nlink = ip->i_d.di_nlink; > stat->uid = ip->i_d.di_uid; > stat->gid = ip->i_d.di_gid; > + stat->tag = ip->i_d.di_tag; > stat->ino = ip->i_ino; > stat->atime = inode->i_atime; > stat->mtime = inode->i_mtime; > @@ -1033,6 +1036,7 @@ static const struct inode_operations xfs > .listxattr = xfs_vn_listxattr, > .fiemap = xfs_vn_fiemap, > .update_time = xfs_vn_update_time, > + .sync_flags = xfs_sync_flags, > }; > > static const struct inode_operations xfs_dir_inode_operations = { > @@ -1059,6 +1063,7 @@ static const struct inode_operations xfs > .removexattr = generic_removexattr, > .listxattr = xfs_vn_listxattr, > .update_time = xfs_vn_update_time, > + .sync_flags = xfs_sync_flags, > }; as would all these "sync_flag" changes. > --- linux-3.7/fs/xfs/xfs_super.c 2012-12-11 15:47:37.000000000 +0000 > +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_super.c 2012-12-11 17:36:47.000000000 +0000 > @@ -114,6 +114,9 @@ mempool_t *xfs_ioend_pool; > #define MNTOPT_NODELAYLOG "nodelaylog" /* Delayed logging disabled */ > #define MNTOPT_DISCARD "discard" /* Discard unused blocks */ > #define MNTOPT_NODISCARD "nodiscard" /* Do not discard unused blocks */ > +#define MNTOPT_TAGXID "tagxid" /* context tagging for inodes */ > +#define MNTOPT_TAGGED "tag" /* context tagging for inodes */ > +#define MNTOPT_NOTAGTAG "notag" /* do not use context tagging */ ..... > @@ -1149,6 +1170,16 @@ xfs_fs_remount( > case Opt_inode32: > mp->m_maxagi = xfs_set_inode32(mp); > break; > + case Opt_tag: > + if (!(sb->s_flags & MS_TAGGED)) { > + printk(KERN_INFO > + "XFS: %s: tagging not permitted on remount.\n", > + sb->s_id); > + return -EINVAL; > + } > + break; > + case Opt_notag: > + break; > default: > /* > * Logically we would return an error here to prevent > @@ -1368,6 +1399,9 @@ xfs_fs_fill_super( > if (error) > goto out_free_sb; > > + if (mp->m_flags & XFS_MOUNT_TAGGED) > + sb->s_flags |= MS_TAGGED; > + Why wouldn't you use vfs-based mount option parsing here and hence not need XFS_MOUNT_TAGGED or all the duplicated parsing in each filesystem? This seems like you all this could be removed from the patch, and the XFS code just checks if (mp->m_super->s_flags & MS_TAGGED) is true.... > /* > * we must configure the block size in the superblock before we run the > * full mount process as the mount process can lookup and cache inodes. > --- linux-3.7/fs/xfs/xfs_vnodeops.c 2012-10-04 13:27:44.000000000 +0000 > +++ linux-3.7-vs2.3.5.1/fs/xfs/xfs_vnodeops.c 2012-12-11 15:56:32.000000000 +0000 > @@ -103,6 +103,77 @@ xfs_readlink_bmap( > return error; > } > > + > +STATIC void > +xfs_get_inode_flags( > + xfs_inode_t *ip) > +{ > + struct inode *inode = VFS_I(ip); > + unsigned int flags = inode->i_flags; > + unsigned int vflags = inode->i_vflags; > + > + if (flags & S_IMMUTABLE) > + ip->i_d.di_flags |= XFS_DIFLAG_IMMUTABLE; > + else > + ip->i_d.di_flags &= ~XFS_DIFLAG_IMMUTABLE; > + if (flags & S_IXUNLINK) > + ip->i_d.di_flags |= XFS_DIFLAG_IXUNLINK; > + else > + ip->i_d.di_flags &= ~XFS_DIFLAG_IXUNLINK; > + > + if (vflags & V_BARRIER) > + ip->i_d.di_vflags |= XFS_DIVFLAG_BARRIER; > + else > + ip->i_d.di_vflags &= ~XFS_DIVFLAG_BARRIER; > + if (vflags & V_COW) > + ip->i_d.di_vflags |= XFS_DIVFLAG_COW; > + else > + ip->i_d.di_vflags &= ~XFS_DIVFLAG_COW; > +} > + > +int > +xfs_sync_flags( > + struct inode *inode, > + int flags, > + int vflags) > +{ > + struct xfs_inode *ip = XFS_I(inode); > + struct xfs_mount *mp = ip->i_mount; > + struct xfs_trans *tp; > + unsigned int lock_flags = 0; > + int code; > + > + tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_NOT_SIZE); > + code = xfs_trans_reserve(tp, 0, XFS_ICHANGE_LOG_RES(mp), 0, 0, 0); > + if (code) > + goto error_out; > + > + xfs_ilock(ip, XFS_ILOCK_EXCL); > + xfs_trans_ijoin(tp, ip, 0); > + > + inode->i_flags = flags; > + inode->i_vflags = vflags; > + xfs_get_inode_flags(ip); > + > + xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); > + xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_CHG); > + > + XFS_STATS_INC(xs_ig_attrchg); > + > + if (mp->m_flags & XFS_MOUNT_WSYNC) > + xfs_trans_set_sync(tp); > + code = xfs_trans_commit(tp, 0); > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > + return code; > + > +error_out: > + xfs_trans_cancel(tp, 0); > + if (lock_flags) > + xfs_iunlock(ip, XFS_ILOCK_EXCL); > + return code; > +} Seems strange to ad a "sync_flags" method like this setting/clearing the flags already has a generic method via FS_IOC_SETFLAGS. Regardless, this is something that would need to be kept in the vserver tree... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs