On Mon, Apr 06, 2020 at 10:06:03AM -0700, Darrick J. Wong wrote: > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote: > > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload > > which > > 1. Creates 1,000,000 255-byte sized xattrs, > > 2. Deletes 50% of these xattrs in an alternating manner, > > 3. Tries to create 400,000 new 255-byte sized xattrs > > causes the following message to be printed on the console, > > > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00 > > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173 > > > > This indicates that we overflowed the 16-bits wide xattr extent counter. > > > > I have been informed that there are instances where a single file has > > > 100 million hardlinks. With parent pointers being stored in xattr, > > we will overflow the 16-bits wide xattr extent counter when large > > number of hardlinks are created. > > > > Hence this commit extends xattr extent counter to 32-bits. It also introduces > > an incompat flag to prevent older kernels from mounting newer filesystems with > > 32-bit wide xattr extent counter. > > > > Signed-off-by: Chandan Rajendra <chandanrlinux@xxxxxxxxx> > > --- > > fs/xfs/libxfs/xfs_format.h | 28 +++++++++++++++++++++------- > > fs/xfs/libxfs/xfs_inode_buf.c | 27 +++++++++++++++++++-------- > > fs/xfs/libxfs/xfs_inode_fork.c | 3 ++- > > fs/xfs/libxfs/xfs_log_format.h | 5 +++-- > > fs/xfs/libxfs/xfs_types.h | 4 ++-- > > fs/xfs/scrub/inode.c | 7 ++++--- > > fs/xfs/xfs_inode_item.c | 3 ++- > > fs/xfs/xfs_log_recover.c | 13 ++++++++++--- > > 8 files changed, 63 insertions(+), 27 deletions(-) > > > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h > > index 045556e78ee2c..0a4266b0d46e1 100644 > > --- a/fs/xfs/libxfs/xfs_format.h > > +++ b/fs/xfs/libxfs/xfs_format.h > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature( > > #define XFS_SB_FEAT_INCOMPAT_FTYPE (1 << 0) /* filetype in dirent */ > > #define XFS_SB_FEAT_INCOMPAT_SPINODES (1 << 1) /* sparse inode chunks */ > > #define XFS_SB_FEAT_INCOMPAT_META_UUID (1 << 2) /* metadata UUID */ > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3) > > If you're going to introduce an INCOMPAT feature, please also use the > opportunity to convert xattrs to something resembling the dir v3 format, > where we index free space within each block so that we can speed up attr > setting with 100 million attrs. Not necessary. Chandan has already spent a lot of time investigating that - I suggested doing the investigation probably a year ago when he was looking for stuff to do knowing that this could be a problem parent pointers hit. Long story short - there's no degradation in performance in the dabtree out to tens of millions of records with different fixed size or random sized attributes, nor does various combinations of insert/lookup/remove/replace operations seem to impact the tree performance at scale. IOWs, we hit the 16 bit extent limits of the attribute trees without finding any degradation in performance. Hence we concluded that the dabtree structure does not require significant modification or optimisation to work well with typical parent pointer attribute demands... As for free space indexes.... The issue with the directory structure that requires external free space is that the directory data is not part of the dabtree itself. The attribute fork stores all the attributes at the leaves of the dabtree, while the directory structure stores the directory data in external blocks and the dabtree only contains the name hash index that points to the external data. i.e. When we add an attribute to the dabtree, we split/merge leaves of the tree based on where the name hash index tells us it needs to be inserted/removed from. i.e. we make space available or collapse sparse leaves of the dabtree as a side effect of inserting or removing objects. The directory structure is very different. The dirents cannot change location as their logical offset into the dir data segment is used as the readdir/seekdir/telldir cookie. Therefore that location is not allowed to change for the life of the dirent and so we can't store them in the leaves of a dabtree indexed in hash order because the offset into the tree would change as other entries are inserted and removed. Hence when we remove dirents, we must leave holes in the data segment so the rest of the dirent data does not change logical offset. The directory name hash index - the dabtree bit - is in a separate segment (the 2nd one). Because it only stores pointers to dirents in the data segment, it doesn't need to leave holes - the dabtree just merge/splits as required as pointers to the dir data segment are added/removed - and has no free space tracking. Hence when we go to add a dirent, we need to find the best free space in the dir data segment to add that dirent. This requires a dir data segment free space index, and that is held in the 3rd dir segment. Once we've found the best free space via lookup in the free space index, we go modify the dir data block it points to, then update the dabtree to point the name hash at that new dirent. IOWs, the requirement for a free space map in the directory structure results from storing the dirent data externally to the dabtree. Attributes are stored directly in the leaves of the dabtree - except for remote attributes which can be anywhere in the BMBT address space - and hence do no need external free space tracking to determine where to best insert them... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx