On Wed, Feb 19, 2020 at 4:25 PM Miklos Szeredi <miklos@xxxxxxxxxx> wrote: > > On Wed, Jan 1, 2020 at 6:58 PM Amir Goldstein <amir73il@xxxxxxxxx> wrote: > > > > When xino feature is enabled and a real directory inode number > > overflows the lower xino bits, we cannot map this directory inode > > number to a unique and persistent inode number and we fall back to > > the real inode st_ino and overlay st_dev. > > > > The real inode st_ino with high bits may collide with a lower inode > > number on overlay st_dev that was mapped using xino. > > > > To avoid possible collision with legitimate xino values, map a non > > persistent inode number to a dedicated range in the xino address space. > > The dedicated range is created by adding one more bit to the number of > > reserved high xino bits. We could have added just one more fsid, but > > that would have had the undesired effect of changing persistent overlay > > inode numbers on kernel or require more complex xino mapping code. > > > > The non persistent inode number is allocated with get_next_ino() > > and stored in i_generation. To reduce the burn rate of get_next_ino() > > numbers in the system, we avoid calling get_next_ino() on non dir, > > because we are not going to use it and we reuse i_generation from > > recycled ovl_inode objects. > > > > Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx> > > --- > > fs/overlayfs/inode.c | 54 ++++++++++++++++++++++++++++++++-------- > > fs/overlayfs/overlayfs.h | 7 ++++++ > > fs/overlayfs/super.c | 28 ++++++++++++--------- > > 3 files changed, 66 insertions(+), 23 deletions(-) > > > > diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c > > index 04e8e8de2012..415d9efa4799 100644 > > --- a/fs/overlayfs/inode.c > > +++ b/fs/overlayfs/inode.c > > @@ -79,6 +79,7 @@ static int ovl_map_dev_ino(struct dentry *dentry, struct kstat *stat, int fsid) > > { > > bool samefs = ovl_same_fs(dentry->d_sb); > > unsigned int xinobits = ovl_xino_bits(dentry->d_sb); > > + unsigned int shift = 64 - xinobits; > > > > if (samefs) { > > /* > > @@ -89,7 +90,6 @@ static int ovl_map_dev_ino(struct dentry *dentry, struct kstat *stat, int fsid) > > stat->dev = dentry->d_sb->s_dev; > > return 0; > > } else if (xinobits) { > > - unsigned int shift = 64 - xinobits; > > /* > > * All inode numbers of underlying fs should not be using the > > * high xinobits, so we use high xinobits to partition the > > @@ -116,11 +116,25 @@ static int ovl_map_dev_ino(struct dentry *dentry, struct kstat *stat, int fsid) > > * overlay mount boundaries. > > * > > * If not all layers are on the same fs the pair {real st_ino; > > - * overlay st_dev} is not unique, so use the non persistent > > - * overlay st_ino for directories. > > + * overlay st_dev} is not unique, so use a non persistent > > + * overlay st_ino for directories, which is allocated with > > + * get_next_ino() and stored in i_generation. > > + * > > + * To avoid ino collision with legitimate xino values from upper > > + * layer (fsid 0), use fsid 1 to map the non persistent inode > > + * numbers to the unified st_ino address space. > > + * > > + * In this case (xino bits overflow on directory inode), the > > + * value of overlay inode st_ino will not be consistent with > > + * d_ino and i_ino. i_ino will have the value of the real inode > > + * i_ino and d_ino will have either the value of i_ino or the > > + * value of st_ino, depending on the directory iteration mode > > + * that is used on the parent (i.e. real/merge/impure). > > */ > > stat->dev = dentry->d_sb->s_dev; > > - stat->ino = dentry->d_inode->i_ino; > > + stat->ino = dentry->d_inode->i_generation; > > I think we should avoid misusing i_generation for this. It's > confusing and not woth it, IMO. OK, I though I could maintain i_ino -> d_ino consistency this way and make nfsdv3 happy, but it's not true for all types of dir iteration, so I'll drop it. > > > + if (xinobits) > > + stat->ino |= ((u64)1) << shift; > > I don't like this magic shifting. > > > > } else { > > /* > > * For non-samefs setup, if we cannot map all layers st_ino > > @@ -128,7 +142,7 @@ static int ovl_map_dev_ino(struct dentry *dentry, struct kstat *stat, int fsid) > > * is unique per underlying fs, so we use the unique anonymous > > * bdev assigned to the underlying fs. > > */ > > - stat->dev = OVL_FS(dentry->d_sb)->fs[fsid].pseudo_dev; > > + stat->dev = ovl_dentry_fs(dentry, fsid)->pseudo_dev; > > } > > > > return 0; > > @@ -564,6 +578,7 @@ static inline void ovl_lockdep_annotate_inode_mutex_key(struct inode *inode) > > static void ovl_map_ino(struct inode *inode, unsigned long ino, int fsid) > > { > > int xinobits = ovl_xino_bits(inode->i_sb); > > + bool overflow = ino >> (64 - xinobits); > > > > /* > > * When d_ino is consistent with st_ino (samefs or i_ino has enough > > @@ -571,13 +586,30 @@ static void ovl_map_ino(struct inode *inode, unsigned long ino, int fsid) > > * so inode number exposed via /proc/locks and a like will be > > * consistent with d_ino and st_ino values. An i_ino value inconsistent > > * with d_ino also causes nfsd readdirplus to fail. > > + * > > + * When real inode bits overflow into xino bits, we leave i_ino value as > > + * the real inode to be consitent with d_ino. For directory inodes on > > + * non-samefs with xino disabled or xino overflow, we reserve a unique > > + * 32bit generation number, to be used for resolving st_ino collisions > > + * in ovl_map_dev_ino(). With xino disabled, this 32bit number is also > > + * used as i_ino. > > */ > > - if (ovl_same_dev(inode->i_sb)) { > > - inode->i_ino = ino; > > - if (xinobits && fsid && !(ino >> (64 - xinobits))) > > - inode->i_ino |= (unsigned long)fsid << (64 - xinobits); > > - } else { > > - inode->i_ino = get_next_ino(); > > + inode->i_ino = ino; > > + if (ovl_same_dev(inode->i_sb) && !overflow) { > > + inode->i_ino |= (unsigned long)fsid << (64 - xinobits); > > While this makes sense on 64bit arch, it's going to overflow on 32bit > (due to i_ino being "unsigned long"). It's not clear here, but on 32bit, xinobits is 0: ofs->xino_mode = BITS_PER_LONG - 32; To the expression doesn't change i_ino. Correct? Want me to clarify that by comment or by code? > > > + } else if (S_ISDIR(inode->i_mode)) { > > + /* > > + * Reuse unique 32bit ino from recycled ovl_inode object. > > + * get_next_ino() wraps around at 32bit, but may be extended > > + * to 64bit in the future, so be prepared. > > + */ > > + if (!inode->i_generation) { > > + inode->i_generation = (u32)get_next_ino(); > > + if (unlikely(!inode->i_generation)) > > + inode->i_generation = (u32)get_next_ino(); > > + } > > And this is really messy. How about an atomic64_t counter in ovl_fs > instead? Do we really care about the performance of inode allocation > for the overflow case? No, that sounds right. > > > + if (!ovl_same_dev(inode->i_sb)) > > + inode->i_ino = inode->i_generation; > > } > > } > > > > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h > > index 140510d24d9f..c0b15fd2b395 100644 > > --- a/fs/overlayfs/overlayfs.h > > +++ b/fs/overlayfs/overlayfs.h > > @@ -311,6 +311,13 @@ static inline unsigned int ovl_xino_bits(struct super_block *sb) > > return ofs->xino_bits; > > } > > > > +static inline struct ovl_sb *ovl_dentry_fs(struct dentry *dentry, int fsid) > > +{ > > + /* fsid bit 1 is reserved for non persistent ino range */ > > + WARN_ON_ONCE(fsid & 1); > > + return &OVL_FS(dentry->d_sb)->fs[fsid >> 1]; > > Again, magic shifts. Would be so much nicer to leave the fsid > definition as the index into the ->fs array and deal with this when > creating the st_ino/i_ino values. OK. > > > +} > > + > > /* All layers on same fs? */ > > static inline bool ovl_same_fs(struct super_block *sb) > > { > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > > index d072f982d3de..d636a23df541 100644 > > --- a/fs/overlayfs/super.c > > +++ b/fs/overlayfs/super.c > > @@ -1272,8 +1272,8 @@ static bool ovl_lower_uuid_ok(struct ovl_fs *ofs, const uuid_t *uuid) > > return true; > > } > > > > -/* Get a unique fsid for the layer */ > > -static int ovl_get_fsid(struct ovl_fs *ofs, const struct path *path) > > +/* Get a unique fs for the layer */ > > +static int ovl_get_fs(struct ovl_fs *ofs, const struct path *path) > > { > > struct super_block *sb = path->mnt->mnt_sb; > > unsigned int i; > > @@ -1353,9 +1353,9 @@ static int ovl_get_layers(struct super_block *sb, struct ovl_fs *ofs, > > for (i = 0; i < numlower; i++) { > > struct vfsmount *mnt; > > struct inode *trap; > > - int fsid; > > + int n; > > > > - err = fsid = ovl_get_fsid(ofs, &stack[i]); > > + err = n = ovl_get_fs(ofs, &stack[i]); > > if (err < 0) > > goto out; > > > > @@ -1387,9 +1387,10 @@ static int ovl_get_layers(struct super_block *sb, struct ovl_fs *ofs, > > ofs->layers[ofs->numlower].trap = trap; > > ofs->layers[ofs->numlower].mnt = mnt; > > ofs->layers[ofs->numlower].idx = ofs->numlower; > > - ofs->layers[ofs->numlower].fsid = fsid; > > - ofs->layers[ofs->numlower].fs = &ofs->fs[fsid]; > > - ofs->fs[fsid].is_lower = true; > > + /* fsid bit 1 is reserved for non persistent ino range */ > > + ofs->layers[ofs->numlower].fsid = n << 1; > > + ofs->layers[ofs->numlower].fs = &ofs->fs[n]; > > + ofs->fs[n].is_lower = true; > > } > > > > /* > > @@ -1398,7 +1399,8 @@ static int ovl_get_layers(struct super_block *sb, struct ovl_fs *ofs, > > * free high bits in underlying fs to hold the unique fsid. > > * If overlayfs does encounter underlying inodes using the high xino > > * bits reserved for fsid, it emits a warning and uses the original > > - * inode number. > > + * inode number or a non persistent inode number allocated from a > > + * dedicated range. > > */ > > if (ofs->numfs == 1 || (ofs->numfs == 2 && !ofs->upper_mnt)) { > > if (ofs->config.xino == OVL_XINO_ON) > > @@ -1408,11 +1410,13 @@ static int ovl_get_layers(struct super_block *sb, struct ovl_fs *ofs, > > } else if (ofs->config.xino == OVL_XINO_ON && !ofs->xino_bits) { > > /* > > * This is a roundup of number of bits needed for encoding > > - * fsid, where fsid 0 is reserved for upper fs even with > > - * lower only overlay. > > + * fsid, where fsid 0 is reserved for upper fs (even with > > + * lower only overlay) and fsid bit 1 is reserved for the non > > + * persistent inode number range that is used for resolving > > + * xino lower bits overflow. > > */ > > - BUILD_BUG_ON(ilog2(OVL_MAX_STACK) > 31); > > - ofs->xino_bits = ilog2(ofs->numfs - 1) + 1; > > + BUILD_BUG_ON(ilog2(OVL_MAX_STACK) > 30); > > + ofs->xino_bits = ilog2(ofs->numfs - 1) + 2; > > > Just realized, this patch is obsolete (xino_bits ->xino_mode). > > Anyway, I think the comments are valid for the updated patch as well. > Yeh, it's mostly the same. Branch ovl-ino is already rebased. If you have no other comments, I'll prepare v2 and test it with 5.6-rc2. Thanks, Amir.