On Fri, Dec 20, 2019 at 02:49:36AM +0000, Chris Down wrote:
> In Facebook production we are seeing heavy inode number wraparounds on
> tmpfs. On affected tiers, in excess of 10% of hosts show multiple files
> with different content and the same inode number, with some servers even
> having as many as 150 duplicated inode numbers with differing file
> content.
>
> This causes actual, tangible problems in production. For example, we
> have complaints from those working on remote caches that their
> application is reporting cache corruptions because it uses (device,
> inodenum) to establish the identity of a particular cache object, but

...but you cannot delete the (dev, inum) tuple from the cache index when
you remove a cache object??

> because it's not unique any more, the application refuses to continue
> and reports cache corruption. Even worse, sometimes applications may not
> even detect the corruption but may continue anyway, causing phantom and
> hard to debug behaviour.
>
> In general, userspace applications expect that (device, inodenum) should
> be enough to be uniquely point to one inode, which seems fair enough.

Except that it's not.  (dev, inum, generation) uniquely points to an
instance of an inode from creation to the last unlink.

--D

> This patch changes get_next_ino to use up to min(sizeof(ino_t), 8) bytes
> to reduce the likelihood of wraparound. On architectures with 32-bit
> ino_t the problem is, at least, not made any worse than it is right now.
>
> I noted the concern in the comment above about 32-bit applications on a
> 64-bit kernel with 32-bit wide ino_t in userspace, as documented by Jeff
> in the commit message for 866b04fc, but these applications are going to
> get EOVERFLOW on filesystems with non-volatile inode numbers anyway,
> since those will likely be 64-bit. Concerns about that seem slimmer
> compared to the disadvantages this presents for known, real users of
> this functionality on platforms with a 64-bit ino_t.
>
> Other approaches I've considered:
>
> - Use an IDA. If this is a problem for users with 32-bit ino_t as well,
>   this seems a feasible approach. For now this change is non-intrusive
>   enough, though, and doesn't make the situation any worse for them than
>   present at least.
> - Look for other approaches in userspace. I think this is less
>   feasible -- users do need to have a way to reliably determine inode
>   identity, and the risk of wraparound with a 2^32-sized counter is
>   pretty high, quite clearly manifesting in production for workloads
>   which make heavy use of tmpfs.
>
> Signed-off-by: Chris Down <chris@xxxxxxxxxxxxxx>
> Reported-by: Phyllipe Medeiros <phyllipe@xxxxxx>
> Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
> Cc: Jeff Layton <jlayton@xxxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Tejun Heo <tj@xxxxxxxxxx>
> Cc: linux-fsdevel@xxxxxxxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Cc: kernel-team@xxxxxx
> ---
>  fs/inode.c         | 29 ++++++++++++++++++-----------
>  include/linux/fs.h |  2 +-
>  2 files changed, 19 insertions(+), 12 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index aff2b5831168..8193c17e2d16 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -870,26 +870,33 @@ static struct inode *find_inode_fast(struct super_block *sb,
>   * This does not significantly increase overflow rate because every CPU can
>   * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
>   * NR_CPUS*(LAST_INO_BATCH-1) wastage. At 4096 and 1024, this is ~0.1% of the
> - * 2^32 range, and is a worst-case. Even a 50% wastage would only increase
> - * overflow rate by 2x, which does not seem too significant.
> + * 2^32 range (for 32-bit ino_t), and is a worst-case. Even a 50% wastage would
> + * only increase overflow rate by 2x, which does not seem too significant. With
> + * a 64-bit ino_t, overflow in general is fairly hard to achieve.
>   *
> - * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
> - * error if st_ino won't fit in target struct field. Use 32bit counter
> - * here to attempt to avoid that.
> + * Care should be taken not to overflow when at all possible, since generally
> + * userspace depends on (device, inodenum) being reliably unique.
>   */
>  #define LAST_INO_BATCH 1024
> -static DEFINE_PER_CPU(unsigned int, last_ino);
> +static DEFINE_PER_CPU(ino_t, last_ino);
>
> -unsigned int get_next_ino(void)
> +ino_t get_next_ino(void)
>  {
> -	unsigned int *p = &get_cpu_var(last_ino);
> -	unsigned int res = *p;
> +	ino_t *p = &get_cpu_var(last_ino);
> +	ino_t res = *p;
>
>  #ifdef CONFIG_SMP
>  	if (unlikely((res & (LAST_INO_BATCH-1)) == 0)) {
> -		static atomic_t shared_last_ino;
> -		int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
> +		static atomic64_t shared_last_ino;
> +		u64 next = atomic64_add_return(LAST_INO_BATCH,
> +					       &shared_last_ino);
>
> +		/*
> +		 * This might get truncated if ino_t is 32-bit, and so be more
> +		 * susceptible to wrap around than on environments where ino_t
> +		 * is 64-bit, but that's really no worse than always encoding
> +		 * `res` as unsigned int.
> +		 */
>  		res = next - LAST_INO_BATCH;
>  	}
>  #endif
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 190c45039359..ca1a04334c9e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3052,7 +3052,7 @@ static inline void lockdep_annotate_inode_mutex_key(struct inode *inode) { };
>  #endif
>  extern void unlock_new_inode(struct inode *);
>  extern void discard_new_inode(struct inode *);
> -extern unsigned int get_next_ino(void);
> +extern ino_t get_next_ino(void);
>  extern void evict_inodes(struct super_block *sb);
>
>  extern void __iget(struct inode * inode);
> --
> 2.24.1
>
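
For anyone following the quoted hunk, a rough userspace model of the
batching and of the truncation mentioned in the new comment may help.
This is only a sketch under stated assumptions, not the kernel code:
struct ino_model, model_trunc(), model_next_ino() and
worst_case_wastage() are made-up names, the atomic and per-CPU storage
are replaced by plain fields for a single modelled CPU, and the final
increment is assumed because the quoted hunk stops before the end of
the function.

```c
#include <assert.h>
#include <stdint.h>

#define LAST_INO_BATCH 1024u

/* Hypothetical model: the shared counter plus one CPU's cursor. */
struct ino_model {
	uint64_t shared_last_ino;	/* stands in for the atomic64_t */
	uint64_t last_ino;		/* stands in for the per-CPU last_ino */
	unsigned int ino_bits;		/* 32 or 64: modelled width of ino_t */
};

/* Truncate a value to the modelled ino_t width. */
static uint64_t model_trunc(const struct ino_model *m, uint64_t v)
{
	return m->ino_bits == 32 ? (uint64_t)(uint32_t)v : v;
}

/* Hand out the next number, taking a fresh batch at each batch boundary. */
static uint64_t model_next_ino(struct ino_model *m)
{
	uint64_t res = m->last_ino;

	assert(m->ino_bits == 32 || m->ino_bits == 64);

	if ((res & (LAST_INO_BATCH - 1)) == 0) {
		/* Reserve LAST_INO_BATCH numbers from the shared counter. */
		m->shared_last_ino += LAST_INO_BATCH;
		/* With a 32-bit ino_t the 64-bit value is truncated here. */
		res = model_trunc(m, m->shared_last_ino - LAST_INO_BATCH);
	}

	/* Assumption: the unquoted tail of the function increments res. */
	res = model_trunc(m, res + 1);
	m->last_ino = res;
	return res;
}

/* Worst-case share of a 2^32 space stranded in partly used batches. */
static double worst_case_wastage(unsigned int nr_cpus)
{
	return (double)nr_cpus * (LAST_INO_BATCH - 1) / 4294967296.0;
}
```

Starting both models with the shared counter at 2^32, the 32-bit model
hands out 1 again (the same number a freshly booted model hands out
first), while the 64-bit model hands out 2^32 + 1.
worst_case_wastage(4096) evaluates 4096 * 1023 / 2^32, which is the
"~0.1%" figure in the quoted comment.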