On 2019/12/20 10:49, Chris Down wrote:
> In Facebook production we are seeing heavy inode number wraparounds on
> tmpfs. On affected tiers, in excess of 10% of hosts show multiple files
> with different content and the same inode number, with some servers even
> having as many as 150 duplicated inode numbers with differing file
> content.
>
> This causes actual, tangible problems in production. For example, we
> have complaints from those working on remote caches that their
> application is reporting cache corruptions because it uses (device,
> inodenum) to establish the identity of a particular cache object, but
> because it's not unique any more, the application refuses to continue
> and reports cache corruption. Even worse, sometimes applications may not
> even detect the corruption but may continue anyway, causing phantom and
> hard to debug behaviour.
>
> In general, userspace applications expect that (device, inodenum) should
> be enough to uniquely point to one inode, which seems fair enough. This
> patch changes get_next_ino to use up to min(sizeof(ino_t), 8) bytes to
> reduce the likelihood of wraparound. On architectures with 32-bit ino_t
> the problem is, at least, not made any worse than it is right now.
>
> I noted the concern in the comment above about 32-bit applications on a
> 64-bit kernel with 32-bit wide ino_t in userspace, as documented by Jeff
> in the commit message for 866b04fc, but these applications are going to
> get EOVERFLOW on filesystems with non-volatile inode numbers anyway,
> since those will likely be 64-bit. Concerns about that seem slimmer
> compared to the disadvantages this presents for known, real users of
> this functionality on platforms with a 64-bit ino_t.
>
> Other approaches I've considered:
>
> - Use an IDA. If this is a problem for users with 32-bit ino_t as well,
>   this seems a feasible approach. For now this change is non-intrusive
>   enough, though, and doesn't make the situation any worse for them than
>   present at least.
> - Look for other approaches in userspace. I think this is less
>   feasible -- users do need to have a way to reliably determine inode
>   identity, and the risk of wraparound with a 2^32-sized counter is
>   pretty high, quite clearly manifesting in production for workloads
>   which make heavy use of tmpfs.

I have sent an IDA-based approach before; see the details at
https://patchwork.kernel.org/patch/11254001/.
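For reference, the rough shape of an IDA-based allocator is something like
the sketch below. This is purely illustrative and not the code in the patch
linked above; the names (shmem_ino_ida, shmem_alloc_ino, shmem_free_ino) are
made up here:

#include <linux/idr.h>
#include <linux/gfp.h>

static DEFINE_IDA(shmem_ino_ida);

/* Hand out a unique, non-zero inode number from the IDA. */
static int shmem_alloc_ino(ino_t *ino)
{
        int ret = ida_alloc_min(&shmem_ino_ida, 1, GFP_KERNEL);

        if (ret < 0)
                return ret;
        *ino = ret;
        return 0;
}

/* Return the number to the pool when the inode is destroyed. */
static void shmem_free_ino(ino_t ino)
{
        ida_free(&shmem_ino_ida, ino);
}

The upside is that inode numbers are recycled as inodes are destroyed, which
would also help 32-bit ino_t users; the downside is the extra allocation and
freeing work on every inode create/destroy compared to the per-CPU batched
counter.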

> Signed-off-by: Chris Down <chris@xxxxxxxxxxxxxx>
> Reported-by: Phyllipe Medeiros <phyllipe@xxxxxx>
> Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
> Cc: Jeff Layton <jlayton@xxxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Tejun Heo <tj@xxxxxxxxxx>
> Cc: linux-fsdevel@xxxxxxxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Cc: kernel-team@xxxxxx
> ---
>  fs/inode.c         | 29 ++++++++++++++++++-----------
>  include/linux/fs.h |  2 +-
>  2 files changed, 19 insertions(+), 12 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index aff2b5831168..8193c17e2d16 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -870,26 +870,33 @@ static struct inode *find_inode_fast(struct super_block *sb,
>   * This does not significantly increase overflow rate because every CPU can
>   * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
>   * NR_CPUS*(LAST_INO_BATCH-1) wastage. At 4096 and 1024, this is ~0.1% of the
> - * 2^32 range, and is a worst-case. Even a 50% wastage would only increase
> - * overflow rate by 2x, which does not seem too significant.
> + * 2^32 range (for 32-bit ino_t), and is a worst-case. Even a 50% wastage would
> + * only increase overflow rate by 2x, which does not seem too significant. With
> + * a 64-bit ino_t, overflow in general is fairly hard to achieve.
>   *
> - * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
> - * error if st_ino won't fit in target struct field. Use 32bit counter
> - * here to attempt to avoid that.
> + * Care should be taken not to overflow when at all possible, since generally
> + * userspace depends on (device, inodenum) being reliably unique.
>   */
>  #define LAST_INO_BATCH 1024
> -static DEFINE_PER_CPU(unsigned int, last_ino);
> +static DEFINE_PER_CPU(ino_t, last_ino);
>
> -unsigned int get_next_ino(void)
> +ino_t get_next_ino(void)
>  {
> -        unsigned int *p = &get_cpu_var(last_ino);
> -        unsigned int res = *p;
> +        ino_t *p = &get_cpu_var(last_ino);
> +        ino_t res = *p;
>
>  #ifdef CONFIG_SMP
>          if (unlikely((res & (LAST_INO_BATCH-1)) == 0)) {
> -                static atomic_t shared_last_ino;
> -                int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
> +                static atomic64_t shared_last_ino;
> +                u64 next = atomic64_add_return(LAST_INO_BATCH,
> +                                               &shared_last_ino);
>
> +                /*
> +                 * This might get truncated if ino_t is 32-bit, and so be more
> +                 * susceptible to wrap around than on environments where ino_t
> +                 * is 64-bit, but that's really no worse than always encoding
> +                 * `res` as unsigned int.
> +                 */
>                  res = next - LAST_INO_BATCH;
>          }

This approach is the same as the one in
https://patchwork.kernel.org/patch/11023915/, which was

> #endif
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 190c45039359..ca1a04334c9e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3052,7 +3052,7 @@ static inline void lockdep_annotate_inode_mutex_key(struct inode *inode) { };
>  #endif
>  extern void unlock_new_inode(struct inode *);
>  extern void discard_new_inode(struct inode *);
> -extern unsigned int get_next_ino(void);
> +extern ino_t get_next_ino(void);
>  extern void evict_inodes(struct super_block *sb);
>
>  extern void __iget(struct inode * inode);
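
As an aside, the userspace pattern the commit message describes (keying cache
objects on (device, inodenum)) boils down to roughly the following. This is
only an illustrative sketch, not the actual cache application:

#include <stdio.h>
#include <sys/stat.h>

/*
 * Treat two paths as the same object iff (st_dev, st_ino) match.
 * Once a wrapped inode number gets reused, this can report two
 * genuinely different files as identical.
 */
static int same_object(const char *a, const char *b)
{
        struct stat sa, sb;

        if (stat(a, &sa) || stat(b, &sb))
                return -1;
        return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

int main(int argc, char **argv)
{
        if (argc != 3)
                return 1;
        printf("same object: %d\n", same_object(argv[1], argv[2]));
        return 0;
}

That false-match case is exactly the corruption mode reported above, so
widening the counter (or recycling numbers via an IDA) directly addresses it.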