Exporting a unique ino/dev pair to user space

Mark Fasheh <mfasheh@xxxxxxx> · Wed, 6 Jun 2018 23:38:07 +0200

Hi,

We have an inconsistency in how the kernel exports inode number /
device pairs to user space. There's of course stat(2) and statx(2),
but everywhere else we simply dump inode->i_ino and super->s_dev. In
some cases, the dumped values differ from what is returned via stat(2)
or statx(2). Some filesystems might even show duplicate (but
internally different!) pairs when the raw i_ino/s_dev is used.

Some examples where we dump raw ino/dev:

- /proc/<pid>/maps. I've written about how this confuses lsof(8):
  https://marc.info/?l=linux-btrfs&m=130074451403261&w=2

- Unsurprisingly, many VFS tracepoints dump ino and/or dev. See
  trace/events/lock.h or trace/events/writeback.h for examples.

- eventpoll also dumps the raw ino/dev pair via ep_show_fdinfo()

- Audit records the raw ino/dev and passes them around. We do seem to
  have paths printed from audit as well, but if it's printed with the
  wrong ino/dev pair I believe my point still stands.
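
To make the /proc/<pid>/maps case concrete, here is a rough user-space
sketch (Python, purely illustrative) that compares the dev/ino pair
printed in maps against what stat(2) returns for the same path. The
helper names parse_maps_line() and maps_mismatches() are made up for
this example, and it assumes the usual maps layout of
"address perms offset major:minor inode pathname". A mismatch can also
mean the path was replaced after the mapping was created, so treat the
output as a hint rather than proof.

```python
import os


def parse_maps_line(line):
    """Return (major, minor, ino, path) for one /proc/<pid>/maps line."""
    # Fields: address, perms, offset, dev (hex major:minor), inode, [pathname]
    fields = line.split(None, 5)
    major_hex, minor_hex = fields[3].split(":")
    ino = int(fields[4])
    path = fields[5].rstrip("\n") if len(fields) > 5 else ""
    return int(major_hex, 16), int(minor_hex, 16), ino, path


def maps_mismatches(pid="self"):
    """Yield file mappings whose raw dev/ino differ from the stat(2) view."""
    with open(f"/proc/{pid}/maps") as maps:
        for line in maps:
            major, minor, ino, path = parse_maps_line(line)
            if ino == 0 or not path.startswith("/"):
                continue  # anonymous or special mapping such as [heap]
            try:
                st = os.stat(path)
            except OSError:
                continue  # deleted or otherwise unreachable path
            seen = (os.major(st.st_dev), os.minor(st.st_dev), st.st_ino)
            if seen != (major, minor, ino):
                yield path, (major, minor, ino), seen
```

Running list(maps_mismatches()) in a process whose binaries live on an
affected filesystem should list the mappings whose raw pair disagrees
with stat(2).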

This breaks software which expects these pairs to be unique, and can
put the user in a situation where they might not be able to find an
inode referenced from the kernel. What's even worse - depending on how
ino is exported, they might even find the *wrong* inode.

I also should point out that we're likely at this point because
stat(2) has been using an unsigned long for ino. On a 32 bit system,
it would have been impossible for the user to get the real inode
number in the first place. So there probably wasn't much we could do.

These days though, we have statx(2) which will do the right thing on
all platforms so we no longer have that excuse. The user ought to be
able to take an ino/dev pair and ultimately find the actual file on
their system, partially with the help of statx(2).
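
As a sketch of what "find the actual file" means from user space, the
following walks a directory tree and returns every path whose reported
device and inode number match a given pair. It is only an illustration:
find_by_dev_ino() is a hypothetical helper, and it uses Python's
os.lstat() as a stand-in for statx(2). The search only works if the
pair handed in is the same one stat reports, which is exactly what the
raw i_ino/s_dev dumps do not guarantee.

```python
import os


def find_by_dev_ino(root, dev, ino):
    """Return every path under root whose lstat() result matches (dev, ino)."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # entry vanished or is not accessible
            if (st.st_dev, st.st_ino) == (dev, ino):
                matches.append(path)
    return matches
```

More than one path can come back for the same pair when the file has
several hard links.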

Some examples of how ino is manipulated by filesystems:

- On 64 bit, i_ino and kstat->ino tend to be filled in correctly (from
  what I can tell). stat->dev is not always consistent with super->s_dev.

- On 32 bit, many filesystems employ a transformation to squeeze a 64
  bit identifier into 32 bits. The exact methods are fs specific; what's
  important is that we're losing information, which introduces the
  possibility of duplicate inode numbers.

- On all platforms, Btrfs and Overlayfs can have duplicate inode
  numbers. The device can also differ within a single fs: the kernel
  reports s_dev, while these filesystems return a different device
  via stat(2) or statx(2).
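
To illustrate the 32 bit case from the list above: the sketch below
folds a 64 bit identifier into 32 bits by XORing the high half into the
low half. This particular fold is an assumption chosen for the example,
not the method of any specific filesystem, and fold_ino_to_32() is a
hypothetical name. Whatever the exact transformation, two distinct
64 bit identifiers can end up with the same 32 bit inode number.

```python
def fold_ino_to_32(ino64):
    """Hypothetical fold: XOR the high 32 bits into the low 32 bits."""
    return (ino64 ^ (ino64 >> 32)) & 0xFFFFFFFF


# Two different 64 bit identifiers...
a = 0x0000000100000002
b = 0x0000000200000001

# ...that fold to the same 32 bit value.
collide = fold_ino_to_32(a) == fold_ino_to_32(b)
```

Here collide is True even though a and b differ, so user space would
see one inode number for two distinct objects.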

Getting the inode number portion of this pair fixed would immediately
solve the situation for all filesystems except Btrfs and Overlayfs,
which also report a different device from stat(2).

Regarding the device portion of the pair, I'm honestly not sure
whether Overlayfs cares, and my attempts to fix the s_dev situation
for Btrfs have all come to the same dead ends that I've hit briefly
looking into this inode number issue - the problems are intrinsically
linked.

So my questions are:

1) Do we care about this? On one hand, we've had this inconsistency
   for a long time, for various reasons. On the other hand, I can point
   to bugzilla reports where these inconsistencies have become a problem.

   In the case that we don't care, any fs-internal solutions are
   likely to be extremely disruptive to the end user.

2) If we do care, how do we fix this?

 2a) Do we use 64 bit i_ino on 32 bit systems? This has some obvious
     downsides from a memory usage standpoint. Also it doesn't fully
     address the issue - we still have a device field that Btrfs and
     Overlayfs override.

     We could combine this with an intermediate structure between the
     inode and super block so s_dev could be abstracted out. See my
     fs_view patches for an example of how this could be done:
     https://www.spinics.net/lists/linux-fsdevel/msg125492.html

 2b) Do we call ->getattr for this information? Plumbing a vfsmount to
     various regions of the kernel like audit, or some of the deeper
     tracepoints, looks ugly and prone to lifetime issues (which
     vfsmount do we use?). On the upside, we could probably make it
     really light by only sending down the STATX_INO flag and letting
     the filesystem optimize accordingly.

 2c) I don't think we can really use a dedicated callback without
     passing the vfsmount through since Overlayfs ->getattr might call
     the lower fs ->getattr. At that point we might as well use getattr.

I'd appreciate any comments or suggestions.

Thanks,
  --Mark