Ceph's ino32 mount option has trivial collisions.

The hash is `ceph_ino_to_ino32` here:
https://github.com/torvalds/linux/blob/master/fs/ceph/super.h#L438

A simple collision can be demonstrated:

def ceph_ino_to_ino32(vino):
    ino = vino & 0xffffffff
    ino ^= vino >> 32
    if not ino:
        ino = 2
    return ino

print(ceph_ino_to_ino32(0x10000000301))  # 0x302'nd inode on mds.0
print(ceph_ino_to_ino32(0x20000000001))  # 2nd inode on mds.1

Both 0x10000000301 and 0x20000000001 hash to ino32=513, so collisions
are very likely when using multiple active MDSs.

So I wondered: if we pin the mount prefix to a single mds, then maybe
collisions are less likely? It seems so, but I still found exactly one
collision in the range (1<<40) to (1<<40)+(1<<25): 0x10000000102 and
0x10000000100 both hash to ino32=2 (a sketch of that scan is below).

Since collisions are inevitable -- are they handled in some sane/safe
way on the mds side? If not -- maybe we should improve or remove the
ino32 kernel option?
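A minimal sketch of such a brute-force scan looks roughly like this
(just an illustration, not necessarily the exact script used;
find_ino32_collisions is only an ad-hoc helper name, not anything from
the kernel or Ceph; the hash is redefined so the snippet runs on its
own):

def ceph_ino_to_ino32(vino):
    # Same hash as the kernel function quoted above.
    ino = vino & 0xffffffff
    ino ^= vino >> 32
    if not ino:
        ino = 2
    return ino

def find_ino32_collisions(start, count):
    seen = {}  # ino32 -> first 64-bit inode number that produced it
    collisions = []
    for vino in range(start, start + count):
        ino32 = ceph_ino_to_ino32(vino)
        if ino32 in seen:
            collisions.append((seen[ino32], vino, ino32))
        else:
            seen[ino32] = vino
    return collisions

# Scan the first 1<<25 inode numbers of the mds.0 range (from 1<<40).
# Note: this keeps ~33M dict entries in memory -- it's only a sketch.
for a, b, ino32 in find_ino32_collisions(1 << 40, 1 << 25):
    print(hex(a), hex(b), "-> ino32 =", ino32)
# prints: 0x10000000100 0x10000000102 -> ino32 = 2

The single collision in that pinned range comes from the "if not ino"
special case: 0x10000000100 hashes to 0, which is remapped to 2 and
then collides with 0x10000000102.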
Cheers, Dan

On Wed, Oct 16, 2019 at 9:48 AM Ingo Schmidt <i.schmidt@xxxxxxxxxxx> wrote:
>
> This is not quite true. The number space of MD5 is much greater than
> 2³² (2¹²⁸, exactly), and as long as you don't exhaust this number
> space, a collision is roughly as likely as with any other input.
> There might be collisions, and the more data you have, i.e. the more
> addresses you use, the higher the probability.
> Security researchers have shown that it is possible to create
> collisions, but it is very rare.
>
> I cannot give you an estimate of the consequences of a collision
> though. It's a matter of what data is stored at that address and how
> programs/OSes and even Ceph deal with this. I would suspect Ceph
> would find a checksum mismatch upon scrubbing. But I don't know how,
> or if, Ceph could or would correct this, as the two addresses with
> the same MD5 sum have equally valid copies, and I think in such a
> case it is undecidable which data is correct.
>
> Greetings
> Ingo
>
> ----- Original Message -----
> From: "Nathan Fish" <lordcirth@xxxxxxxxx>
> To: "ceph-users" <ceph-users@xxxxxxx>
> Sent: Tuesday, 15 October 2019 19:40:05
> Subject: Re: CephFS and 32-bit Inode Numbers
>
> I'm not sure exactly what would happen on an inode collision, but I'm
> guessing Bad Things. If my math is correct, a 2^32 inode space will
> have roughly 1 collision per 2^16 entries. As that's only 65536,
> that's not safe at all.
>
> On Mon, Oct 14, 2019 at 8:14 AM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> >
> > OK I found that the kernel has an "ino32" mount option which hashes
> > 64-bit inos to 32-bit space.
> > Has anyone tried this?
> > What happens if two files collide?
> >
> > -- Dan
> >
> > On Mon, Oct 14, 2019 at 1:18 PM Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> > >
> > > Hi all,
> > >
> > > One of our users has some 32-bit commercial software that they
> > > want to use with CephFS, but it's not working because our inode
> > > numbers are too large. E.g. his application gets a "file too big"
> > > error trying to stat inode 0x40008445FB3.
> > >
> > > I'm aware that CephFS offsets the inode numbers by
> > > (mds_rank + 1) * 2^40; in the case above the file is managed by
> > > mds.3.
> > >
> > > Did anyone see this same issue and find a workaround? (I read
> > > that GlusterFS has an enable-ino32 client option -- does CephFS
> > > have something like that planned?)
> > >
> > > Thanks!
> > >
> > > Dan