RFC: Stable inodes for inode-less filesystems (was: Finding hardlinks)

Bodo Eggert <7eggert@xxxxxx> · Fri, 05 Jan 2007 23:36:54 +0100

Pavel Machek <pavel@xxxxxx> wrote:

>> Another idea is to export the filesystem internal ID as an arbitray
>> length cookie through the extended attribute interface.  That could be
>> stored/compared by the filesystem quite efficiently.
> 
> How will that work for FAT?

> Or maybe we can relax that "inode may not change over rename" and
> "zero length files need unique inode numbers"...

I didn't look into the code, and I'm not experienced in writing (linux)
fs, but I have an Idea I'd like to share. Maybe it's not that bad ...

(I'm going to type about inode numbers, since having constant inodes
 is desired and the extended attribute would only be an aid if the
 inode is too small.)

IIRC, no cluster is reserved for empty files on FAT; if I'm wrong, it'll
be much easier, you would just use the cluster-number (less than 32 bit).

The basic idea is to use a different inode range for non-empty and empty
files. This will result in the inode possibly changing after close()* or
on rename(empty1, empty2). OTOH it will keep a stable inode for non-empty
files and for newly written files** if they aren't stat()ed before writing
the first byte. I'm not sure if it's better than changing inodes after
$randomtime, but I just made a quick strace on gtar, rsync and cp -a;
they don't look at the dest inode before it would change (or at all).

(If this idea is applied to iso9660, the hard problem will be finding the
 number of hardlinked files for one location)

Changing the inode# on the last close* can be done by purging the cache
if the file is empty XOR the file has an inode# from the empty-range.
(That should be the same operation as done by unlink()?)
A new open(), stat() or readdir should create the correct kind of inode#.

*) You can optionally just wait for the inode to expire, but you need to
   keep the associated reservation(s) until then. I don't expect any
   visible effect from doing this, but see ** from the next paragraph
   on how to minimize the effect. The reserved directory entry (see far
   below in this text) is 32 Bytes, but the fragmentation may be bad.
**) which are empty on open() and therefore don't yet have a stable inode#
   Those inode numbers will apear to be stable because nobody saw them
   change. It's safe to change the inode# because by reserving disk space,
   we got a unique inode#. I hope the kernel side allows this ...

For non-empty files, you can use the cluster-number (start address), it's
unique, and won't change on rename. It will, however, change on emptying
or writing to an empty file. If you write to an empty file, no special
handling is required, all reservations are implicit*. If you empty a file,
you'll have to keep the first cluster reserved** untill it's closed,
otherwise you'd risk an inode collision.

*) since the inode# doesn't change yet, you'll still have to treat it like
   an empty file while unlinking or renaming.
**) It's OK to reuse it if it's in the middle of a file, so you may
    optionally keep a list of these clusters and not start files there
    instead of reserving the space. OTOH, it's more work.

Empty files will be a PITA with 32-bit-inodes, since a full-sized FAT32 can
have about 2^38 empty files*. (The extended attribute would work as described
below.) You can, however, generate inode numbers for empty files, risking
collisions. This requires all generated inode numbers to be above 0x40000000
(or above the number of clusters on disk).

*) 8 TB divided by 32 B / directory entry

With 64-bit-values, you can generate an unique inode for empty files
using cluster#-of-dir | 0x80000000 | index_in_dir << 32. The downside
is, it will change on cross-directory-renames and may change on in-
directory-renames. If this happens to an open file, you'll need to
make sure the old inode# is not reused by reserving that directory
entry, since the inode# can't change for open files.

extra operations on final close:
if "empty" inode:
 if !empty
  unreserve_directory_entry(inode & 0x7fffffff, inode >> 32)
  uncache inode (will change inode#)
  stop
 if unreserve_directory_entry(inode & 0x7fffffff, inode >> 32)
  uncache inode
if "non-empty" inode
 if empty
  free start cluster
  uncache inode

extra operations on unlink/rename:
if "empty" inode:
 if can_use_current_inode#_for_dest
  do it
  unreserve_directory_entry(inode & 0x7fffffff, inode >> 32)
  // because of "mv a/empty b/empty; mv b/empty a/empty"
 else if is_open_file
  // the current inode won't fit the new location:
  reserve_directory_entry(old_inode & 0x7fffffff, inode >> 32)

extra operations on truncate
if "non-empty" inode && empty_after_truncate
 exempt start cluster from being freed,
 or put it on a list of non-startclusters

extra operation on extend
if "empty" inode && nobody did e.g. stat() after opening this file
 silently change inode, nobody will notice. Racy? Possible?

Required data in filehandle:
 Location of directory entry (d.e. contains inode information)
  (this shouldn't be new data?)
 stat-flag (possibly one per filesystem is enough)
Directories additionally:
 Pointer to a list of reserved directory entries (usurally NULL)

Required data in Superblock:
 ->Optional List of non-startclusters

-- 
Ich danke GMX dafür, die Verwendung meiner Adressen mittels per SPF
verbreiteten Lügen zu sabotieren.

http://david.woodhou.se/why-not-spf.html
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html