Re: [PATCH 1/5] vfs: vfs-level fiemap interface

Andreas Dilger wrote:

> So, I think we need another __u64 in the fiemap_extent which is
> fe_loglength, and rename fe_length to fe_physlength.

As some people guessed, my earlier post [PATCH 0/5]:
> experience with a non-linux filesystem that has a similar API
refers to AdvFS on Tru64.

I was asked to provide more information about Tru64's equivalent
to fiemap.  I believe the person who asked wanted to get closure
on the fiemap definition, but I'm probably going to just throw
more gasoline around and light the match with this :)

My earlier post said how I thought our code worked, and, as
usual, when I describe something without looking at the code and
then actually go look at it, I find it does something else and
I'm saying "damn, I didn't think it was that ugly".

Well, it probably is ugly, and it is really 4 different interfaces,
but after thinking about it I realized the 4 interface designs are
defensible on KISS grounds as optimal for their intended use.
Here is the 10-year-old "most used API" for userspace code:

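#define F_GETMAP        21      /* retrieve a file's sparseness map */

struct extentmapentry {
        unsigned long offset;   /* byte offset in file */
        unsigned long size;     /* bytes in extent */
};

struct extentmap {
        unsigned long arraysize;   /* in: max entries in *extent; out: extents left */
        unsigned long numextents;  /* out: entries filled in *extent */
        unsigned long offset;      /* in: data extents to skip (NOT a byte offset) */
        struct extentmapentry *extent;  /* caller-supplied output array */
};

fcntl(fileno, F_GETMAP, &extentmap);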

Backup/dump tools call this fcntl() to retrieve the sparseness map
of an AdvFS or UFS file.  NFS and CD filesystems return an error.

Its intent is to return the LOGICAL extent map of a file, without
regard to the physical extent map of the file.  Multiple extents
will only be returned if the file in question is a sparse file.
All logically contiguous extents will be collapsed into a single
extent. FYI, "longs" are 64 bits on Tru64.
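To make the collapsing rule concrete, here is a small C sketch. It is not AdvFS code: it assumes a hypothetical internal list of allocated extents (`struct raw_extent`), sorted by file offset and possibly scattered on disk, and merges every run that is contiguous in the file's byte space into one entry, so only holes split the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical internal extent record: a logical range of the file
 * plus where it happens to live on disk. */
struct raw_extent {
    uint64_t offset;    /* byte offset in file */
    uint64_t size;      /* bytes in extent */
    uint64_t physical;  /* byte offset on device; ignored by the logical map */
};

/* Shaped like Tru64's extentmapentry, using explicit 64-bit fields
 * (longs are 64 bits on Tru64). */
struct logical_extent {
    uint64_t offset;
    uint64_t size;
};

/* Collapse raw extents (sorted by offset, non-overlapping) into logical
 * extents. Neighbours whose byte ranges touch are merged no matter where
 * they sit physically, so only holes start a new entry.
 * `out` must have room for at least `n` entries.
 * Returns the number of logical extents written. */
size_t collapse_extents(const struct raw_extent *in, size_t n,
                        struct logical_extent *out)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (count > 0 &&
            out[count - 1].offset + out[count - 1].size == in[i].offset) {
            /* Logically contiguous: extend the previous entry. */
            out[count - 1].size += in[i].size;
        } else {
            /* First extent, or a hole precedes this one: new entry. */
            assert(count == 0 ||
                   out[count - 1].offset + out[count - 1].size < in[i].offset);
            out[count].offset = in[i].offset;
            out[count].size = in[i].size;
            count++;
        }
    }
    return count;
}
```

For example, extents covering bytes 0..4095 and 4096..8191 at unrelated disk addresses come back as a single 0..8191 entry, while an extent starting at 16384 begins a new entry because a hole precedes it.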

extentmapentry.offset is the byte offset in the file and
extentmapentry.size is the number of bytes in the extent.  Only
allocated data extents are returned, so there is no need for an
"extent type".  The extentmapentry is designed to be small so that
minimum memory is required when the file is highly sparse-fragmented.

extentmap.arraysize is really max_extents on input; on output it is
"how_many_more", the number of extents that remain after the
"numextents" (out) entries returned in the *extent output array.
The (so ugly) part I forgot is that extentmap.offset is NOT a
"starting byte in file"; it is "skip over this many data extents".
That is not an intuitive API, but then I realized it is precisely
the best fit for a backup program.

The backup always reads the complete file from 0..filesize, it
wants to duplicate sparse as sparse (or at least not read the holes
from the disk), and it needs to use a reasonably sized extent array,
so it walks forward in a loop, e.g. getting 4 extents per call
(0..3, 4..7, 8..11).  So using extentmap.offset as an index into the
file's logical map makes sense, and you don't need to worry about a
start-at-byte-in-file not being an extent start.
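The in/out rules for arraysize, numextents and offset, and the backup walk that relies on them, can be modelled in user space as below. This is a reading of the semantics described above, not Tru64 source: `getmap_model()` is a hypothetical in-memory stand-in for `fcntl(fd, F_GETMAP, &em)`, and `backup_walk()` and `BATCH` are illustrative names. The structure layouts match the Tru64 header quoted earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Same layout as the Tru64 structures quoted above. */
struct extentmapentry {
    unsigned long offset;     /* byte offset in file */
    unsigned long size;       /* bytes in extent */
};

struct extentmap {
    unsigned long arraysize;  /* in: capacity of *extent; out: extents still left */
    unsigned long numextents; /* out: entries filled in *extent */
    unsigned long offset;     /* in: data extents to skip (an index, not bytes) */
    struct extentmapentry *extent;
};

/* Hypothetical stand-in for fcntl(fd, F_GETMAP, &em): serves the file's
 * logical map (`total` entries) from memory, following the in/out rules. */
static void getmap_model(const struct extentmapentry *map, unsigned long total,
                         struct extentmap *em)
{
    unsigned long avail = em->offset < total ? total - em->offset : 0;
    unsigned long n = avail < em->arraysize ? avail : em->arraysize;

    for (unsigned long i = 0; i < n; i++)
        em->extent[i] = map[em->offset + i];
    em->numextents = n;
    em->arraysize = avail - n;            /* "how_many_more" */
}

#define BATCH 4   /* a reasonably sized extent array: 0..3, 4..7, ... */

/* Walk the file from 0..filesize the way a backup would, fetching BATCH
 * extents per call and advancing em.offset by the extents already seen.
 * Returns the data bytes that must be read; *hole_bytes receives the bytes
 * that can be recreated as sparse, *calls the number of getmap calls. */
unsigned long backup_walk(const struct extentmapentry *map, unsigned long total,
                          unsigned long filesize, unsigned long *hole_bytes,
                          unsigned long *calls)
{
    struct extentmapentry buf[BATCH];
    struct extentmap em;
    unsigned long seen = 0, pos = 0, data = 0;

    *hole_bytes = 0;
    *calls = 0;
    do {
        em.arraysize = BATCH;
        em.offset = seen;                 /* skip extents already processed */
        em.extent = buf;
        getmap_model(map, total, &em);
        (*calls)++;
        for (unsigned long i = 0; i < em.numextents; i++) {
            assert(buf[i].offset >= pos);         /* logical map is sorted */
            *hole_bytes += buf[i].offset - pos;   /* gap before this extent */
            data += buf[i].size;                  /* bytes to read and copy */
            pos = buf[i].offset + buf[i].size;
        }
        seen += em.numextents;
    } while (em.arraysize > 0 && em.numextents > 0);

    if (filesize > pos)
        *hole_bytes += filesize - pos;    /* trailing hole */
    return data;
}
```

With six 10-byte extents spaced 10 bytes apart, a batch size of 4 takes two calls (indices 0..3, then 4..5); the loop never needs a byte offset to resume, only the count of extents already seen. The `numextents > 0` check guards against looping forever if an implementation ever reported remaining extents without returning any.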

A program that wanted to optimize random reads to a sparse file
could do it using this api though not as easily as if it had
the start_byte input parameter.
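One way such a program could still do it: walk the whole logical map once with the index-based loop, keep it in memory, and then binary-search it for each random read. The sketch below assumes that map has already been collected and is sorted by offset; `find_extent()` is a hypothetical helper, not part of any Tru64 API.

```c
#include <assert.h>
#include <stddef.h>

struct extentmapentry {
    unsigned long offset;   /* byte offset in file */
    unsigned long size;     /* bytes in extent */
};

/* Find the extent in a sorted logical map that contains byte `pos`.
 * Returns its index, or -1 if `pos` lies in a hole (or past the last
 * extent), in which case the reader can supply zeros without any I/O. */
long find_extent(const struct extentmapentry *map, size_t n, unsigned long pos)
{
    size_t lo = 0, hi = n;          /* search the half-open range [lo, hi) */

    assert(n == 0 || map != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pos < map[mid].offset)
            hi = mid;                               /* target is left of mid */
        else if (pos >= map[mid].offset + map[mid].size)
            lo = mid + 1;                           /* target is right of mid */
        else
            return (long)mid;                       /* offset <= pos < end */
    }
    return -1;
}
```

The extra cost compared with a start_byte input parameter is the up-front walk plus the memory to hold the full map; after that, each lookup is O(log n).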

I'm not going to bore you with the other 3 interfaces that are
only supported in AdvFS to retrieve RAW extent maps for the
cluster and filesystem administrative tools.  These are the only
interfaces that return extent device allocation, because normal
applications, including backup, need to do their data access
through the filesystem.  The bottom line is that information
that is filesystem-specific is only really valuable to tools that
are filesystem-specific.

=== LIGHT THE MATCH ===

- I don't want Linux to implement Tru64 F_GETMAP for fiemap!

- The lesson is that a simple design covers the major use and
  other complicated needs are done somewhere else.

- I have talked to Mark and he has tools waiting to use the
  features he originally designed into fiemap... but every
  day there is a new flag or return field added "just in case".

- I know "memory is cheap", but we still seem to run out of it
  so expanding every return structure for data that may only be
  useful to a specific filesystem seems like a bad idea.

- A simplified filesystem-independent version plus a separate
  complex-as-you-want filesystem-dependent API might be better,
  since, for example:

  * We can't even agree on what "device" is.
  * What good is "encrypted" or "compressed" without "how"?
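To put a rough number on the memory argument, the sketch below compares the result-buffer size for Tru64-style two-field entries with a hypothetical wider entry carrying the kinds of fields being debated here (separate logical and physical lengths, a device, flags). The `wide_entry` layout is purely illustrative and is not the fiemap_extent in this patch series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two 64-bit fields, like Tru64's extentmapentry. */
struct small_entry {
    uint64_t offset;
    uint64_t size;
};

/* Hypothetical "just in case" entry; field names are illustrative only. */
struct wide_entry {
    uint64_t fe_logical;
    uint64_t fe_physical;
    uint64_t fe_loglength;
    uint64_t fe_physlength;
    uint32_t fe_device;
    uint32_t fe_flags;
};

/* Bytes of result buffer needed to return `extents` entries of
 * `entry_size` bytes each. */
uint64_t map_bytes(uint64_t extents, size_t entry_size)
{
    /* Guard against overflow of the multiplication. */
    assert(entry_size == 0 || extents <= UINT64_MAX / entry_size);
    return extents * (uint64_t)entry_size;
}
```

On common ABIs that is 16 versus 40 bytes per extent, so returning a million extents needs 16,000,000 versus 40,000,000 bytes of buffer; every field added "just in case" is paid for on every extent of every call.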

=== THROW ON MORE GASOLINE ===

Subject:    [RFC] add FIEMAP ioctl to efficiently map file allocation
From:       Andreas Dilger <adilger () clusterfs ! com>
Date:       2007-04-12 11:05:50
Message-ID: 20070412110550.GM5967 () schatzie ! adilger ! int

we additionally need to get the mapping over the network so it needs to
be efficient in terms of how data is passed, and how easily it can be
extracted from the filesystem.

jim