Re: [PATCH 0/4] Fiemap, an extent mapping ioctl - round 2

Andreas Dilger wrote:
> On Jul 03, 2008  16:17 +0100, Jamie Lokier wrote:
> > jim owens wrote:
> > >   FIEMAP_EXTENT_NO_BYPASS
> > > 
> > > As in "you can't bypass the filesystem" to directly access it.
> > 
> > Can we also commit to this, when FIEMAP_EXTENT_NO_BYPASS is *not* set:
> > 
> > >    1. The data is at fe_physical, and *will not move* so long as
> > >       nothing modifies *that particular file*?
> > 
> >    2. Both reading *and writing* the file bypassing the filesystem are ok.
> 
> I don't think any such guarantee can be made.  What if the file is
> truncated and rewritten after the FIEMAP is called?

That is prohibited by "so long as nothing modifies that particular file".
That's the entire point of 1! :-)

> The filesystem can't guarantee that will not happen.

The filesystem's guarantee has to be _conditional_ on nothing _else_
modifying the file.  That includes writing, truncating, and extending.
It's not the filesystem's job to prevent those things.

What I'm saying is that some filesystems will move data blocks _even
when no process touches the file containing those blocks_.  E.g. some
filesystems do garbage collection in the background - even when
nothing touches any file.  Some filesystems clone data blocks for COW.
There are many imaginable other reasons.

Clearly, any program that "gets away with it" by using FIEMAP to get a
block map and then accessing the disk directly, is less reliable with
those filesystems.  It would be good to reflect that somehow.

The obvious way to my mind is for those filesystems which don't have
stable data positions, when a file is not being modified, to set the
flag which says "this extent should not be accessed directly"
(whatever it is called :-).

> I think the only way to make sure of constant mapping is to call
> FIEMAP before and after the blocks are read.

No, that is clearly unsafe.  They can change twice, ending up back at
the same positions, but different in between.  That's even likely,
with some modern filesystem techniques.
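
To make the "changes twice" case concrete, here is a small simulation.
It is only a sketch: struct ext is an illustrative stand-in rather than
the real FIEMAP extent structure, and the block numbers are made up.
It shows a before/after comparison passing even though the block was
somewhere else while the read was in progress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for one mapped extent (not the kernel struct). */
struct ext {
    uint64_t logical;
    uint64_t physical;
    uint64_t length;
};

/* Compare two snapshots of a block map, extent by extent. */
bool same_mapping(const struct ext *a, const struct ext *b, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        if (a[i].logical != b[i].logical ||
            a[i].physical != b[i].physical ||
            a[i].length != b[i].length)
            return false;
    }
    return true;
}

/*
 * Simulated timeline with made-up positions: the data starts at 1000,
 * is moved to 5000 (say, by background garbage collection) while we
 * read from 1000, and is then moved back to 1000.
 * Returns true when the before/after check passes although the
 * mapping seen during the read differed from the one we relied on.
 */
bool before_after_check_is_fooled(void)
{
    struct ext before = { 0, 1000, 4096 };
    struct ext during = { 0, 5000, 4096 };
    struct ext after  = { 0, 1000, 4096 };

    bool check_passes   = same_mapping(&before, &after, 1);
    bool read_was_valid = same_mapping(&before, &during, 1);

    return check_passes && !read_was_valid;
}
```

Comparing the two snapshots only tells you about the two instants at
which they were taken, not about the interval between them.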

> > The reason for 2 is that some filesystems checksum the data and/or
> > replicate it, and the file won't be readable if you write to it directly.
> 
> EEEEEK.  The _intent_ of FIEMAP is mostly for reporting fragmentation,
> and possibly to allow a "generic" defragmenter to be written.  At an
> outside stretch I could imagine some tools like "dump" wanting direct
> read access to the file data.

Potentially useful other cases are providing good information to
assist access patterns and block allocation for things like databases,
filesystems-in-a-file, and virtual-disks-in-a-non-flat-file.  Those
are all variations on reporting fragmentation, and don't require the
information to be absolutely stable or correct.
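
As an example of that advisory kind of use, a fragmentation report
only needs to count physical discontinuities in whatever extent list
it was given.  A minimal sketch, assuming the extents are sorted by
logical offset; struct extent_info and count_fragments() are
hypothetical names, not part of the proposed interface:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one extent. */
struct extent_info {
    uint64_t logical;   /* offset within the file, in bytes */
    uint64_t physical;  /* offset on the device, in bytes */
    uint64_t length;    /* length in bytes */
};

/*
 * Count physically contiguous runs.  An extent starts a new fragment
 * when it does not begin exactly where the previous one ended on disk.
 * Returns 0 for an empty list.
 */
size_t count_fragments(const struct extent_info *e, size_t n)
{
    if (n == 0)
        return 0;
    assert(e != NULL);

    size_t fragments = 1;
    for (size_t i = 1; i < n; i++) {
        if (e[i].physical != e[i - 1].physical + e[i - 1].length)
            fragments++;
    }
    return fragments;
}
```

If the map changes under it, the worst outcome is a slightly stale
count, which is why this kind of use needs no stability guarantee.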

> Directly writing underneath a filesystem is major bad news and will
> likely corrupt the filesystem because you can never be sure that there
> aren't dirty pages in the page cache that will overwrite your "direct"
> write, or that your write isn't racy with an unlink or truncate.

You're right.  It's a fair point and should be clarified, because I
hadn't thought of it ;-)

Btw, you can be sure there aren't dirty pages, if you have done
fsync() or sync_file_range() at some time in the past, and you are
_sure_ no other process is accessing the file.  (Otoh, I'm not sure if
some funky COW implementations would complicate that.)

However, that still leaves a gaping lack of coherency in that the
filesystem may have clean cached pages not matching what is written to
disk.  So, you're absolutely right: NO WRITING.

You must do fsync() anyway, and ensure nobody is modifying the file,
if you're going to read correct data from FIEMAP blocks.

Ok, then I'll remove point 2 and add these:

    - FIEMAP extents are _not_ safe for writing data directly!
      Page cache coherency affects all filesystems.  Checksums and
      replication are also involved with some filesystems.  All
      writing should go through the filesystem itself.

    - If reading data directly, do fsync() before FIEMAP, and be
      absolutely sure no process modifies the file between
      fsync+FIEMAP and reading the blocks, and that the
      FIEMAP_EXTENT_NO_DIRECT flag is not set.  It is the
      application's responsibility to ensure no other process modifies
      the file.
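
A rough sketch of the read-side rules above.  It assumes the struct
fiemap / FS_IOC_FIEMAP interface as found in <linux/fiemap.h> and
<linux/fs.h> on current kernels, which may differ from the patches
under discussion.  The "do not access directly" flag is not settled in
this thread, so the caller passes it in as no_direct_mask; that mask
and the function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>      /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>  /* struct fiemap, struct fiemap_extent */

/*
 * Returns 1 if the extent carries none of the flags in no_direct_mask
 * (a hypothetical, caller-supplied mask) and is non-empty.
 */
int extent_ok_for_direct_read(const struct fiemap_extent *e,
                              uint32_t no_direct_mask)
{
    assert(e != NULL);
    return e->fe_length != 0 && (e->fe_flags & no_direct_mask) == 0;
}

/*
 * fsync() the file first, then map up to max_extents extents covering
 * the whole file.  Returns a malloc'd struct fiemap that the caller
 * must free(), or NULL with errno set.
 */
struct fiemap *sync_and_map(int fd, uint32_t max_extents)
{
    struct fiemap *fm;
    size_t size = sizeof(*fm) +
                  (size_t)max_extents * sizeof(struct fiemap_extent);

    if (fsync(fd) != 0)
        return NULL;

    fm = calloc(1, size);
    if (fm == NULL)
        return NULL;

    fm->fm_start = 0;
    fm->fm_length = FIEMAP_MAX_OFFSET;
    fm->fm_extent_count = max_extents;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) != 0) {
        int saved = errno;
        free(fm);
        errno = saved;
        return NULL;
    }
    return fm;
}

/* Returns 1 only if every mapped extent passes the check above. */
int all_extents_ok(const struct fiemap *fm, uint32_t no_direct_mask)
{
    assert(fm != NULL);
    for (uint32_t i = 0; i < fm->fm_mapped_extents; i++) {
        if (!extent_ok_for_direct_read(&fm->fm_extents[i], no_direct_mask))
            return 0;
    }
    return 1;
}
```

Nothing here stops another process from modifying the file between
sync_and_map() and the direct read; that part stays the application's
responsibility.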

-- Jamie