Re: Provision for filesystem specific open flags

On Tue, 2017-11-14 at 17:35 +0000, Fu, Rodney wrote:
> > The filesystem can still choose to do that for O_DIRECT if it wants - look at
> > all the filesystems that have a "fall back to buffered IO because this is too
> > hard to implement in the direct IO path".
> 
> Yes, I agree that the filesystem can still decide to buffer IO even with
> O_DIRECT, but the application's intent is that the effects of caching are
> minimized.  Whereas with O_CONCURRENT_WRITE, the intent is to maximize caching.
> 
> > IOWs, you've got another set of custom userspace APIs that are needed to make
> > proper use of this open flag?
> 
> Yes and no.  Applications can make ioctls to the filesystem to query or set
> layout details but don't have to.  Directory level default layout attributes can
> be set up by an admin to meet the requirements of the application.
> 
> > > In panfs, a well behaved CONCURRENT_WRITE application will consider 
> > > the file's layout on storage.  Access from different machines will not 
> > > overlap within the same RAID stripe so as not to cause distributed 
> > > stripe lock contention.  Writes to the file that are page aligned can 
> > > be cached and the filesystem can aggregate multiple such writes before 
> > > writing out to storage.  Conversely, a CONCURRENT_WRITE application 
> > > that ends up colliding on the same stripe will see worse performance.  
> > > Non page aligned writes are treated by panfs as write-through and
> > > non-cacheable, as the filesystem will have to assume that the region of
> > > the page that is untouched by this machine might in fact be written to
> > > on another machine.  Caching such a page and writing it out later might
> > > lead to data corruption.
> >
> > That seems to fit the expected behaviour of O_DIRECT pretty damn closely - if
> > the app doesn't do correctly aligned and sized IO then performance is going to
> > suck, and if the app doesn't serialize access to the file correctly it can and
> > will corrupt data in the file....
> 
> I make the same case as above, that O_DIRECT and O_CONCURRENT_WRITE have
> opposite intents with respect to caching.  Our filesystem handles them
> differently, so we need to distinguish between the two.
> 
> > > The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the 
> > > application does not have to implement any caching to see good performance.
> >
> > Sure, but it has to be aware of layout and where/how it can write, which is
> > exactly the same constraints that local filesystems place on O_DIRECT access.
> > Not convinced. The use case fits pretty neatly into expected O_DIRECT semantics
> > and behaviour, IMO.
> 
> I'd like to make a slight adjustment to my proposal.  The HPC community had
> talked about extensions to POSIX to include O_LAZY as a way for filesystems to
> relax data coherency requirements.  There is code in the ceph filesystem that
> uses that flag if defined.  Can we get O_LAZY defined?
> 
> HEC POSIX extension:
> http://www.pdsw.org/pdsw06/resources/hec-posix-extensions-sc2006-workshop.pdf
> 
> Ceph usage of O_LAZY:
> https://github.com/ceph/ceph-client/blob/1e37f2f84680fa7f8394fd444b6928e334495ccc/net/ceph/ceph_fs.c#L78


O_LAZY support was removed from the cephfs userland client in 2013:

    commit 94afedf02d07ad4678222aa66289a74b87768810
    Author: Sage Weil <sage@xxxxxxxxxxx>
    Date:   Mon Jul 8 11:24:48 2013 -0700

        client: remove O_LAZY

Part of the problem (and this may just be my lack of understanding)
is that it's not clear what the O_LAZY semantics actually are. The ceph
sources have a text file with this in it:

"-- lazy i/o integrity

  FIXME: currently missing call to flag an Fd/file has lazy.  used to be
O_LAZY on open, but no more.

  * relax data coherency
  * writes may not be visible until lazyio_propagate, fsync, close

  lazyio_propagate(int fd, off_t offset, size_t count);
   * my writes are safe

  lazyio_synchronize(int fd, off_t offset, size_t count);
   * i will see everyone else's propagated writes"

That leaves two calls: lazyio_propagate and lazyio_synchronize. Those
seem like they could be implemented as ioctls if you don't care about
other filesystems.
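
To make the ioctl idea concrete, here is a rough sketch of what thin
userspace wrappers might look like. Everything filesystem-specific in it is
made up for illustration: the struct lazyio_range layout, the 'Z' magic and
the request numbers are assumptions, not an existing kernel or ceph ABI.
Only the two function signatures come from the text quoted above.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/types.h>

/* HYPOTHETICAL argument block; no kernel or filesystem defines this. */
struct lazyio_range {
	uint64_t offset;	/* start of the byte range */
	uint64_t count;		/* length of the byte range */
};

/* HYPOTHETICAL request codes; the 'Z' magic and numbers are invented. */
#define LAZYIO_IOC_PROPAGATE	_IOW('Z', 1, struct lazyio_range)
#define LAZYIO_IOC_SYNCHRONIZE	_IOW('Z', 2, struct lazyio_range)

/* Pack the range and hand it to the filesystem through ioctl(2). */
static int lazyio_call(int fd, unsigned long req, off_t offset, size_t count)
{
	struct lazyio_range r = {
		.offset = (uint64_t)offset,
		.count  = (uint64_t)count,
	};

	return ioctl(fd, req, &r);
}

/* "my writes are safe": push this client's writes in the range out. */
int lazyio_propagate(int fd, off_t offset, size_t count)
{
	return lazyio_call(fd, LAZYIO_IOC_PROPAGATE, offset, count);
}

/* "i will see everyone else's propagated writes" for the range. */
int lazyio_synchronize(int fd, off_t offset, size_t count)
{
	return lazyio_call(fd, LAZYIO_IOC_SYNCHRONIZE, offset, count);
}
```

On a filesystem that does not implement these requests, the calls would
simply fail with -1 and an errno such as ENOTTY, which is exactly the "you
don't care about other filesystems" trade-off.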

It is possible to add new open flags (we're running low, but that's a
problem we'll hit sooner or later anyway), but before we can do anything
here, O_LAZY needs to be defined in a way that makes sense for
application developers across filesystems.
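
Until such a definition exists, about the only thing portable code can do
is use the flag conditionally, in the "if defined" style mentioned above.
A minimal sketch, assuming the flag would be spelled O_LAZY and exposed
through <fcntl.h> (as far as I know, current Linux headers do not define
it); the helper name add_lazy_if_available is made up for this example:

```c
#include <fcntl.h>

/*
 * Return 'flags' with O_LAZY added when the headers provide it.
 * O_LAZY is an ASSUMED name here; when it is not defined this
 * function returns the flags unchanged.
 */
int add_lazy_if_available(int flags)
{
#ifdef O_LAZY
	flags |= O_LAZY;
#endif
	return flags;
}
```

A caller would pass the result straight to open(2), for example
open(path, add_lazy_if_available(O_RDWR)). Note that this only tells you
the flag exists at build time, not what a given filesystem does with it.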

How does this change behavior on ext4, xfs or btrfs, for instance? What
about nfs or cifs?

I suggest that before you even dive into writing patches for any of
this, you draft a small manpage update for open(2). What would an
O_LAZY entry look like?

-- 
Jeff Layton <jlayton@xxxxxxxxxx>