Re: RFC: [PATCH] staging/lustre/llite: fix O_TMPFILE/O_LOV_DELAY_CREATE conflict

On 2014/02/11, 2:13 AM, "Christoph Hellwig" <hch@xxxxxxxxxxxxx> wrote:
>On Mon, Feb 10, 2014 at 09:29:29PM +0000, Al Viro wrote:
>> I can live with that; it's a kludge, but it's less broken than that
>> explicit constant - that one is a non-starter, since O_... flag
>> values are arch-dependent.
>
>Grabbing their own O_FLAG is of course not acceptable at all.
>Personally I don't think this version is acceptable for real mainline
>either.  What exactly are the semantics of the flag?  Why don't you do
>object allocation on demand like all delalloc filesystems by default?

This was described in the original patch and follow-on email, but I'll
repeat it here, and expand the detail a bit further:

In kernel 3.11 O_TMPFILE was introduced, but the open flag value
conflicts with the O_LOV_DELAY_CREATE flag 020000000 previously used
by Lustre-aware applications.  O_LOV_DELAY_CREATE allows applications
to defer file layout and object creation from open time (the default)
until it can instead be specified by the application using an ioctl.
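
As a rough illustration of the conflict (a sketch only, not code from the
patch), the check below compares the old constant with the O_TMPFILE bits
that the local headers provide.  The macro name OLD_O_LOV_DELAY_CREATE and
both helper functions are made up for this example, and the fallback
O_TMPFILE definition is the asm-generic value, which is not correct on
every architecture.

```c
#define _GNU_SOURCE  /* glibc only exposes O_TMPFILE with this defined */
#include <fcntl.h>

/* Fallback for older headers: asm-generic value (x86, arm), an assumption. */
#ifndef O_TMPFILE
#define O_TMPFILE (020000000 | O_DIRECTORY)
#endif

/* Old value quoted in this mail; the macro name is illustrative only. */
#define OLD_O_LOV_DELAY_CREATE 020000000

/* O_TMPFILE minus O_DIRECTORY, i.e. the arch-specific tmpfile bit. */
int tmpfile_only_bits(void)
{
    return O_TMPFILE & ~O_DIRECTORY;
}

/* Returns 1 if flag shares a bit with the tmpfile-specific bits here. */
int clashes_with_tmpfile(int flag)
{
    return (flag & tmpfile_only_bits()) != 0;
}
```

Calling clashes_with_tmpfile(OLD_O_LOV_DELAY_CREATE) is expected to report
a clash on architectures that use the asm-generic flag values; elsewhere
the result can differ, which is the arch-dependence problem Al pointed out.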

The main goal of the O_LOV_DELAY_CREATE flag is to allow the file to be
opened in a "preliminary" manner to allow the application to specify the
layout of the file across the Lustre storage targets (e.g. whether the
app has millions of separate files each one written to a single server,
or there is a single huge file spread across all of the servers, or some
combination of the two, if it is RAID-0 or RAID-1, or whatever).


FYI, an "object" in Lustre is not a fixed-size chunk of space like
Ceph or HDFS that needs to be continuously allocated as a file grows,
but rather a variable-sized inode-without-a-name that is written at
arbitrary byte offsets and can be sparse, so there is no need for
the client and metadata server to communicate after the initial
file layout has been decided.

The Lustre object(s) are normally allocated by the metadata server at
open time to avoid RPC round-trips and lock contention for files opened
by large numbers of nodes at once.  The layout is normally specified by
the filesystem default, or on the parent directory, but some applications
need fine-grained control over the layout to optimize for a particular
filesystem configuration.

Instead of trying to find a non-conflicting O_LOV_DELAY_CREATE flag
or define a Lustre-specific flag that isn't of use to most/any other
filesystems, use (O_NOCTTY|FASYNC) as the new value.  These flags
are not meaningful for newly-created regular files and should be
OK since O_LOV_DELAY_CREATE is only meaningful for new files.
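
A minimal sketch of how the combined value could be tested, assuming the
definition proposed above.  The helper has_delay_create() is hypothetical
and is not taken from the Lustre sources; the point is only that both bits
must be set together, so either flag on its own keeps its usual meaning.

```c
#define _GNU_SOURCE  /* makes sure FASYNC is visible in glibc's <fcntl.h> */
#include <fcntl.h>

/* Proposed value from this mail, defined locally for illustration. */
#ifndef O_LOV_DELAY_CREATE
#define O_LOV_DELAY_CREATE (O_NOCTTY | FASYNC)
#endif

/* Hypothetical helper: 1 only when *both* bits are present in flags. */
int has_delay_create(int flags)
{
    return (flags & O_LOV_DELAY_CREATE) == O_LOV_DELAY_CREATE;
}
```

An application would then pass something like
O_CREAT | O_RDWR | O_LOV_DELAY_CREATE to open() and set the layout with
the ioctl afterwards; the ioctl itself is left out here.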

I looked into using O_ACCMODE/FMODE_WRITE_IOCTL, which allows calling
ioctl() on the minimally-opened fd and is close to what is needed,
but that doesn't allow specifying the actual read or write mode for
the file, and fcntl(F_SETFL) doesn't allow O_RDONLY/O_WRONLY/O_RDWR
to be set after the file is opened.
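
The F_SETFL limitation can be seen with a small userspace test.  This is
only a sketch: the function name accmode_after_setfl() is invented for the
example, and it opens read-only rather than with the ioctl-only access
mode, but the access-mode bits are ignored by F_SETFL in the same way.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Open path read-only, ask F_SETFL to switch to O_WRONLY, and return the
 * access mode the kernel reports afterwards (or -1 on error).
 */
int accmode_after_setfl(const char *path)
{
    int fd = open(path, O_RDONLY);
    int mode;

    if (fd < 0)
        return -1;

    /* F_SETFL only honors status flags such as O_APPEND or O_NONBLOCK. */
    if (fcntl(fd, F_SETFL, O_WRONLY) < 0) {
        close(fd);
        return -1;
    }

    mode = fcntl(fd, F_GETFL);
    close(fd);
    return mode < 0 ? -1 : (mode & O_ACCMODE);
}
```

On Linux the fcntl(F_SETFL) call succeeds but the fd stays read-only, so
the real access mode has to be given at open() time.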

We want to avoid needing many syscalls to do this, since each one
translates into extra RPCs, which is costly when creating potentially
millions of files over the network.



Cheers, Andreas
-- 
Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division

