On Dec 3, 2017, at 10:29 PM, NeilBrown <neilb@xxxxxxxx> wrote:
>
> On Tue, Nov 14 2017, Fu, Rodney wrote:
>
>>> The filesystem can still choose to do that for O_DIRECT if it wants - look at
>>> all the filesystems that have a "fall back to buffered IO because this is too
>>> hard to implement in the direct IO path".
>>
>> Yes, I agree that the filesystem can still decide to buffer IO even with
>> O_DIRECT, but the application's intent is that the effects of caching are
>> minimized.  Whereas with O_CONCURRENT_WRITE, the intent is to maximize caching.
>>
>>> IOWs, you've got another set of custom userspace APIs that are needed to make
>>> proper use of this open flag?
>>
>> Yes and no.  Applications can make ioctls to the filesystem to query or set
>> layout details but don't have to.  Directory-level default layout attributes
>> can be set up by an admin to meet the requirements of the application.
>>
>>>> In panfs, a well-behaved CONCURRENT_WRITE application will consider
>>>> the file's layout on storage.  Access from different machines will not
>>>> overlap within the same RAID stripe so as not to cause distributed
>>>> stripe lock contention.  Writes to the file that are page aligned can
>>>> be cached and the filesystem can aggregate multiple such writes before
>>>> writing out to storage.  Conversely, a CONCURRENT_WRITE application
>>>> that ends up colliding on the same stripe will see worse performance.
>>>> Non-page-aligned writes are treated by panfs as write-through and
>>>> non-cacheable, as the filesystem has to assume that the region of
>>>> the page that is untouched by this machine might in fact be written to
>>>> on another machine.  Caching such a page and writing it out later might
>>>> lead to data corruption.
>>
>>> That seems to fit the expected behaviour of O_DIRECT pretty damn closely - if
>>> the app doesn't do correctly aligned and sized IO then performance is going to
>>> suck, and if the app doesn't serialize access to the file correctly it can and
>>> will corrupt data in the file....
>>
>> I make the same case as above, that O_DIRECT and O_CONCURRENT_WRITE have
>> opposite intents with respect to caching.  Our filesystem handles them
>> differently, so we need to distinguish between the two.
>>
>>>> The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the
>>>> application does not have to implement any caching to see good performance.
>>
>>> Sure, but it has to be aware of layout and where/how it can write, which is
>>> exactly the same constraints that local filesystems place on O_DIRECT access.
>>
>>> Not convinced.  The use case fits pretty neatly into expected O_DIRECT
>>> semantics and behaviour, IMO.
>>
>> I'd like to make a slight adjustment to my proposal.  The HPC community has
>> talked about extensions to POSIX to include O_LAZY as a way for filesystems to
>> relax data coherency requirements.  There is code in the ceph filesystem that
>> uses that flag if defined.  Can we get O_LAZY defined?
>
> This O_LAZY sounds exactly like what NFS has always done.
> If different clients do page-aligned writes and have their own protocol
> to keep track of who owns which page, then everything is fine and
> write-back caching does good things.
> If different clients use byte-range locks, then write-back caching
> is curtailed a bit, but clients don't need to be so careful.
> If clients do non-aligned writes without locking, then corruption can
> result.
> So:
>   #define O_LAZY 0
> and NFS already has it implemented :-)
>
> For NFS, we have O_SYNC, which tries to provide cache coherency as strong
> as other filesystems provide without it.
>
> Do we really want O_LAZY?  Or are other filesystems trying too hard to
> provide coherency when apps don't use locks?

Well, POSIX requires correct read-after-write behaviour regardless of
whether applications are being careful or not.  As you wrote above, "If
clients do non-aligned writes without locking, then corruption can result,"
and there definitely are apps that expect the filesystem to work correctly
even at very large scales.

I think O_LAZY would be reasonable to add, as long as that is what
applications are asking for, but we can't just break long-standing data
correctness behind their backs because it would go faster, and without a
flag like O_LAZY there is no way for the filesystem to know whether they
are doing their own locking or not.

There is also a simple fallback to "#define O_LAZY 0" if it is not defined
on older systems, and then POSIX-compliant filesystems (not NFS) will still
work correctly, just without the speedup that O_LAZY provides.

Cheers, Andreas
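P.S. A minimal sketch of that fallback, purely for illustration: O_LAZY is
only a proposed flag and is not in any released <fcntl.h>, so the value it
would eventually take is not assumed here, and the file name is made up.
If the header does not define it, the define-to-zero makes the open() below
degrade to an ordinary coherent open on older systems:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #ifndef O_LAZY
    #define O_LAZY 0   /* no-op on kernels/filesystems without relaxed coherency */
    #endif

    int main(void)
    {
            /* "shared.dat" is just an example path */
            int fd = open("shared.dat", O_CREAT | O_WRONLY | O_LAZY, 0644);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            /* O_LAZY only tells the filesystem it may relax coherency;
             * aligned access and any locking remain the application's job. */
            if (write(fd, "x", 1) != 1)
                    perror("write");

            close(fd);
            return 0;
    }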