On Dec 3, 2017, at 10:29 PM, NeilBrown <neilb@xxxxxxxx> wrote:
>
> On Tue, Nov 14 2017, Fu, Rodney wrote:
>
>>> The filesystem can still choose to do that for O_DIRECT if it wants - look at
>>> all the filesystems that have a "fall back to buffered IO because this is too
>>> hard to implement in the direct IO path".
>>
>> Yes, I agree that the filesystem can still decide to buffer IO even with
>> O_DIRECT, but the application's intent is that the effects of caching are
>> minimized.  Whereas with O_CONCURRENT_WRITE, the intent is to maximize caching.
>>
>>> IOWs, you've got another set of custom userspace APIs that are needed to make
>>> proper use of this open flag?
>>
>> Yes and no.  Applications can make ioctls to the filesystem to query or set
>> layout details but don't have to.  Directory-level default layout attributes
>> can be set up by an admin to meet the requirements of the application.
>>
>>>> In panfs, a well-behaved CONCURRENT_WRITE application will consider
>>>> the file's layout on storage.  Access from different machines will not
>>>> overlap within the same RAID stripe so as not to cause distributed
>>>> stripe lock contention.  Writes to the file that are page aligned can
>>>> be cached and the filesystem can aggregate multiple such writes before
>>>> writing out to storage.  Conversely, a CONCURRENT_WRITE application
>>>> that ends up colliding on the same stripe will see worse performance.
>>>> Non-page-aligned writes are treated by panfs as write-through and
>>>> non-cacheable, as the filesystem has to assume that the region of
>>>> the page that is untouched by this machine might in fact be written to
>>>> on another machine.  Caching such a page and writing it out later might
>>>> lead to data corruption.
>>
>>> That seems to fit the expected behaviour of O_DIRECT pretty damn closely - if
>>> the app doesn't do correctly aligned and sized IO then performance is going to
>>> suck, and if the app doesn't serialize access to the file correctly it can and
>>> will corrupt data in the file....
>>
>> I make the same case as above, that O_DIRECT and O_CONCURRENT_WRITE have
>> opposite intents with respect to caching.  Our filesystem handles them
>> differently, so we need to distinguish between the two.
>>
>>>> The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the
>>>> application does not have to implement any caching to see good performance.
>>
>>> Sure, but it has to be aware of layout and where/how it can write, which is
>>> exactly the same constraints that local filesystems place on O_DIRECT access.
>>
>>> Not convinced.  The use case fits pretty neatly into expected O_DIRECT
>>> semantics and behaviour, IMO.
>>
>> I'd like to make a slight adjustment to my proposal.  The HPC community has
>> talked about extensions to POSIX to include O_LAZY as a way for filesystems to
>> relax data coherency requirements.  There is code in the ceph filesystem that
>> uses that flag if defined.  Can we get O_LAZY defined?
>
> This O_LAZY sounds exactly like what NFS has always done.
> If different clients do page-aligned writes and have their own protocol
> to keep track of who owns which page, then everything is fine and
> write-back caching does good things.
> If different clients use byte-range locks, then write-back caching
> is curtailed a bit, but clients don't need to be so careful.
> If clients do non-aligned writes without locking, then corruption can
> result.
> So:
>   #define O_LAZY 0
> and NFS already has it implemented :-)
>
> For NFS, we have O_SYNC, which tries to provide cache coherency as strong
> as other filesystems provide without it.
>
> Do we really want O_LAZY?  Or are other filesystems trying too hard to
> provide coherency when apps don't use locks?

Well, POSIX requires correct read-after-write behaviour regardless of
whether applications are being careful or not.  As you wrote above, "If
clients do non-aligned writes without locking, then corruption can result,"
and there definitely are apps that expect the filesystem to work correctly
even at very large scales.

I think O_LAZY would be reasonable to add, as long as that is what
applications are asking for, but we can't just break long-standing data
correctness behind their backs because it would go faster, and without a
flag like O_LAZY there is no way for the filesystem to know whether they
are doing their own locking or not.

There is also a simple fallback to "#define O_LAZY 0" if it is not defined
on older systems, and then POSIX-compliant filesystems (not NFS) will still
work correctly, just without the speedup that O_LAZY provides.

Cheers, Andreas
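P.S. A minimal sketch of that fallback, purely for illustration: O_LAZY is
only a proposed flag and is not in any released <fcntl.h>, so the value it
would eventually take is not assumed here, and the file name is made up.
If the header does not define it, the define-to-zero makes the open() below
degrade to an ordinary coherent open on older systems:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #ifndef O_LAZY
    #define O_LAZY 0   /* no-op on kernels/filesystems without relaxed coherency */
    #endif

    int main(void)
    {
            /* "shared.dat" is just an example path */
            int fd = open("shared.dat", O_CREAT | O_WRONLY | O_LAZY, 0644);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            /* O_LAZY only tells the filesystem it may relax coherency;
             * aligned access and any locking remain the application's job. */
            if (write(fd, "x", 1) != 1)
                    perror("write");

            close(fd);
            return 0;
    }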