On Mon, Nov 13, 2017 at 05:02:20PM +0000, Fu, Rodney wrote:
> > > > No. If you want new flags bits, make a public proposal. Maybe some
> > > > other filesystem would also benefit from them.
> > >
> > > Ah, I see what you mean now, thanks.
> > >
> > > I would like to propose O_CONCURRENT_WRITE as a new open flag. It is
> > > currently used in the Panasas filesystem (panfs) and defined with value:
> > >
> > > #define O_CONCURRENT_WRITE 020000000000
> > >
> > > This flag has been provided by panfs to HPC users via the mpich
> > > package for well over a decade. See:
> > >
> > > https://github.com/pmodels/mpich/blob/master/src/mpi/romio/adio/ad_panfs/ad_panfs_open6.c#L344
> > >
> > > O_CONCURRENT_WRITE indicates to the filesystem that the application
> > > doing the open is participating in a coordinated distributed manner
> > > with other such applications, possibly running on different hosts.
> > > This allows the panfs filesystem to delegate some of the cache
> > > coherency responsibilities to the application, improving performance.
> >
> > O_DIRECT already delegates responsibility for cache coherency to userspace
> > applications and it allows for concurrent writes to a single file. Why do we
> > need a new flag for this?
> >
> > > The reason this flag is used on open as opposed to having a post-open
> > > ioctl or fcntl SETFL is to allow panfs to catch and reject opens by
> > > applications that attempt to access files that have already been
> > > opened by applications that have set O_CONCURRENT_WRITE.
> >
> > Sounds kinda like how we already use O_EXCL on block devices.
> > Perhaps something like:
> >
> > #define O_CONCURRENT_WRITE (O_DIRECT | O_EXCL)
> >
> > To tell open to reject mixed mode access to the file on open?
> >
> > -Dave.
> > --
> > Dave Chinner
> > david@xxxxxxxxxxxxx
>
> Thanks for this suggestion, but O_DIRECT has a significantly different meaning
> to O_CONCURRENT_WRITE.
> O_DIRECT forces the filesystem to not cache read or
> write data, while O_CONCURRENT_WRITE allows caching and concurrent distributed
> access. I was not clear in my initial description of CONCURRENT_WRITE, so let
> me add more details here.
>
> When O_CONCURRENT_WRITE is used, portions of read and write data are still
> cacheable in the filesystem.

The filesystem can still choose to do that for O_DIRECT if it wants -
look at all the filesystems that have a "fall back to buffered IO
because this is too hard to implement in the direct IO path".

> The filesystem continues to be responsible for
> maintaining distributed coherency.

Just like gfs2 and ocfs2 maintain distributed coherency when doing
direct IO...

> The user application is expected to provide
> an access pattern that will allow the filesystem to cache data, thereby
> improving performance. If the application misbehaves, the filesystem will still
> guarantee coherency but at a performance cost, as portions of the file will have
> to be treated as non-cacheable.

IOWs, you've got another set of custom userspace APIs that are
needed to make proper use of this open flag?

> In panfs, a well-behaved CONCURRENT_WRITE application will consider the file's
> layout on storage. Access from different machines will not overlap within the
> same RAID stripe so as not to cause distributed stripe lock contention. Writes
> to the file that are page aligned can be cached and the filesystem can aggregate
> multiple such writes before writing out to storage. Conversely, a
> CONCURRENT_WRITE application that ends up colliding on the same stripe will see
> worse performance. Non-page-aligned writes are treated by panfs as
> write-through and non-cacheable, as the filesystem will have to assume that the
> region of the page that is untouched by this machine might in fact be written to
> on another machine. Caching such a page and writing it out later might lead to
> data corruption.
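
The access rules described in the quoted paragraph above can be sketched
roughly in C. This is an illustration only, not panfs code: the helper names
(cw_page_aligned, cw_stripe_of, cw_ranges_share_stripe) are made up, the page
and stripe sizes are supplied by the caller, and it assumes stripes are simply
laid out back to back at a fixed size, which a real panfs layout may not do.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of panfs or any kernel API. */

/* True if a write of len bytes at offset off covers only whole pages,
 * i.e. the kind of write the quoted text says can be cached. */
bool cw_page_aligned(uint64_t off, uint64_t len, uint64_t page_size)
{
    return len > 0 && off % page_size == 0 && len % page_size == 0;
}

/* Index of the stripe holding byte offset off, ASSUMING fixed-size stripes
 * laid out contiguously from offset 0. */
uint64_t cw_stripe_of(uint64_t off, uint64_t stripe_size)
{
    return off / stripe_size;
}

/* True if two non-empty byte ranges touch at least one common stripe,
 * which is the case the quoted text says causes stripe lock contention. */
bool cw_ranges_share_stripe(uint64_t off_a, uint64_t len_a,
                            uint64_t off_b, uint64_t len_b,
                            uint64_t stripe_size)
{
    uint64_t first_a = cw_stripe_of(off_a, stripe_size);
    uint64_t last_a  = cw_stripe_of(off_a + len_a - 1, stripe_size);
    uint64_t first_b = cw_stripe_of(off_b, stripe_size);
    uint64_t last_b  = cw_stripe_of(off_b + len_b - 1, stripe_size);

    /* Two closed intervals of stripe indices overlap iff each one starts
     * no later than the other ends. */
    return first_a <= last_b && first_b <= last_a;
}
```

Under these assumptions, with a 64 KiB stripe, writers owning [0, 64 KiB) and
[64 KiB, 128 KiB) do not share a stripe, while two 4 KiB writes at offsets 0
and 8 KiB do.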

That seems to fit the expected behaviour of O_DIRECT pretty damn
closely - if the app doesn't do correctly aligned and sized IO then
performance is going to suck, and if the app doesn't serialise
access to the file correctly it can and will corrupt data in the
file....

> The benefit of CONCURRENT_WRITE is that unlike O_DIRECT, the application does
> not have to implement any caching to see good performance.

Sure, but it has to be aware of layout and where/how it can write,
which are exactly the same constraints that local filesystems place
on O_DIRECT access.

> The intricacies of
> maintaining distributed coherency are left to the filesystem instead of to
> the application developer. Caching at the filesystem layer allows multiple
> CONCURRENT_WRITE processes on the same machine to enjoy the performance benefits
> of the page cache.
>
> Think of this as a hybrid between exclusive access to a file, where the
> filesystem can cache everything and a simplistic shared mode where the
> filesystem caches nothing.
>
> So we really do need a separate flag defined. Thanks!

Not convinced. The use case fits pretty neatly into expected
O_DIRECT semantics and behaviour, IMO.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx