[+CC fsdevel folks]

On Mon, Jun 22, 2020 at 8:33 AM Nikolaus Rath <Nikolaus@xxxxxxxx> wrote:
>
> On Jun 21 2020, Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
> >> I am not sure that is correct. At step 6, the write() request from
> >> userspace is still being processed. I don't think that it is reasonable
> >> to expect that the write() request is atomic, i.e. you can't expect to
> >> see none or all of the data that is *currently being written*.
> >
> > Apparently the standard is quite clear on this:
> >
> > "All of the following functions shall be atomic with respect to each
> > other in the effects specified in POSIX.1-2017 when they operate on
> > regular files or symbolic links:
> >
> > [...]
> > pread()
> > read()
> > readv()
> > pwrite()
> > write()
> > writev()
> > [...]
> >
> > If two threads each call one of these functions, each call shall
> > either see all of the specified effects of the other call, or none of
> > them."[1]
> >
> > Thanks,
> > Miklos
> >
> > [1]
> > https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07
>
> Thanks for digging this up, I did not know about this.
>
> That leaves FUSE in a rather uncomfortable place though, doesn't it?
> What does the kernel do when userspace issues a write request that's
> bigger than FUSE userspace pipe? It sounds like either the request must
> be splitted (so it becomes non-atomic), or you'd have to return a short
> write (which IIRC is not supposed to happen for local filesystems).
>

What makes you say that short writes are not supposed to happen?
and what is the definition of "local filesystem" in that claim?

FYI, a similar discussion is also happening about XFS "atomic rw"
behavior [1].

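
To illustrate the short-write point: POSIX lets write() return fewer bytes
than requested, so a careful caller usually loops. Below is a minimal sketch
of such a loop in Python; the helper name write_all is made up for this
example and is not part of FUSE or any library.

```python
import os


def write_all(fd: int, data: bytes) -> int:
    """Write all of data to fd, retrying after short writes.

    write_all is a hypothetical helper for illustration only.
    """
    view = memoryview(data)
    done = 0
    while done < len(view):
        # os.write() may accept fewer bytes than offered; it returns the
        # number of bytes actually written.
        n = os.write(fd, view[done:])
        if n == 0:
            raise OSError("write() made no progress")
        done += n
    return done
```

Note that a loop like this gives up any all-or-nothing guarantee for the
request as a whole: a concurrent reader can observe the state between two
iterations, which is the same trade-off as splitting the request in the
kernel.
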
Seems like the options for FUSE are:
- Take shared i_rwsem lock on read like XFS and regress performance of
  mixed rw workload
- Do the above only for non-direct and writeback_cache to minimize the
  damage potential
- Return short read/write for direct IO if request is bigger than FUSE
  buffer size
- Add a FUSE mode that implements direct IO internally as something like
  RWF_UNCACHED [2] - this is a relaxed version of "no caching" in client
  or a stricter version of "cache write-through" in the sense that during
  an ongoing large write operation, read of those fresh written bytes only
  is served from the client cache copy and not from the server.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/20200622010234.GD2040@xxxxxxxxxxxxxxxxxxx/
[2] https://lore.kernel.org/linux-fsdevel/20191217143948.26380-1-axboe@xxxxxxxxx/
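
For anyone who wants to see how a given mount behaves with respect to the
read/write atomicity discussed in this thread, a rough probe could look like
the sketch below. One thread keeps overwriting the same region with all-'A'
or all-'B' data while the main thread reads it back and counts buffers that
mix the two. The names probe, is_torn and BLOCK, and the 1 MiB size, are
assumptions made for this example.

```python
import os
import threading

BLOCK = 1 << 20  # hypothetical 1 MiB record, large enough to span many pages


def is_torn(buf: bytes) -> bool:
    """Return True if buf contains more than one distinct byte value."""
    return bool(buf) and buf.count(buf[:1]) != len(buf)


def probe(path: str, rounds: int = 200) -> int:
    """Count reads of path that observed a mix of old and new data."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    os.pwrite(fd, b"A" * BLOCK, 0)
    stop = threading.Event()

    def writer() -> None:
        fills = (b"B" * BLOCK, b"A" * BLOCK)
        i = 0
        while not stop.is_set():
            # One pwrite() per record. A short write here would itself
            # leave mixed data, so treat hits as a hint, not as proof.
            os.pwrite(fd, fills[i % 2], 0)
            i += 1

    t = threading.Thread(target=writer)
    t.start()
    torn = 0
    try:
        for _ in range(rounds):
            if is_torn(os.pread(fd, BLOCK, 0)):
                torn += 1
    finally:
        stop.set()
        t.join()
        os.close(fd)
    return torn
```

Calling probe("/path/on/the/mount/probe.dat") returns the number of mixed
reads. A nonzero count suggests that reads are not serialized against
in-flight writes on that mount; a zero count proves nothing, since the race
may simply not have been hit.
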