On Mon, Jun 22, 2020 at 10:35 AM Nikolaus Rath <Nikolaus@xxxxxxxx> wrote:
>
> On Jun 22 2020, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> > [+CC fsdevel folks]
> >
> > On Mon, Jun 22, 2020 at 8:33 AM Nikolaus Rath <Nikolaus@xxxxxxxx> wrote:
> >>
> >> On Jun 21 2020, Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
> >> >> I am not sure that is correct. At step 6, the write() request from
> >> >> userspace is still being processed. I don't think that it is reasonable
> >> >> to expect that the write() request is atomic, i.e. you can't expect to
> >> >> see none or all of the data that is *currently being written*.
> >> >
> >> > Apparently the standard is quite clear on this:
> >> >
> >> > "All of the following functions shall be atomic with respect to each
> >> > other in the effects specified in POSIX.1-2017 when they operate on
> >> > regular files or symbolic links:
> >> >
> >> > [...]
> >> > pread()
> >> > read()
> >> > readv()
> >> > pwrite()
> >> > write()
> >> > writev()
> >> > [...]
> >> >
> >> > If two threads each call one of these functions, each call shall
> >> > either see all of the specified effects of the other call, or none of
> >> > them."[1]
> >> >
> >> > Thanks,
> >> > Miklos
> >> >
> >> > [1]
> >> > https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07
> >>
> >> Thanks for digging this up, I did not know about this.
> >>
> >> That leaves FUSE in a rather uncomfortable place though, doesn't it?
> >> What does the kernel do when userspace issues a write request that's
> >> bigger than FUSE userspace pipe? It sounds like either the request must
> >> be split (so it becomes non-atomic), or you'd have to return a short
> >> write (which IIRC is not supposed to happen for local filesystems).
> >>
> >
> > What makes you say that short writes are not supposed to happen?
>
> I don't think it was an authoritative source, but I've repeatedly read
> that "you do not have to worry about short reads/writes when accessing
> the local disk". I expect this to be a common expectation to be baked
> into programs, no matter if valid or not.
>

Even if that statement were considered true, since when can we speak
of FUSE as a "local filesystem"? IMO it has all the characteristics of
a "network filesystem".

> > Seems like the options for FUSE are:
> > - Take shared i_rwsem lock on read like XFS and regress performance of
> >   mixed rw workload
> > - Do the above only for non-direct and writeback_cache to minimize the
> >   damage potential
> > - Return short read/write for direct IO if request is bigger than FUSE
> >   buffer size
> > - Add a FUSE mode that implements direct IO internally as something like
> >   RWF_UNCACHED [2] - this is a relaxed version of "no caching" in client or
> >   a stricter version of "cache write-through" in the sense that during an
> >   ongoing large write operation, read of those fresh written bytes only is
> >   served from the client cache copy and not from the server.
>
> I didn't understand all of that, but it seems to me that there is a
> fundamental problem with splitting up a single write into multiple FUSE
> requests, because the second request may fail after the first one
> succeeds.
>

I think you are confused by the use of the word "atomic" in the standard.
It does not mean what the O_ATOMIC proposal means, that is, write
everything or write nothing at all.
It means that if thread A successfully wrote data X over data Y, then
thread B can read either X or Y, but not half X and half Y.
If A got an error on write, the content that B will read is probably
undefined (excuse me for not reading what "the law" has to say about this).
If A got a short (half) write, then surely B can read either half X or
half Y from the first half of the range. For the second half of the
range, I am not sure what to expect.
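
To make the "X or Y, but not half X half Y" reading concrete, here is a
rough userspace sketch in Python. It is only an illustration: the names
classify() and observe(), the 1 MiB size and the temporary file are all
made up for this mail, and it assumes a platform where os.pread() and
os.pwrite() are available. It reports what a reader happened to see and
proves nothing about any particular filesystem.

```python
import os
import tempfile
import threading

SIZE = 1 << 20  # hypothetical size of the overwritten range (1 MiB)


def classify(buf: bytes) -> str:
    """Label what a reader saw: 'old' (all Y), 'new' (all X) or 'torn'."""
    kinds = set(buf)
    if kinds == {ord("Y")}:
        return "old"
    if kinds == {ord("X")}:
        return "new"
    # A mix of X and Y (or an empty read) is not an all-or-nothing view.
    return "torn"


def observe(rounds: int = 50) -> set:
    """Thread A overwrites Y with X while thread B reads the same range."""
    seen = set()
    fd, path = tempfile.mkstemp()
    try:
        for _ in range(rounds):
            # Reset the range to Y; assumes this write is not short.
            os.pwrite(fd, b"Y" * SIZE, 0)
            writer = threading.Thread(
                target=os.pwrite, args=(fd, b"X" * SIZE, 0)
            )
            writer.start()
            # Thread B: read while the overwrite may still be in flight.
            seen.add(classify(os.pread(fd, SIZE, 0)))
            writer.join()
    finally:
        os.close(fd)
        os.unlink(path)
    return seen


if __name__ == "__main__":
    print(observe())
```

Under the reading of the standard above, only "old" and "new" should
ever show up; a "torn" result is exactly the half X half Y case.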

So I do not see any fundamental problem with FUSE write requests.
On the contrary, FUSE write requests are just like any network protocol
write request, or a local disk IO request for that matter.
Unless I am missing something...

Thanks,
Amir.