On Jun 22 2020, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> [+CC fsdevel folks]
>
> On Mon, Jun 22, 2020 at 8:33 AM Nikolaus Rath <Nikolaus@xxxxxxxx> wrote:
>>
>> On Jun 21 2020, Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
>> >> I am not sure that is correct. At step 6, the write() request from
>> >> userspace is still being processed. I don't think that it is reasonable
>> >> to expect that the write() request is atomic, i.e. you can't expect to
>> >> see none or all of the data that is *currently being written*.
>> >
>> > Apparently the standard is quite clear on this:
>> >
>> > "All of the following functions shall be atomic with respect to each
>> > other in the effects specified in POSIX.1-2017 when they operate on
>> > regular files or symbolic links:
>> >
>> > [...]
>> > pread()
>> > read()
>> > readv()
>> > pwrite()
>> > write()
>> > writev()
>> > [...]
>> >
>> > If two threads each call one of these functions, each call shall
>> > either see all of the specified effects of the other call, or none of
>> > them."[1]
>> >
>> > Thanks,
>> > Miklos
>> >
>> > [1]
>> > https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07
>>
>> Thanks for digging this up, I did not know about this.
>>
>> That leaves FUSE in a rather uncomfortable place though, doesn't it?
>> What does the kernel do when userspace issues a write request that's
>> bigger than FUSE userspace pipe? It sounds like either the request must
>> be splitted (so it becomes non-atomic), or you'd have to return a short
>> write (which IIRC is not supposed to happen for local filesystems).
>>
>
> What makes you say that short writes are not supposed to happen?

I don't think it was an authoritative source, but I've repeatedly read
that "you do not have to worry about short reads/writes when accessing
the local disk". I expect this assumption to be baked into many
programs, whether it is valid or not.

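To make the assumption concrete: a program that does not rely on full
writes has to loop until everything is written. The following is only a
rough sketch of such a loop, not code from FUSE or any real application.
The name write_all and its write parameter are hypothetical; the
parameter exists only so that a short-writing stand-in can be plugged in
for os.write.

```python
import os
from typing import Callable


def write_all(fd: int, data: bytes,
              write: Callable[..., int] = os.write) -> int:
    """Write all of `data` to `fd`, retrying after short writes.

    `write` defaults to os.write. It is a parameter only so that a
    short-writing stand-in can be substituted (hypothetical, for
    illustration).
    """
    view = memoryview(data)
    done = 0
    while done < len(view):
        # The call may accept fewer bytes than requested (a short write).
        n = write(fd, view[done:])
        if n == 0:
            # Avoid spinning forever if no progress is being made.
            raise OSError("write returned 0 bytes; giving up")
        done += n
    return done
```

Programs that call write() once and ignore the return value skip this
loop entirely, which is why short writes would surprise them. Note that
the loop itself is not atomic either: a later iteration can raise an
error after earlier iterations have already written part of the data.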
> Seems like the options for FUSE are:
> - Take shared i_rwsem lock on read like XFS and regress performance of
>   mixed rw workload
> - Do the above only for non-direct and writeback_cache to minimize the
>   damage potential
> - Return short read/write for direct IO if request is bigger that FUSE
>   buffer size
> - Add a FUSE mode that implements direct IO internally as something like
>   RWF_UNCACHED [2] - this is a relaxed version of "no caching" in client or
>   a stricter version of "cache write-through" in the sense that
>   during an ongoing large write operation, read of those fresh written
>   bytes only is served from the client cache copy and not from the server.

I didn't understand all of that, but it seems to me that there is a
fundamental problem with splitting up a single write into multiple FUSE
requests, because the second request may fail after the first one
succeeds.

Best,
-Nikolaus

-- 
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             »Time flies like an arrow, fruit flies like a Banana.«