Re: [fuse-devel] 512 byte aligned write + O_DIRECT for xfstests

Amir Goldstein <amir73il@xxxxxxxxx> · Mon, 22 Jun 2020 10:57:50 +0300

On Mon, Jun 22, 2020 at 10:35 AM Nikolaus Rath <Nikolaus@xxxxxxxx> wrote:
>
> On Jun 22 2020, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> > [+CC fsdevel folks]
> >
> > On Mon, Jun 22, 2020 at 8:33 AM Nikolaus Rath <Nikolaus@xxxxxxxx> wrote:
> >>
> >> On Jun 21 2020, Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
> >> >> I am not sure that is correct. At step 6, the write() request from
> >> >> userspace is still being processed. I don't think that it is reasonable
> >> >> to expect that the write() request is atomic, i.e. you can't expect to
> >> >> see none or all of the data that is *currently being written*.
> >> >
> >> > Apparently the standard is quite clear on this:
> >> >
> >> >   "All of the following functions shall be atomic with respect to each
> >> > other in the effects specified in POSIX.1-2017 when they operate on
> >> > regular files or symbolic links:
> >> >
> >> > [...]
> >> > pread()
> >> > read()
> >> > readv()
> >> > pwrite()
> >> > write()
> >> > writev()
> >> > [...]
> >> >
> >> > If two threads each call one of these functions, each call shall
> >> > either see all of the specified effects of the other call, or none of
> >> > them."[1]
> >> >
> >> > Thanks,
> >> > Miklos
> >> >
> >> > [1]
> >> > https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07
> >>
> >> Thanks for digging this up, I did not know about this.
> >>
> >> That leaves FUSE in a rather uncomfortable place though, doesn't it?
> >> What does the kernel do when userspace issues a write request that's
> >> bigger than the FUSE userspace pipe? It sounds like either the request must
> >> be split (so it becomes non-atomic), or you'd have to return a short
> >> write (which IIRC is not supposed to happen for local filesystems).
> >>
> >
> > What makes you say that short writes are not supposed to happen?
>
> I don't think it was an authoritative source, but I've repeatedly read
> that "you do not have to worry about short reads/writes when accessing
> the local disk". I expect this assumption to be baked into many
> programs, whether it is valid or not.
>

Even if that statement were considered true, since when can we speak of
FUSE as a "local filesystem"?
IMO it has all the characteristics of a "network filesystem".

> > Seems like the options for FUSE are:
> > - Take shared i_rwsem lock on read like XFS and regress performance of
> >   mixed rw workload
> > - Do the above only for non-direct and writeback_cache to minimize the
> >   damage potential
> > - Return short read/write for direct IO if request is bigger than the FUSE
> >   buffer size
> > - Add a FUSE mode that implements direct IO internally as something like
> >   RWF_UNCACHED [2] - this is a relaxed version of "no caching" in client or
> >   a stricter version of "cache write-through" in the sense that during an
> >   ongoing large write operation, read of those fresh written bytes only is
> >   served from the client cache copy and not from the server.
>
> I didn't understand all of that, but it seems to me that there is a
> fundamental problem with splitting up a single write into multiple FUSE
> requests, because the second request may fail after the first one
> succeeds.
>

I think you are confused by the use of the word "atomic" in the standard.
It does not mean what the O_ATOMIC proposal means, that is, write everything
or write nothing at all.
It means that if thread A successfully wrote data X over data Y, then thread B
can read either X or Y, but not half X and half Y.
If A got an error on write, the content that B will read is probably undefined
(excuse me for not reading what "the law" has to say about this).
If A got a short (half) write, then surely B can read either half X or half Y
from the first half of the range. For the second half of the range, I am not
sure what to expect.
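
To make the distinction concrete, here is a minimal Python sketch. It is an
illustration only, not taken from the FUSE code or from xfstests, and the
names classify, overwrite_and_read and BLOCK are hypothetical. It labels what
a reader observed in a range that a writer overwrote: all of X, all of Y, or
a torn mix of the two, which is the case the quoted POSIX text rules out.

```python
import os
import tempfile

BLOCK = 4096  # hypothetical size of the overwritten range


def classify(buf: bytes, x: bytes, y: bytes) -> str:
    """Label what a reader observed in the range that a writer overwrote."""
    if buf == x:
        return "X"     # reader saw all effects of the write
    if buf == y:
        return "Y"     # reader saw none of them
    return "torn"      # a mix of X and Y


def overwrite_and_read(x: bytes, y: bytes) -> str:
    """Write Y, overwrite it with X, then read the range back and classify it."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        os.pwrite(fd, y, 0)
        os.pwrite(fd, x, 0)
        return classify(os.pread(fd, len(x), 0), x, y)
```

The helper above runs the write and the read one after the other, so it only
shows the classification. An actual check would issue the pread() from a
second thread while the pwrite() is in flight and look for "torn" results.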

So I do not see any fundamental problem with FUSE write requests.
On the contrary - FUSE write requests are just like any network protocol write
request or local disk IO request for that matter.
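
For reference, this is roughly how a program that does not assume full writes
copes with short ones, as it would on a socket. It is a minimal sketch under
the assumption of a Unix system where os.pwrite() is available; pwrite_all and
roundtrip are hypothetical helper names.

```python
import os
import tempfile


def pwrite_all(fd: int, data: bytes, offset: int) -> int:
    """Write all of data at offset, retrying after each short write."""
    view = memoryview(data)
    done = 0
    while done < len(view):
        # os.pwrite() returns the number of bytes actually written,
        # which may be less than requested.
        n = os.pwrite(fd, view[done:], offset + done)
        if n == 0:
            raise OSError("pwrite() made no progress")
        done += n
    return done


def roundtrip(payload: bytes) -> bytes:
    """Usage example: write payload to a scratch file and read it back."""
    with tempfile.TemporaryFile() as f:
        pwrite_all(f.fileno(), payload, 0)
        return os.pread(f.fileno(), len(payload), 0)
```

Note that each os.pwrite() call in the loop is a separate operation, so a
concurrent reader may observe the file between two iterations. The loop
completes the write but does not make the whole range appear atomically.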

Unless I am missing something...

Thanks,
Amir.