Re: [fuse-devel] 512 byte aligned write + O_DIRECT for xfstests

Nikolaus Rath <Nikolaus@xxxxxxxx> · Mon, 22 Jun 2020 08:35:03 +0100

On Jun 22 2020, Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> [+CC fsdevel folks]
>
> On Mon, Jun 22, 2020 at 8:33 AM Nikolaus Rath <Nikolaus@xxxxxxxx> wrote:
>>
>> On Jun 21 2020, Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
>> >> I am not sure that is correct. At step 6, the write() request from
>> >> userspace is still being processed. I don't think that it is reasonable
>> >> to expect that the write() request is atomic, i.e. you can't expect to
>> >> see none or all of the data that is *currently being written*.
>> >
>> > Apparently the standard is quite clear on this:
>> >
>> >   "All of the following functions shall be atomic with respect to each
>> > other in the effects specified in POSIX.1-2017 when they operate on
>> > regular files or symbolic links:
>> >
>> > [...]
>> > pread()
>> > read()
>> > readv()
>> > pwrite()
>> > write()
>> > writev()
>> > [...]
>> >
>> > If two threads each call one of these functions, each call shall
>> > either see all of the specified effects of the other call, or none of
>> > them."[1]
>> >
>> > Thanks,
>> > Miklos
>> >
>> > [1]
>> > https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07
>>
>> Thanks for digging this up, I did not know about this.
>>
>> That leaves FUSE in a rather uncomfortable place though, doesn't it?
>> What does the kernel do when userspace issues a write request that's
>> bigger than the FUSE userspace pipe? It sounds like either the request must
>> be split (so it becomes non-atomic), or you'd have to return a short
>> write (which IIRC is not supposed to happen for local filesystems).
>>
>
> What makes you say that short writes are not supposed to happen?

I don't think it was an authoritative source, but I've repeatedly read
that "you do not have to worry about short reads/writes when accessing
the local disk". I expect this assumption is baked into many programs,
whether it is valid or not.
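
For comparison, a program that does not rely on that assumption has to
loop until everything is written. A minimal sketch (the helper name
write_all is made up for illustration; os.write is the standard Python
wrapper around write(2) and returns the number of bytes accepted):

```python
import os


def write_all(fd, data):
    """Write all of `data` to `fd`, retrying after short writes.

    `write_all` is a hypothetical helper name, not part of any library.
    Returns the total number of bytes written.
    """
    view = memoryview(data)
    total = 0
    while total < len(view):
        # os.write() may accept fewer bytes than requested (a short write);
        # it returns how many bytes were actually written.
        written = os.write(fd, view[total:])
        if written == 0:
            # Guard against spinning forever if no progress is possible.
            raise OSError("write returned 0 bytes")
        total += written
    return total
```

Programs that call write() once and ignore the return value are the ones
that would break if FUSE started returning short writes.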

> Seems like the options for FUSE are:
> - Take shared i_rwsem lock on read like XFS and regress performance of
>   mixed rw workload
> - Do the above only for non-direct and writeback_cache to minimize the
>   damage potential
> - Return short read/write for direct IO if request is bigger than FUSE
>   buffer size
> - Add a FUSE mode that implements direct IO internally as something like
>   RWF_UNCACHED [2] - this is a relaxed version of "no caching" in client or
>   a stricter version of "cache write-through" in the sense that during an
>   ongoing large write operation, reads of only those freshly written bytes
>   are served from the client cache copy and not from the server.

I didn't understand all of that, but it seems to me that there is a
fundamental problem with splitting up a single write into multiple FUSE
requests, because the second request may fail after the first one
succeeds, leaving the write partially applied.
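
To make the concern concrete, here is a rough sketch. It is not how the
kernel actually does it: split_write, send_request and max_request_size
are all hypothetical names, and send_request stands in for whatever
delivers a single FUSE write request to the server.

```python
def split_write(send_request, data, offset, max_request_size):
    """Send `data` as several requests of at most `max_request_size` bytes.

    `send_request(offset, chunk)` is a hypothetical callable that raises
    OSError on failure. Returns the number of bytes that reached the server.
    """
    done = 0
    while done < len(data):
        chunk = data[done:done + max_request_size]
        try:
            send_request(offset + done, chunk)
        except OSError:
            if done == 0:
                raise  # nothing was written yet: report the error as-is
            break      # earlier chunks already landed: report a short write
        done += len(chunk)
    return done
```

If the second request fails, the first chunk is already on the server, so
the caller can only be told about a short write. The original write() is
then neither fully applied nor fully absent.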

Best,
-Nikolaus

-- 
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

             »Time flies like an arrow, fruit flies like a Banana.«