Re: [RFC] write(2) semantics wrt return values and current position

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Mon, 6 Apr 2015 11:13:14 -0700

On Mon, Apr 6, 2015 at 9:02 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>
>         1) should we ever update the current position when write returns
> an error?  As it is, write(2) explicitly ignores any changes of position
> when ->write() has returned an error, but some other callers of vfs_write()
> are not so careful.

I think that question is the wrong way around.

If the write has ever been even partially successful, we should never
return an error. We should return the partial success.

So "error return" does imply "no position change", but not because we
shouldn't update the position for errors, but because we shouldn't
return error for anything that has updated the position.

NOTE! If we have magical non-POSIX files that update position even if
they otherwise return zero (don't ask me why, but we have various
non-posix behavior, maybe we have that case too), that's different.

Such a magical file might return an error (instead of zero) and update
position anyway. We have special files, they are ourside any POSIX
rules.

>         2) should we ever update the current position when write() returns 0?

I think the ext4 behavior is fine, although there's some noise in
POSIX about zero-sized writes being no-ops. I think the POSIX wording
is simply because some systems used to check for zero length before
even doing anything. In fact, I think Linux used to do that too, but
then we had special packetized formats that we wanted to syupport with
write() too (not just sendmsg), so that got removed.

I really don't think we should worry about it.

>         3) pwrite(2): POSIX seems to require ignoring the O_APPEND completely
> for that syscall.

That sounds insane too. But so is honoring it and ignoring the pwrite
offset. So I suspect the sane semantics for pwrite() with O_APPEND
should probably be to always just fail (the same way we fail pwrite on
nonseekable fd's), but if we have programs that use it, there's not a
lot we can do.

>         4) at lower level, there's a nasty case when short (but non-empty)
> O_DIRECT write followed by success of fallback to buffered write and a failure
> of filemap_write_and_wait_range() yields a return of the amount written by
> ->direct_IO() *and* update of current position by that plus the amount
> reported by buffered write.  IOW, we shift the offset by amount different
> from (positive) value we'll be returning from write(2).  That's a direct
> POSIX violation and I would expect the userland to be very surprised by
> running into that.  IMO it's a bug and we would be better off by shifting
> position by the amount we'll be returning.

That does sound like a bug. If we return a success value, and it's a
normal file (ie not the FAT "translate NL into NLCR" magic, or some
/proc seqfile etc), then I agree that the position should update by
the value we returned from write.

And there is no way I can see that a user program can depend on the
crazy behavior you describe.

>         5) somewhat related: nfs_direct_IO() ends up calling
> nfs_file_direct_write(), which calls generic_write_checks();
> it's triggered by swap-over-NFS (normal O_DIRECT writes go directly to
> nfs_file_direct_write()), and it ends up being subject to rlimit of
> caller.  Which might be anyone who calls alloc_pages(), AFAICS.  Almost
> certainly a bug.

Yeah, sounds bogus.

>         6) XFS seems to have fun bugs in O_DIRECT handling.  Consider
> the following scenario:
>         * O_DIRECT write() is called, we hit xfs_file_dio_aio_write().
>         * we check alignment and make decision whether to do
> xfs_rw_ilock exclusive (which will include i_mutex) or shared (which will
> not).  Suppose it takes that shared.
>         * we call xfs_file_aio_write_checks(), which, for starters, might
> modify position (on O_APPEND) and size (on rlimit).  Which renders the
> alignment checks useless, of course, but what's worse, it proceeds to
> calling xfs_break_layouts(), which might drop and retake XFS part of what's
> taken by xfs_rw_iolock().  Retake it exclusive, and update the iolock flag
> passed to it by reference accordingly.  And when we return to
> xfs_file_aio_write_checks(), and do xfs_rw_iunlock(), we'll end up dropping
> exclusively taken XFS part of things *and* ->i_mutex we'd never taken.
>         I might be misreading that code (it sure as hell wouldn't be
> the first time when xfs_{rw_,}_ilock() is involved), but it looks dubious
> to me...

I don't think aio_write() makes sense on an O_APPEND file (for the
same reason pwrite() doesn't), but we might be stuck with it.  People
who do that are insane and probably deserve whatever crazy semantics
they get (and if they rely on them, we shouldn't change them in the
name of "make things sane").

If the lack of proper locking causes coherence problems, that's a XFS
bug, of course.

>         My preference would be to have new_sync_write() and vfs_iter_write()
> to ignore iocb.ki_pos when ->write_iter() returns negative or zero (would
> take care of (1) and (2))

I guess that should be ok, because the (1) case is a "shouldn't happen
anyway", and the (2) case is debatable (and probably depends on
filesystem, so hopefully no user app can depend on it).

So I don't think the current behavior for 0 is necessarily wrong, but
I also don't care that strongly, so..

>     and have __generic_file_write_iter() to do
> ->ki_pos update in sync with what it'll be returning (takes care of (4)).

Yes.

> (3) is probably too old to fix, (5) should have generic_write_checks() done
> outside of fs/nfs/direct.c.  No idea on (6) and I would really like to hear
> from XFS folks before doing anything to that one.

Sounds ok by me.

                        Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html