Re: [RFC] write(2) semantics wrt return values and current position

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Mon, 6 Apr 2015 20:29:03 +0100

On Mon, Apr 06, 2015 at 11:13:14AM -0700, Linus Torvalds wrote:
> On Mon, Apr 6, 2015 at 9:02 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> >
> >         1) should we ever update the current position when write returns
> > an error?  As it is, write(2) explicitly ignores any changes of position
> > when ->write() has returned an error, but some other callers of vfs_write()
> > are not so careful.
> 
> I think that question is the wrong way around.
> 
> If the write has ever been even partially successful, we should never
> return an error. We should return the partial success.

Check what happens if generic_write_sync() (from generic_file_write_iter())
fails.  That's the kind of thing I'm worried about.  Sure, an error halfway
through => short write, no problem with that.  We do handle that.

> >         2) should we ever update the current position when write() returns 0?
> 
> I think the ext4 behavior is fine, although there's some noise in
> POSIX about zero-sized writes being no-ops. I think the POSIX wording
> is simply because some systems used to check for zero length before
> even doing anything. In fact, I think Linux used to do that too, but
> then we had special packetized formats that we wanted to syupport with
> write() too (not just sendmsg), so that got removed.
> 
> I really don't think we should worry about it.

No, it's just that it would be a lot more convenient to have it ki_pos
discarded (in new_sync_write() and vfs_iter_write()) when ->write_iter()
returns 0 or negative.  As it is, we do rather clumsy dances in
generic_write_checks() and it would be much nicer if we could simply
pass iocb and iter to it and have it update ->ki_pos (in O_APPEND case)
and do iov_iter_turncate().  And yes, it needs massage to get iocb to
all callers; the main obstacle used to be v9fs_file_write(), but that's
got dealt with in my tree.

> >         4) at lower level, there's a nasty case when short (but non-empty)
> > O_DIRECT write followed by success of fallback to buffered write and a failure
> > of filemap_write_and_wait_range() yields a return of the amount written by
> > ->direct_IO() *and* update of current position by that plus the amount
> > reported by buffered write.  IOW, we shift the offset by amount different
> > from (positive) value we'll be returning from write(2).  That's a direct
> > POSIX violation and I would expect the userland to be very surprised by
> > running into that.  IMO it's a bug and we would be better off by shifting
> > position by the amount we'll be returning.
> 
> That does sound like a bug. If we return a success value, and it's a
> normal file (ie not the FAT "translate NL into NLCR" magic, or some
> /proc seqfile etc), then I agree that the position should update by
> the value we returned from write.

ext* and friends.  It's in __generic_file_write_iter().

> >         6) XFS seems to have fun bugs in O_DIRECT handling.  Consider
> > the following scenario:
> >         * O_DIRECT write() is called, we hit xfs_file_dio_aio_write().
> >         * we check alignment and make decision whether to do
> > xfs_rw_ilock exclusive (which will include i_mutex) or shared (which will
> > not).  Suppose it takes that shared.
> >         * we call xfs_file_aio_write_checks(), which, for starters, might
> > modify position (on O_APPEND) and size (on rlimit).  Which renders the
> > alignment checks useless, of course, but what's worse, it proceeds to
> > calling xfs_break_layouts(), which might drop and retake XFS part of what's
> > taken by xfs_rw_iolock().  Retake it exclusive, and update the iolock flag
> > passed to it by reference accordingly.  And when we return to
> > xfs_file_aio_write_checks(), and do xfs_rw_iunlock(), we'll end up dropping
> > exclusively taken XFS part of things *and* ->i_mutex we'd never taken.
> >         I might be misreading that code (it sure as hell wouldn't be
> > the first time when xfs_{rw_,}_ilock() is involved), but it looks dubious
> > to me...
> 
> I don't think aio_write() makes sense on an O_APPEND file (for the
> same reason pwrite() doesn't), but we might be stuck with it.  People
> who do that are insane and probably deserve whatever crazy semantics
> they get (and if they rely on them, we shouldn't change them in the
> name of "make things sane").
> 
> If the lack of proper locking causes coherence problems, that's a XFS
> bug, of course.

It doesn't have to be O_DIRECT; setrlimit(2) from another thread is enough
to screw the alignment to hell and back and mutex_unlock() of something
we hadn't done mutex_lock() to is definitely a bug (don't need O_APPEND to
trigger that either; needs pNFS involved, AFAICS).  I'd really like comments
from Christoph and Dave on that on...
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html