On Mon, Apr 06, 2015 at 11:13:14AM -0700, Linus Torvalds wrote: > On Mon, Apr 6, 2015 at 9:02 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote: > > > > 1) should we ever update the current position when write returns > > an error? As it is, write(2) explicitly ignores any changes of position > > when ->write() has returned an error, but some other callers of vfs_write() > > are not so careful. > > I think that question is the wrong way around. > > If the write has ever been even partially successful, we should never > return an error. We should return the partial success. Check what happens if generic_write_sync() (from generic_file_write_iter()) fails. That's the kind of thing I'm worried about. Sure, an error halfway through => short write, no problem with that. We do handle that. > > 2) should we ever update the current position when write() returns 0? > > I think the ext4 behavior is fine, although there's some noise in > POSIX about zero-sized writes being no-ops. I think the POSIX wording > is simply because some systems used to check for zero length before > even doing anything. In fact, I think Linux used to do that too, but > then we had special packetized formats that we wanted to syupport with > write() too (not just sendmsg), so that got removed. > > I really don't think we should worry about it. No, it's just that it would be a lot more convenient to have it ki_pos discarded (in new_sync_write() and vfs_iter_write()) when ->write_iter() returns 0 or negative. As it is, we do rather clumsy dances in generic_write_checks() and it would be much nicer if we could simply pass iocb and iter to it and have it update ->ki_pos (in O_APPEND case) and do iov_iter_turncate(). And yes, it needs massage to get iocb to all callers; the main obstacle used to be v9fs_file_write(), but that's got dealt with in my tree. > > 4) at lower level, there's a nasty case when short (but non-empty) > > O_DIRECT write followed by success of fallback to buffered write and a failure > > of filemap_write_and_wait_range() yields a return of the amount written by > > ->direct_IO() *and* update of current position by that plus the amount > > reported by buffered write. IOW, we shift the offset by amount different > > from (positive) value we'll be returning from write(2). That's a direct > > POSIX violation and I would expect the userland to be very surprised by > > running into that. IMO it's a bug and we would be better off by shifting > > position by the amount we'll be returning. > > That does sound like a bug. If we return a success value, and it's a > normal file (ie not the FAT "translate NL into NLCR" magic, or some > /proc seqfile etc), then I agree that the position should update by > the value we returned from write. ext* and friends. It's in __generic_file_write_iter(). > > 6) XFS seems to have fun bugs in O_DIRECT handling. Consider > > the following scenario: > > * O_DIRECT write() is called, we hit xfs_file_dio_aio_write(). > > * we check alignment and make decision whether to do > > xfs_rw_ilock exclusive (which will include i_mutex) or shared (which will > > not). Suppose it takes that shared. > > * we call xfs_file_aio_write_checks(), which, for starters, might > > modify position (on O_APPEND) and size (on rlimit). Which renders the > > alignment checks useless, of course, but what's worse, it proceeds to > > calling xfs_break_layouts(), which might drop and retake XFS part of what's > > taken by xfs_rw_iolock(). Retake it exclusive, and update the iolock flag > > passed to it by reference accordingly. And when we return to > > xfs_file_aio_write_checks(), and do xfs_rw_iunlock(), we'll end up dropping > > exclusively taken XFS part of things *and* ->i_mutex we'd never taken. > > I might be misreading that code (it sure as hell wouldn't be > > the first time when xfs_{rw_,}_ilock() is involved), but it looks dubious > > to me... > > I don't think aio_write() makes sense on an O_APPEND file (for the > same reason pwrite() doesn't), but we might be stuck with it. People > who do that are insane and probably deserve whatever crazy semantics > they get (and if they rely on them, we shouldn't change them in the > name of "make things sane"). > > If the lack of proper locking causes coherence problems, that's a XFS > bug, of course. It doesn't have to be O_DIRECT; setrlimit(2) from another thread is enough to screw the alignment to hell and back and mutex_unlock() of something we hadn't done mutex_lock() to is definitely a bug (don't need O_APPEND to trigger that either; needs pNFS involved, AFAICS). I'd really like comments from Christoph and Dave on that on... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html