There are several questions regarding the write(2) semantics, and I'd like to see comments on those. All of that is for regular files. 1) should we ever update the current position when write returns an error? As it is, write(2) explicitly ignores any changes of position when ->write() has returned an error, but some other callers of vfs_write() are not so careful. 2) should we ever update the current position when write() returns 0? IOW, what effect should zero-length write() on O_APPEND file have upon its current position? POSIX seems to imply that it should do nothing, and generally that's what happens, but e.g. ext4 *does* update position to the EOF, whether we will write anything or not. So does FUSE when server requests to bypass the page cache. AFAICS, lustre is the same way, but I might be missing something; everything else definitely does not update position in that case. IMO the common behaviour is correct and ext4 one is a bug. 3) pwrite(2): POSIX seems to require ignoring the O_APPEND completely for that syscall. We definitely do not. It's arguable whether this is desired or not, but it's an existing behaviour that had been that way since we'd got pwrite(2) in the kernel (2.1.60). Probably too late to do anything about that. 4) at lower level, there's a nasty case when short (but non-empty) O_DIRECT write followed by success of fallback to buffered write and a failure of filemap_write_and_wait_range() yields a return of the amount written by ->direct_IO() *and* update of current position by that plus the amount reported by buffered write. IOW, we shift the offset by amount different from (positive) value we'll be returning from write(2). That's a direct POSIX violation and I would expect the userland to be very surprised by running into that. IMO it's a bug and we would be better off by shifting position by the amount we'll be returning. 5) somewhat related: nfs_direct_IO() ends up calling nfs_file_direct_write(), which calls generic_write_checks(); it's triggered by swap-over-NFS (normal O_DIRECT writes go directly to nfs_file_direct_write()), and it ends up being subject to rlimit of caller. Which might be anyone who calls alloc_pages(), AFAICS. Almost certainly a bug. 6) XFS seems to have fun bugs in O_DIRECT handling. Consider the following scenario: * O_DIRECT write() is called, we hit xfs_file_dio_aio_write(). * we check alignment and make decision whether to do xfs_rw_ilock exclusive (which will include i_mutex) or shared (which will not). Suppose it takes that shared. * we call xfs_file_aio_write_checks(), which, for starters, might modify position (on O_APPEND) and size (on rlimit). Which renders the alignment checks useless, of course, but what's worse, it proceeds to calling xfs_break_layouts(), which might drop and retake XFS part of what's taken by xfs_rw_iolock(). Retake it exclusive, and update the iolock flag passed to it by reference accordingly. And when we return to xfs_file_aio_write_checks(), and do xfs_rw_iunlock(), we'll end up dropping exclusively taken XFS part of things *and* ->i_mutex we'd never taken. I might be misreading that code (it sure as hell wouldn't be the first time when xfs_{rw_,}_ilock() is involved), but it looks dubious to me... My preference would be to have new_sync_write() and vfs_iter_write() to ignore iocb.ki_pos when ->write_iter() returns negative or zero (would take care of (1) and (2)) and have __generic_file_write_iter() to do ->ki_pos update in sync with what it'll be returning (takes care of (4)). (3) is probably too old to fix, (5) should have generic_write_checks() done outside of fs/nfs/direct.c. No idea on (6) and I would really like to hear from XFS folks before doing anything to that one. Comments? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html