On Thu, Jun 03, 2010 at 05:55:38AM +1000, Nick Piggin wrote: > Well I looked through filesystems and I cannot seem to find anywhere > they can use ATTR_SIZE|ATTR_[MC]TIME to do anything strange that would > let them distinguish truncate from ftruncate and O_TRUNC. > > I couldn't quite understand what FUSE was trying to do, or whether I > improved it. Looking through the filesystems, I thought I found a number of issues. I'll try to list what I understand to be important steps required in setattr. Unless I'm way off, a lot of filesystems get these wrong, so we may want to add some guideline to docs? Anyway feedback on these would be appreciated. Below this there are some questions about specific filesystems too. Filesystems setattr methods should: - Perform idempotent error checks as early as possible. Perform actual i_size change and pagecache truncation last[*]. Truncate means dirty pagecache is fine to be thrown out at any time[**], what is important is that it is not thrown out before the operation can complete (eg. by doing the on-disk truncation). [*] Pagecache truncation is irreversable unless the filesystem can guarantee that the page is not mapped and has no dangling get_user_pages references (it's almost impossible to check this properly unless get_user_pages is disallowed from the start) or the file has been readonly. Just writing back the pagecache does not guarantee no data loss window. So it's probably easiest to do pagecache truncation last. [**] Be careful, this might mean the page is subject to writeout inside i_size after the on-disk truncation. Buffer based filesystems will probably need rework to handle this properly. - Update mtime and ctime IFF ATTR_MTIME and ATTR_CTIME are set in setattr. - Update mode before owner or size if mode update can fail. Otherwise killing of suid could fail after owner/size has changed (ideally these would be all done atomically). Although we do already have weird races in permission checking in core vfs code anyway. - Perform time modifications iff others have/will succeeded (these are applicable for chmod, chown, etc too). Do not succeed a truncate and then error out because the time cannot be set. Do not set the time but fail the truncate or chown etc. - If ATTR_SIZE change cannot succeed, don't ignore it, return something (like -EPERM or -EINVAL), right? (lots of filesystems just mask it off, maybe that is preferable behaviour sometimes?) - Remember to actually trim off disk blocks or underlying object past i_size in response to setattr shrinking the file. - In case of write_begin/write_end/direct_IO failure, trim object past i_size if it was possible to have been allocated before the operation fails. Interesting specific filesystem questions: (not comprehensive) ecryptfs improve mtime/ctime handling for truncate of lower inode fuse, should be setting ctime on truncate? gfs2, mtime and ctime should only be set in truncate if the ATTR_?TIME are set. hostfs, don't ignore ATTR_SIZE. ftruncate/truncate have interesting different semantics for timestamp updates, so don't use fd != -1 for these (could use utimes for this, but it causes an atomicity/error handling issue). hpfs could use cont_expand to do expanding truncates? ncpfs, nfs smbfs, cifs seems to perform inode size checks (inode_newsize_ok) after truncating the file on the server?? ntfs does not do inode size checks properly. logfs logfs_truncate without doing inode_change_ok, inode_newsize_ok etc can fail inode_setattr after successful truncate (truncate(2) would return failure and yet the file would be truncated). procfs why not return -EPERM rather than setting size? reiserfs should do all setattr checks early as possible (inode_change_ok, inode_newsize_ok). ubifs should not update access times unconditionally when ATTR_SIZE is set ufs simple_setsize still destroys the pagecache and causes concurrent reads to go EOF past updated i_size. Restoring old i_size is too late I think. Should do those checks first (in fairness lots of filesystems have bit problems with error handling, but ufs attempts a little bit and gets it wrong). xfs in the 0 length file shortcut, isn't this missing suid kill case? (ftruncate mandatory mtime update seems like it wasn't working right there either before this patch). nfsd should not change attributes in the case of truncated file with no size change? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html