On Jun 30, 2007 11:21 +0100, Christoph Hellwig wrote:
> On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote:
> > Currently it is left on the file system implementation. In ext4, we do
> > not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
> > end up with partial (pre)allocation. This is inline with dd and
> > posix_fallocate, which also do not free the partially allocated space.
>
> I can't find anything in the specification of posix_fallocate
> (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html)
> that tells what should happen to allocate blocks on error.
>
> But common sense would be to not leak disk space on failure of this
> syscall, and this definitively should not be left up to the filesystem,
> either we always leak it or always free it, and I'd strongly favour
> the latter variant.

I definitely agree that the behaviour should be specified as part of the
interface.

The current behaviour of both ext4 and XFS is that the successfully
preallocated part of the extent is left in place when returning ENOSPC,
so we considered this the "consistent" behaviour. This is the same as
e.g. sys_write(), which does not remove the part of the write that was
successful if ENOSPC is hit.

I think this also makes sense for some use cases, because an application
like a PVR may want to preallocate approximately 30min of space, but if
it gets only 25min worth then it can at least start using this while it
also begins looking for and/or freeing old files. If the space is always
freed on ENOSPC, then there may be a significant amount of work done and
undone while the application is iterating over possible sizes until one
works.

It is easy for the application to use fstat() to see the blocks/size
actually preallocated on failure, and explicitly request unallocation of
this space if the outcome is undesirable.

If you think that applications have a strong preference for both kinds
of behaviour (e.g. a database which requires the full allocation to
succeed, unlike the PVR application above) then this could be encoded
into a @mode flag.

> > > For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
> > > don't want to expose uninitialized disk blocks to userspace. I'm not
> > > sure if this makes sense at all.
>
> This is the xfs unwritten extent behaviour. But anyway, the important bit
> is uninitialized blocks should never ever leak to userspace, so there is
> not need for the flag.

I agree that we shouldn't need FA_ZERO_SPACE. If an application wants
explicit zeros written to disk it can just do this with O_DIRECT writes
or similar.

> The more I think about it the more I'd prefer we would just put a simple
> syscall in that implements nothing but the posix_fallocate(3) semantics
> as defined in SuS, and then go on to brainstorm about advanced
> preallocation / layout hint semantics.

I don't think the current @mode flags introduce any significant
complexity in the implementation, and in fact one of the reasons these
came up in the first place was because David pointed out the XFS
behaviour did NOT match with posix_fallocate() and we started getting
strange semantics enforced by monolithic modes. IMHO, coding for and
understanding the semantics of the monolithic modes is much more complex
and less useful than the explicit flags.
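
For what it's worth, here is a rough userspace sketch of the fstat()
check described above. It assumes Linux (where st_blocks counts 512-byte
units) and uses posix_fallocate(3) as a stand-in for the syscall under
discussion. The names try_prealloc(), drop_prealloc() and struct
prealloc_result are made up for illustration and are not part of any
proposed interface:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for posix_fallocate() and ftruncate() */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result record: what the file holds after an attempt. */
struct prealloc_result {
    int   err;    /* 0 on success, else an error number such as ENOSPC */
    off_t size;   /* st_size after the attempt */
    off_t bytes;  /* space actually allocated (st_blocks * 512 on Linux) */
};

/*
 * Hypothetical helper: try to preallocate [0, len) and then ask fstat()
 * what is really there, whether or not the allocation succeeded.
 * Returns 0 if the result record was filled in, -1 if fstat() failed.
 */
static int try_prealloc(int fd, off_t len, struct prealloc_result *out)
{
    struct stat st;

    assert(out != NULL);

    /* posix_fallocate() returns the error number directly, not via errno. */
    out->err = posix_fallocate(fd, 0, len);

    if (fstat(fd, &st) != 0)
        return -1;

    out->size  = st.st_size;
    out->bytes = (off_t)st.st_blocks * 512;
    return 0;
}

/*
 * Hypothetical helper: give back space past old_size when a partial
 * result is not acceptable. Assumes old_size is the size the file had
 * before the attempt, so no existing data is cut off.
 */
static int drop_prealloc(int fd, off_t old_size)
{
    return ftruncate(fd, old_size);
}
```

With something like this, the PVR case would compare res.bytes against
the minimum it can usefully record into when res.err is ENOSPC and carry
on, while the database case would call drop_prealloc() with the original
size and report the failure.
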

The @mode flags that are currently under consideration are (AFAIK):

FA_FL_DEALLOC	0x01 /* deallocate unwritten extent (default allocate) */
FA_FL_KEEP_SIZE	0x02 /* keep size for EOF {pre,de}alloc (default change size) */
FA_FL_DEL_DATA	0x04 /* delete existing data in alloc range (default keep) */

Your concern about leaking space would imply:

FA_FL_ERR_FREE	0x08 /* free preallocation on error (default keep prealloc) */

The other possible flags that were proposed, to avoid confusing backup
and HSM applications when preallocated space is added or removed from a
file (you don't want a backup app to re-backup a file that was migrated
via HSM):

FA_FL_NO_MTIME	0x10 /* keep same mtime (default change on size, data change) */
FA_FL_NO_CTIME	0x20 /* keep same ctime (default change on size, data change) */

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.