On Tue, Dec 11, 2012 at 12:16:03PM +0000, Steven Whitehouse wrote: > On Mon, 2012-12-10 at 13:20 -0500, Theodore Ts'o wrote: > > A sentence or two got chopped out during an editing pass. Let me try > > that again so it's a bit clearer what I was trying to say.... > > > > Sure, but if the block device supports WRITE_SAME or persistent > > discard, then presumably fallocate() should do this automatically all > > the time, and not require a flag to request this behavior. The only > > reason why you might not is if the WRITE_SAME is more costly. That is > > when a seek plus writing 1MB does take more time than the amount of > > disk time fraction that it consumes if you compare it to a seek plus > > writing 4k or 32k. > > > Well there are two cases here I think.... /me nods > The other case is ext4/XFS type case where the metadata does support > "these blocks are allocated but zero" which means that the metadata > needs to be changed twice. Once to "these blocks are allocated but zero" > at fallocate time and again to "these blocks have valid content" at > write time. As I understand the issue, the problem is that this second > metadata change is what is causing the performance issue. Apparently so ;) > > Ext4 currently uses a threshold of 32k for this break point (below > > that, we will use sb_issue_zeroout; above that, we will break apart an > > uninitialized extent when writing into a preallocated region). It may > > be that 32k is too low, especailly for certain types of devices (i.e., > > SSD's versus RAID 5, where it should be aligned on a RAID strip, > > etc.). More of an issue might be that there will be some disagreement > > about whether people want to the system to automatically tune for > > average throughput vs 99.9 percentile latency. > > > > Regardless, this is actually something which I think the file system > > should try to do automatically if at all possible, via some kind of > > auto-tuning hueristic, instead of using an explicit fallocate(2) flag. > > (See, I don't propose using a new fallocate flag for everything. :-) Preallocation is primarily there for the application to optimise file layout when the filesystem heuristics fail to do the right thing. IOWs, we should not be adding heuristics to "tune" fallocate behaviour automatically because a) we are guaranteed to get heuristics wrong for some applications, and b) fallocate is for direct application control and hence avoid filesystem hueristic based problems. > It sounds like it might well be worth experimenting with the thresholds > as you suggest, 32k is really pretty small. I guess that the real > question here is what is the cost of the metadata change (to say what is > written and what remains unwritten) vs. simply zeroing out the unwritten > blocks in the extent when the write occurs. Exactly my concern. 32k is a range that is arbitrary, and if it cannot be changed to suit your workload then if it sucks.... Given that fallocate is supposed to allow applications to avoid such allocation heuristics if they are sub-optimal, this fixed "zero-around" range seems like a bad model to follow. If I implement a zero-around for direct IO on unwritten extents for XFS, I'll use the per-inode extent granularity hint that is held in the inode to determine the range to write zeros to. That will enable us to have a per-file configurable allocation/unwritten extent conversion granularity and hence bound the maximum size of the extent tree for a given workload+data set. The real problem is that to do zero-around efficiently, the zeroing needs to be done as part of the direct IO submission process. That code only supports block size zeroing rather than arbitrary ranges, and it's already very messy because of the rat's nest of micro-optimisations. However, doing the zero-around as part of IO submission would certainly remove a significant amount of the ongoing extent conversion overhead from the workloads Chris has pointed out as troublesome... > There are likely to be a number of factors affecting that, and the > answer doesn't appear straightforward, Right, which is where making it per-file configurable makes sense. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html