On Thu, Jan 06, 2011 at 09:10:29AM +0100, Michael Monnerie wrote:
> On Mittwoch, 5. Januar 2011 Dave Chinner wrote:
> > No state or additional on-disk structures are needed for xfs_fsr
> > to do its work....
>
> That's not exactly the same - once you've defragged a file, you know
> it's done, and can skip it next time.

Sure, but the way xfs_fsr skips it is by physically checking the inode
on the next filesystem pass. It does that efficiently because the
necessary information is cheap to read (via bulkstat), not because we
track what needs defrag in the filesystem on every operation.

> But you don't know if the (free) space between block 0 and 20 on disk
> has been rewritten since the last trim run or not used at all, so
> you'd have to do it all again.

Sure, but the block device should, and therefore a TRIM to an area
with nothing to trim should be fast. Current-generation drives still
have problems with this, but once device implementations are better
optimised there should be little penalty for trying to trim a region
that currently holds no data on the device.

Basically, we need to design for the future, not for the limitations
of the current generation of devices....

> > The background trim is intended to enable even the slowest of
> > devices to be trimmed over time, while introducing as little
> > runtime overhead and complexity as possible. Hence adding
> > complexity and runtime overhead to optimise background trimming
> > tends to defeat the primary design goal....
>
> It would be interesting to have real world numbers to see what's
> "best". I'd imagine a normal file or web server stores tons of files
> that are mostly read-only, while 5% of them are used a lot, plus lots
> of temp files. For this, knowing what's been used would be great.

A filesystem does not necessarily reuse the same blocks for temporary
data. That "5%" of data that is written and erased all the time could
end up spanning 50% of the filesystem free space over the period of a
week....

> Also, I'm thinking of a NetApp storage that has been set up to run
> deduplication on Sunday. It's best to run trim on Saturday, and it
> should be finished before Sunday. For big storage that might not be
> easy to finish if all disk space has to be freed explicitly.
>
> And wouldn't it still be cheaper to keep a "written bmap" than to run
> over the full space of a (big) disk? I'd say it depends on the
> workload.

So, let's keep a "used free space" tree in the filesystem for this
purpose. I'll spell out what it means in terms of runtime overhead for
you.

Firstly, every extent that is freed now needs to be inserted into the
new used free space tree. That means transaction reservations all
increase in size by 30%, log traffic increases by 30%, CPU overhead
increases by ~30%, buffer cache footprint increases by 30%, and we've
got 30% more metadata to write to disk. (30% because there are already
two free space btrees that are updated on every extent free.)

Secondly, when we allocate an extent, we now have to check whether the
extent is in the used free space btree and remove it from there if it
is. That adds another btree lookup and modification to the allocation
code, which adds roughly 30% overhead there as well.

That's a lot of additional runtime overhead. And then we have to
consider the userspace utilities - we need to add code to mkfs,
xfs_repair, xfs_db, etc. to enable checking and repairing of the new
btree, cross-checking that every extent in the used free space tree is
also in the free space trees, and so on.
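To make that per-operation cost concrete, here is a rough userspace
sketch of the bookkeeping such a tree implies - record on every extent
free, look up and remove on every allocation, walk and clear on
background trim. The names and the flat array are purely illustrative
stand-ins for a third on-disk btree; none of this is real XFS code:

/*
 * Toy model of the proposed "used free space" tracking. The real
 * thing would be a third on-disk btree updated transactionally
 * alongside the existing by-block and by-size free space btrees;
 * a flat array stands in for it here so the extra work done on
 * every operation is easy to see.
 */
#include <stdio.h>

struct extent {
	unsigned long long start;	/* first block of the range */
	unsigned long long len;		/* length in blocks */
};

#define UFS_MAX	1024

static struct extent ufs_set[UFS_MAX];	/* stand-in for the new tree */
static int ufs_count;

/* Every extent free now also records the range as freed since last trim. */
static void ufs_record_free(unsigned long long start, unsigned long long len)
{
	if (ufs_count == UFS_MAX) {
		/* a real btree has no such limit; the toy model does */
		fprintf(stderr, "used-free-space set full, dropping record\n");
		return;
	}
	ufs_set[ufs_count].start = start;
	ufs_set[ufs_count].len = len;
	ufs_count++;
}

/*
 * Every allocation now also has to look the range up and drop any
 * overlapping record, so a reused block is not discarded later.
 * (A real btree would split partially overlapping records instead
 * of dropping them whole.)
 */
static void ufs_record_alloc(unsigned long long start, unsigned long long len)
{
	unsigned long long end = start + len;
	int i = 0;

	while (i < ufs_count) {
		struct extent *e = &ufs_set[i];

		if (e->start + e->len <= start || e->start >= end) {
			i++;			/* no overlap, keep it */
			continue;
		}
		ufs_set[i] = ufs_set[--ufs_count];	/* overlap, drop it */
	}
}

/* Background trim only walks the recorded ranges, then forgets them. */
static void ufs_trim_all(void)
{
	int i;

	for (i = 0; i < ufs_count; i++)
		printf("TRIM blocks %llu-%llu\n", ufs_set[i].start,
		       ufs_set[i].start + ufs_set[i].len - 1);
	ufs_count = 0;
}

int main(void)
{
	ufs_record_free(100, 50);	/* a file is deleted */
	ufs_record_free(500, 8);	/* and another */
	ufs_record_alloc(120, 10);	/* part of that space is reused */
	ufs_trim_all();			/* only blocks 500-507 get trimmed */
	return 0;
}

Even in this toy form the trade-off is plain: the cost lands on every
free and every allocation, while the benefit only shows up when the
background trim actually runs.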
That's a lot of work on top of just the kernel allocation code changes
needed to keep the new tree up to date.

IMO, tracking used free space to optimise background trim is premature
optimisation - it might be needed for a year or two, but it will take
at least that long to get such an optimisation stable enough to
consider for enterprise distros, at which point it probably won't be
needed any more.

Realistically, we have to design for how we expect devices to behave
in 2-3 years' time, not waste time trying to optimise for
fundamentally broken devices that nobody will be using in 2-3 years'
time...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx