On Wed, Mar 07, 2012 at 12:04:26PM -0600, Eric Sandeen wrote:
> On 3/7/12 11:16 AM, Brian Candler wrote:
> > On Wed, Mar 07, 2012 at 03:54:39PM +0000, Brian Candler wrote:
> >> core.size = 1085407232
> >> core.nblocks = 262370
> > 
> > core.nblocks is correct here: space used = 262370 * 4 = 1049480 KB
> > 
> > (If I add up all the non-hole extents I get 2098944 blocks = 1049472 KB
> > so there are two extra blocks of something)
> > 
> > This begs the question of where stat() is getting its info from?

stat(2) also reports delayed allocation reservations that are only kept
in memory.

....

> so:
> 
> # dd if=/dev/zero of=bigfile bs=1M count=1100 &>/dev/null
> # ls -lh bigfile
> -rw-r--r--. 1 root root 1.1G Mar  7 11:47 bigfile
> # du -h bigfile
> 1.1G	bigfile
> 
> but:
> 
> # rm -f bigfile
> # for I in `seq 1 1100`; do dd if=/dev/zero of=bigfile conv=notrunc bs=1M seek=$I count=1 &>/dev/null; done
> # ls -lh bigfile
> -rw-r--r--. 1 root root 1.1G Mar  7 11:49 bigfile
> # du -h bigfile
> 2.0G	bigfile

This is tripping the NFS server write pattern heuristic, i.e. it is
detecting repeated open/write-at-EOF/close patterns and so is not
truncating away the speculative EOF reservation on close(). This is
what prevents fragmentation of files being written concurrently with
this pattern.

> This should get freed when the inode is dropped from the cache;
> hence your cache drop bringing it back to size.

Right. It assumes that once you've triggered that heuristic, the
preallocation needs to last for as long as the inode is in the working
set. The inode cache tracks the current working set, so the
preallocation release is tied to cache eviction.

> But there does seem to be an issue here; if I make a 4G filesystem
> and repeat the above test 3 times, the 3rd run gets ENOSPC, and
> the last file written comes up short, while the first one retains
> all it's extra preallocated space:
> 
> # du -hc bigfile*
> 2.0G	bigfile1
> 1.1G	bigfile2
> 907M	bigfile3
> 
> Dave, is this working as intended?

Yes. Your problem is that you have a very small filesystem, which is
not the case that we optimise XFS for. :/

> I know the speculative
> preallocation amount for new files is supposed to go down as the
> fs fills, but is there no way to discard prealloc space to avoid
> ENOSPC on other files?

We don't track which files currently have active preallocations; we
only reduce the preallocation size as the filesystem nears ENOSPC.
This generally works just fine when the filesystem size is
significantly greater than the maximum extent size, i.e. the common
case.

The problem you are tripping over here is that the maximum extent size
is greater than the filesystem size, so the preallocation size is also
greater than the filesystem size and hence can contribute significantly
to premature ENOSPC.

I see two possible ways to minimise this problem:

	1. reduce the maximum speculative preallocation size based on
	   the filesystem size at mount time.

	2. track inodes with active speculative preallocation and have
	   an ENOSPC-based trigger that can find them and truncate away
	   excess idle speculative preallocation.

The first is relatively easy to do, but will only reduce the incidence
of your problem - we still need to allow significant preallocation
sizes (e.g. 64MB) to avoid the fragmentation problems.

The second is needed to reclaim the space we've already preallocated
but is not being used. That's more complex to do - probably a radix
tree bit and a periodic background scan to reduce the time window the
preallocation sits around from cache lifetime to "idle for some time",
along with an on-demand, synchronous ENOSPC scan. This will need some
more thought as to how to do it effectively, but isn't impossible to
do....
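To make (1) a bit more concrete, here's a rough userspace sketch of what
a mount-time clamp could look like: scale the per-inode prealloc limit
with the filesystem size, cap it at roughly the maximum extent length,
and keep a floor (the 64MB mentioned above) so fragmentation is still
avoided. The 1/32 scaling factor, the constants and the names are all
made up for illustration - this is not actual XFS code:

/*
 * Illustration only, not XFS code: one way to clamp the maximum
 * speculative preallocation size at mount time based on the size of
 * the filesystem.
 */
#include <stdio.h>
#include <stdint.h>

/* Roughly the maximum extent length in blocks (~8GB at 4k blocks). */
#define MAX_PREALLOC_BLOCKS	(1ULL << 21)
/* 64MB floor (4k blocks) so we still avoid fragmentation. */
#define MIN_PREALLOC_BLOCKS	(16ULL * 1024)

/* Clamp the per-file prealloc limit to 1/32 of the filesystem size. */
static uint64_t max_prealloc_blocks(uint64_t fs_blocks)
{
	uint64_t limit = fs_blocks >> 5;

	if (limit > MAX_PREALLOC_BLOCKS)
		limit = MAX_PREALLOC_BLOCKS;
	if (limit < MIN_PREALLOC_BLOCKS)
		limit = MIN_PREALLOC_BLOCKS;
	return limit;
}

int main(void)
{
	/* A 4GB filesystem with 4k blocks, like the test case above. */
	uint64_t small_fs = 4ULL * 1024 * 1024 * 1024 / 4096;
	/* A 4TB filesystem, closer to what XFS is tuned for. */
	uint64_t big_fs = 4ULL * 1024 * 1024 * 1024 * 1024 / 4096;
	uint64_t small_cap = max_prealloc_blocks(small_fs);
	uint64_t big_cap = max_prealloc_blocks(big_fs);

	printf("4GB fs: cap %llu blocks (%llu MB)\n",
	       (unsigned long long)small_cap,
	       (unsigned long long)(small_cap * 4096 >> 20));
	printf("4TB fs: cap %llu blocks (%llu MB)\n",
	       (unsigned long long)big_cap,
	       (unsigned long long)(big_cap * 4096 >> 20));
	return 0;
}

With those made-up numbers a 4GB filesystem would cap speculative
preallocation at 128MB per file, rather than letting a single file's
preallocation approach the size of the filesystem.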
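For (2), the shape of the idea is roughly: remember which inodes are
carrying speculative preallocation, trim it in a periodic background
pass once it has been idle for a while, and trim all of it synchronously
when we hit ENOSPC. A toy userspace model of that lifecycle follows;
the names, thresholds and the array are illustrative only - the real
thing would be a tag on the inode cache radix tree, not a table:

/*
 * Toy model only, not XFS code: track inodes with speculative
 * preallocation, trim idle preallocation in a background scan, and
 * trim everything in a synchronous scan at ENOSPC.
 */
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define NINODES		4
#define IDLE_SECS	300	/* trim prealloc idle for > 5 minutes */

struct inode {
	uint64_t ino;
	uint64_t prealloc_blocks;	/* speculative prealloc beyond EOF */
	time_t	 last_write;		/* last time the file was written */
	bool	 tagged;		/* stands in for a radix tree tag */
};

static struct inode itable[NINODES];

/* Drop the speculative preallocation beyond EOF and clear the tag. */
static void trim_prealloc(struct inode *ip)
{
	printf("ino %llu: trimming %llu preallocated blocks\n",
	       (unsigned long long)ip->ino,
	       (unsigned long long)ip->prealloc_blocks);
	ip->prealloc_blocks = 0;
	ip->tagged = false;
}

/* Periodic background scan: only touch preallocation that has gone idle. */
static void background_scan(time_t now)
{
	for (int i = 0; i < NINODES; i++) {
		struct inode *ip = &itable[i];

		if (ip->tagged && now - ip->last_write > IDLE_SECS)
			trim_prealloc(ip);
	}
}

/* Synchronous ENOSPC scan: reclaim everything, idle or not. */
static void enospc_scan(void)
{
	for (int i = 0; i < NINODES; i++)
		if (itable[i].tagged)
			trim_prealloc(&itable[i]);
}

int main(void)
{
	time_t now = time(NULL);

	/* One recently written file, two that have gone idle, one clean. */
	itable[0] = (struct inode){ 100, 262144, now - 10,   true };
	itable[1] = (struct inode){ 101, 262144, now - 1000, true };
	itable[2] = (struct inode){ 102, 262144, now - 3600, true };
	itable[3] = (struct inode){ 103, 0,      now - 3600, false };

	puts("background scan:");
	background_scan(now);

	puts("enospc scan:");
	enospc_scan();
	return 0;
}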
Cheers,

Dave.

> 
> -Eric
> 
> > root@storage1:~# du -h /disk*/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 2.0G	/disk10/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 2.0G	/disk11/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 2.0G	/disk12/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 1.1G	/disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 1.1G	/disk2/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 2.0G	/disk3/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 2.0G	/disk4/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 2.0G	/disk5/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 2.0G	/disk6/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 2.0G	/disk7/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 2.0G	/disk8/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 2.0G	/disk9/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > root@storage1:~# echo 3 >/proc/sys/vm/drop_caches
> > root@storage1:~# du -h /disk*/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 1.1G	/disk10/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 1.1G	/disk11/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 1.1G	/disk12/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 1.1G	/disk1/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 1.1G	/disk2/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 1.1G	/disk3/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 1.1G	/disk4/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 1.1G	/disk5/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 1.1G	/disk6/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 1.1G	/disk7/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 1.1G	/disk8/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > 1.1G	/disk9/scratch2/work/PRSRA1/PRSRA1.1.0.bff
> > root@storage1:~#
> > 
> > Very odd, but not really a major problem other than the confusion it causes.
> > 
> > Regards,
> > 
> > Brian.

-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs