On Tue, Nov 13, 2012 at 11:13:55AM +0200, Linas Jankauskas wrote:
> trace-cmd output was about 300mb, so im pasting first 100 lines of
> it, is it enough?:

....

> > Rsync command:
> >
> > /usr/bin/rsync -e ssh -c blowfish -a --inplace --numeric-ids
> > --hard-links --ignore-errors --delete --force

Ok, so you are overwriting in place and deleting files/dirs that
don't exist anymore. And they are all small files.

> xfs_bmap on one random file:
>
> EXT: FILE-OFFSET  BLOCK-RANGE               AG AG-OFFSET               TOTAL FLAGS
>   0: [0..991]:    26524782560..26524783551  12 (754978880..754979871)    992 00000
>
> xfs_db -r -c "frag" /dev/sda5
> actual 81347252, ideal 80737778, fragmentation factor 0.75%

And that indicates file fragmentation is not an issue.

> agno: 0

Not too bad.

> agno: 1
>
>    from      to  extents    blocks    pct
>       1       1    74085     74085   0.05
>       2       3    97017    237788   0.15
>       4       7   165766    918075   0.59
>       8      15  2557055  35731152  22.78

And there's the problem. Free space is massively fragmented in the
8-16 block size (32-64k) range. All the other AGs show the same
pattern:

>       8      15  2477693  34631683  18.51
>       8      15  2479273  34656696  20.37
>       8      15  2440290  34132542  20.51
>       8      15  2461646  34419704  20.38
>       8      15  2463571  34439233  21.06
>       8      15  2487324  34785498  19.92
>       8      15  2474275  34589732  19.85
>       8      15  2438528  34100460  20.69
>       8      15  2467056  34493555  20.04
>       8      15  2457983  34364055  20.14
>       8      15  2438076  34112592  22.48
>       8      15  2465147  34481897  19.79
>       8      15  2466844  34492253  21.44
>       8      15  2445986  34205258  21.35
>       8      15  2436154  34060275  19.60
>       8      15  2438373  34082653  20.59
>       8      15  2435860  34057838  21.01

Given the uniform distribution of the freespace fragmentation, the
problem is most likely the fact you are using the inode32
allocator. What it does is keep inodes in AG 0 (below 1TB) and
rotors data extents across all the other AGs. Hence AG 0 has a
different freespace pattern because it mainly contains metadata.

The data AGs are showing the signs of files with no reference
locality being packed adjacent to each other when written, then
randomly removed, which leaves a swiss-cheese style of freespace
fragmentation. The result is freespace btrees that are much, much
larger than usual, and each AG is being randomly accessed by each
userspace process. This leads to long lock hold times during
searches, and access from multiple CPUs at once slows things down
and adds to lock contention.

It appears that the threshold that limits performance for your
workload and configuration is around 2.5 million freespace extents
in a single size range. Most likely it is a linear scan of
duplicate sizes trying to find the best block number match that is
chewing up all the CPU. That's roughly what the event trace shows.

I don't think you can fix a filesystem once it's got into this
state. It's aged severely and the only way to fix freespace
fragmentation is to remove files from the filesystem. In this case,
mkfs.xfs is going to be the only sane way to do that, because it's
much faster than removing 90 million inodes...

So, how to prevent it from happening again on a new filesystem?

Using the inode64 allocator should prevent this freespace
fragmentation from happening. It allocates file data in the same AG
as the inode, and inodes are grouped in an AG based on the parent
directory location. Directory inodes are rotored across AGs to
spread them out. The way it searches for free space for new files
is different, too, and will tend to fill holes near to the inode
before searching wider. Hence it's a much more local search, and it
will fill holes created by deleting files/dirs much faster, leaving
less swiss-cheese freespace fragmentation around.

The other thing, given you have lots of rsyncs running at once, is
to increase the number of AGs to reduce their size. More AGs will
increase allocation parallelism, reducing contention, and also
reduce the size of each freespace btree if freespace fragmentation
does occur. Given you are tracking lots of small files (90 million
inodes so far), I'd suggest increasing the number of AGs by an
order of magnitude so that their size drops from 1TB down to 100GB.
Even if freespace fragmentation then does occur, it is spread over
10x the number of freespace btrees, and hence will have
significantly less effect on performance.

FWIW, you probably also want to set allocsize=4k as well, as you
don't need speculative EOF preallocation on your workload to avoid
file fragmentation....
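Off the top of my head, something like this should do it for the
new filesystem. The mount point and the exact agsize are just
placeholders - adjust them to your device and layout:

  # make ~100GB AGs rather than the current ~1TB AGs
  mkfs.xfs -d agsize=100g /dev/sda5

  # use the inode64 allocator and turn off speculative EOF preallocation
  # (/backup is a placeholder mount point)
  mount -o inode64,allocsize=4k /dev/sda5 /backup

Putting inode64,allocsize=4k in the options field of the fstab
entry will make it stick across remounts.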
Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs