On Fri, Sep 05, 2014 at 02:40:32PM +0200, Stefan Priebe - Profihost AG wrote:
> 
> Am 05.09.2014 um 14:30 schrieb Brian Foster:
> > On Fri, Sep 05, 2014 at 11:47:29AM +0200, Stefan Priebe - Profihost AG wrote:
> >> Hi,
> >>
> >> i have a backup system running 20TB of storage having 350 million files.
> >> This was working fine for months.
> >>
> >> But now the free space is so heavily fragmented that i only see the
> >> kworker with 4x 100% CPU and write speed being very slow. 15TB of the
> >> 20TB are in use.

What does perf tell you about the CPU being burnt? (i.e. run perf top
for 10-20s while that CPU burn is happening and paste the top 10 CPU
consuming functions.)

> >> Overall files are 350 Million - all in different directories. Max 5000
> >> per dir.
> >>
> >> Kernel is 3.10.53 and mount options are:
> >> noatime,nodiratime,attr2,inode64,logbufs=8,logbsize=256k,noquota
> >>
> >> # xfs_db -r -c freesp /dev/sda1
> >>    from      to   extents     blocks    pct
> >>       1       1  29484138   29484138   2,16
> >>       2       3  16930134   39834672   2,92
> >>       4       7  16169985   87877159   6,45
> >>       8      15  78202543  999838327  73,41

With an inode size of 256 bytes, this is going to be your real problem
soon - most of the free space is smaller than an inode chunk, so soon
you won't be able to allocate new inodes even though there is free
space on disk.

Unfortunately, there's not much we can do about this right now - we need
development in both user and kernel space to mitigate this issue: sparse
inode chunk allocation in kernel space, and free space defragmentation
in userspace. Both are on the near term development list....

Also, the fact that there are almost 80 million 8-15 block extents
indicates that the CPU burn is likely coming from the by-size free
space search. We look up the first extent of the correct size, and then
do a linear search for the extent of that size nearest to the target.
Hence we could be searching millions of extents to find the
"nearest"....

> >>      16      31   3562456   83746085   6,15
> >>      32      63   2370812  102124143   7,50
> >>      64     127    280885   18929867   1,39
> >>     256     511         2        827   0,00
> >>     512    1023        65      35092   0,00
> >>    2048    4095         2       6561   0,00
> >>   16384   32767         1      23951   0,00
> >>
> >> Is there anything i can optimize? Or is it just a bad idea to do this
> >> with XFS?

No, it's not a bad idea. In fact, if you have this sort of use case,
XFS is really your only choice.

In terms of optimisation, the only thing that will really help
performance is the new finobt structure. That's a mkfs option and not
an in-place change, though, so it's unlikely to help you here.

FWIW, it may also help the aging characteristics of this sort of
workload by improving inode allocation layout. That would be a side
effect of being able to search the entire free inode btree extremely
quickly, rather than allocating new chunks to keep down the CPU time
spent searching the allocated inode btree for free inodes. Hence it
would tend to pack inode chunks more tightly when they are allocated on
disk, as it will fill full chunks before allocating new ones elsewhere.

> >> Any other options? Maybe rsync options like --inplace /
> >> --no-whole-file?

For 350M files? I doubt there's much you can really do. Any sort of
large scale re-organisation is going to take a long, long time and
require lots of IO.

If you are going to take that route, you'd do better to upgrade the
kernel and xfsprogs, then dump/mkfs.xfs -m crc=1,finobt=1/restore. And
you'd probably want to use a multi-stream dump/restore so it can run
operations concurrently and hence at storage speed rather than being
CPU bound....
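As a rough sketch of that rebuild path - the device, mount point,
scratch paths and stream count here are purely illustrative, and the
multi-stream xfsdump/xfsrestore invocations should be checked against
the xfsdump(8)/xfsrestore(8) man pages before running anything, since
the mkfs step destroys the existing filesystem:

  # xfsdump -l 0 -f /scratch/backup.0 -f /scratch/backup.1 /backup
        (each additional -f adds another concurrent dump stream)
  # umount /backup
  # mkfs.xfs -f -m crc=1,finobt=1 /dev/sda1
        (needs a recent xfsprogs for the crc/finobt options)
  # mount /dev/sda1 /backup
  # xfsrestore -f /scratch/backup.0 -f /scratch/backup.1 /backup

The point of the multiple streams is simply to keep several dump and
restore threads in flight at once, so the copy runs at storage speed
rather than being limited by a single CPU.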
Also, if the problem really is the number of identically sized free
space fragments in the freespace btrees, then the initial solution is,
again, a mkfs one. i.e. remake the filesystem with more, smaller AGs to
keep the number of extents the btrees need to index down to a
reasonable level. Say a couple of hundred AGs rather than 21? (A rough
mkfs sketch is appended at the end of this message.)

> > If so, I wonder if something like the
> > following commit introduced in 3.12 would help:
> > 
> > 133eeb17 xfs: don't use speculative prealloc for small files
> 
> Looks interesting.

Probably won't make any difference, because backups via rsync do
open/write/close and don't touch the file data again, so the close will
be removing speculative preallocation before the data is written and
extents are allocated by background writeback....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
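For the AG sizing suggestion above, a hedged sketch - the device and
the exact AG count are illustrative, chosen so a ~20TB device ends up
with AGs of roughly 100GB instead of the ~1TB AGs implied by 21 AGs:

  # mkfs.xfs -f -m crc=1,finobt=1 -d agcount=200 /dev/sda1
  # mount /dev/sda1 /backup
  # xfs_info /backup
        (check the reported agcount/agsize)

With smaller AGs, each per-AG by-size freespace btree indexes far fewer
extents, so the linear "nearest extent" search described earlier has
much less to walk.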