Re: Is XFS suitable for 350 million files on 20TB storage?

On Sat, Sep 06, 2014 at 09:35:15AM +0200, Stefan Priebe wrote:
> Hi Dave,
> 
> On 06.09.2014 01:05, Dave Chinner wrote:
> >On Fri, Sep 05, 2014 at 02:40:32PM +0200, Stefan Priebe - Profihost AG wrote:
> >>
> >>On 05.09.2014 at 14:30, Brian Foster wrote:
> >>>On Fri, Sep 05, 2014 at 11:47:29AM +0200, Stefan Priebe - Profihost AG wrote:
> >>>>Hi,
> >>>>
> >>>>I have a backup system with 20TB of storage holding 350 million files.
> >>>>This was working fine for months.
> >>>>
> >>>>But now the free space is so heavily fragmented that I only see the
> >>>>kworker threads at 4x 100% CPU and write speed being very slow. 15TB
> >>>>of the 20TB are in use.
> >
> >What does perf tell you about the CPU being burnt? (i.e run perf top
> >for 10-20s while that CPU burn is happening and paste the top 10 CPU
> >consuming functions).
> 
> here we go:
>  15,79%  [kernel]            [k] xfs_inobt_get_rec
>  14,57%  [kernel]            [k] xfs_btree_get_rec
>  10,37%  [kernel]            [k] xfs_btree_increment
>   7,20%  [kernel]            [k] xfs_btree_get_block
>   6,13%  [kernel]            [k] xfs_btree_rec_offset
>   4,90%  [kernel]            [k] xfs_dialloc_ag
>   3,53%  [kernel]            [k] xfs_btree_readahead
>   2,87%  [kernel]            [k] xfs_btree_rec_addr
>   2,80%  [kernel]            [k] _xfs_buf_find
>   1,94%  [kernel]            [k] intel_idle
>   1,49%  [kernel]            [k] _raw_spin_lock
>   1,13%  [kernel]            [k] copy_pte_range
>   1,10%  [kernel]            [k] unmap_single_vma
> 

The top 6 or so items look related to inode allocation, so that probably
confirms the primary bottleneck: searching around for free inodes in the
existing inode chunks, which is precisely what the finobt is intended to
resolve. The finobt was introduced in 3.16, so unfortunately it is not
available in 3.10.

Brian
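
(For illustration, assuming the array gets reformatted anyway and a
reasonably recent xfsprogs is available - the device and mount point below
are just placeholders - making a filesystem with the free inode btree and
verifying it would look roughly like:

  # mkfs.xfs -m crc=1,finobt=1 /dev/sda1
  # mount /dev/sda1 /mnt/backup
  # xfs_info /mnt/backup        <- a new enough xfs_info reports finobt=1
                                   in its meta-data section

The finobt only speeds up finding free inodes in existing chunks; the free
space fragmentation discussed below is a separate problem.)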

> >>>>
> >>>>Overall there are 350 million files - all in different directories,
> >>>>at most 5000 per directory.
> >>>>
> >>>>Kernel is 3.10.53 and mount options are:
> >>>>noatime,nodiratime,attr2,inode64,logbufs=8,logbsize=256k,noquota
> >>>>
> >>>># xfs_db -r -c freesp /dev/sda1
> >>>>    from      to extents  blocks    pct
> >>>>       1       1 29484138 29484138   2,16
> >>>>       2       3 16930134 39834672   2,92
> >>>>       4       7 16169985 87877159   6,45
> >>>>       8      15 78202543 999838327  73,41
> >
> >With an inode size of 256 bytes, this is going to be your real
> >problem soon - most of the free space is smaller than an inode
> >chunk so soon you won't be able to allocate new inodes, even though
> >there is free space on disk.
> >
> >Unfortunately, there's not much we can do about this right now - we
> >need development in both user and kernel space to mitigate this
> >issue: sparse inode chunk allocation in kernel space, and free space
> >defragmentation in userspace. Both are on the near term development
> >list....
> >
> >Also, the fact that there are almost 80 million 8-15 block extents
> >indicates that the CPU burn is likely coming from the by-size free
> >space search. We look up the first extent of the correct size, and
> >then do a linear search for a nearest extent of that size to the
> >target. Hence we could be searching millions of extents to find the
> >"nearest"....
> >
> >>>>      16      31 3562456 83746085   6,15
> >>>>      32      63 2370812 102124143   7,50
> >>>>      64     127  280885 18929867   1,39
> >>>>     256     511       2     827   0,00
> >>>>     512    1023      65   35092   0,00
> >>>>    2048    4095       2    6561   0,00
> >>>>   16384   32767       1   23951   0,00
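
(As a quick way to quantify the by-size search problem Dave describes
above, xfs_db's freesp command can also print a summary, optionally per
AG; this is read-only, flags as per xfs_db(8):

  # xfs_db -r -c "freesp -s" /dev/sda1        <- totals plus average free
                                                 extent size
  # xfs_db -r -c "freesp -s -a 0" /dev/sda1   <- the same for AG 0 only

Tens of millions of free extents clustered in one size bucket is exactly
the case where the linear "nearest extent of that size" search gets
expensive.)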
> >>>>
> >>>>Is there anything i can optimize? Or is it just a bad idea to do this
> >>>>with XFS?
> >
> >No, it's not a bad idea. In fact, if you have this sort of use case,
> >XFS is really your only choice. In terms of optimisation, the only
> >thing that will really help performance is the new finobt structure.
> >That's a mkfs option and not an in-place change, though, so it's
> >unlikely to help.
> 
> I've no problem with reformatting the array - I have other backups.
> 
> >FWIW, it may also help the aging characteristics of this sort of
> >workload by improving inode allocation layout. That would be a side
> >effect of being able to search the entire free inode btree extremely
> >quickly, rather than allocating new chunks just to keep down the CPU
> >time spent searching the allocated inode btree for free inodes. Hence
> >it would tend to pack inode chunks more tightly when they are allocated
> >on disk, as it will fill partially used chunks before allocating new
> >ones elsewhere.
> >
> >>>>Any other options? Maybe rsync options like --inplace /
> >>>>--no-whole-file?
> >
> >For 350M files? I doubt there's much you can really do. Any sort of
> >large scale re-organisation is going to take a long, long time and
> >require lots of IO. If you are going to take that route, you'd do
> >better to upgrade kernel and xfsprogs, then dump/mkfs.xfs -m
> >crc=1,finobt=1/restore. And you'd probably want to use a
> >multi-stream dump/restore so it can run operations concurrently and
> >hence at storage speed rather than being CPU bound....
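
(A rough sketch of that route, with placeholder paths and labels and a
single dump stream - per xfsdump(8), additional "-f destination" options,
each paired with its own -M label, should give the multi-stream variant
Dave mentions:

  # xfsdump -l 0 -L backup -M media0 -f /target/stream0 /mnt/backup
  # umount /mnt/backup
  # mkfs.xfs -m crc=1,finobt=1 /dev/sda1   <- needs an xfsprogs recent
                                              enough to know finobt
  # mount /dev/sda1 /mnt/backup
  # xfsrestore -f /target/stream0 /mnt/backup

With 350M files this will still take a long time and a lot of IO, as Dave
says.)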
> 
> I don't need a backup, so reformatting is possible, but I really would
> like to stay on 3.10. Is there anything I can backport, or do I really
> need to upgrade? Which version at least?
> 
> >Also, if the problem really is the number of identically sized free
> >space fragments in the freespace btrees, then the initial solution
> >is, again, a mkfs one. i.e. remake the filesystem with more, smaller
> >AGs to keep the number of extents the btrees need to index down to a
> >reasonable level. Say a couple of hundred AGs rather than 21?
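
(For example - agcount is a standard "mkfs.xfs -d" option, and the exact
number here is only illustrative:

  # mkfs.xfs -d agcount=256 -m crc=1,finobt=1 /dev/sda1

Each AG has its own pair of free space btrees, so spreading the same free
space over a couple of hundred AGs keeps the number of extents each btree
has to index much smaller.)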
> 
> mkfs chose 21 automagically - it's nothing I've set. Is this a bug, or do
> I just need more because of my special use case?
> 
> Thanks!
> 
> Stefan
> 
> >>>If so, I wonder if something like the
> >>>following commit introduced in 3.12 would help:
> >>>
> >>>133eeb17 xfs: don't use speculative prealloc for small files
> >>
> >>Looks interesting.
> >
> >Probably won't make any difference because backups via rsync do
> >open/write/close and don't touch the file data again, so the close
> >will be removing speculative preallocation before the data is
> >written and extents are allocated by background writeback....
> >
> >Cheers,
> >
> >Dave.
> >
> 

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs



