Re: Is XFS suitable for 350 million files on 20TB storage?

Hi Dave,

On 06.09.2014 01:05, Dave Chinner wrote:
On Fri, Sep 05, 2014 at 02:40:32PM +0200, Stefan Priebe - Profihost AG wrote:

On 05.09.2014 at 14:30, Brian Foster wrote:
On Fri, Sep 05, 2014 at 11:47:29AM +0200, Stefan Priebe - Profihost AG wrote:
Hi,

I have a backup system running 20TB of storage holding 350 million files.
This was working fine for months.

But now the free space is so heavily fragmented that I only see the
kworker threads at 4x 100% CPU and write speed being very slow. 15TB of the
20TB are in use.

What does perf tell you about the CPU being burnt? (i.e. run perf top
for 10-20s while that CPU burn is happening and paste the top 10 CPU
consuming functions).

here we go:
 15,79%  [kernel]            [k] xfs_inobt_get_rec
 14,57%  [kernel]            [k] xfs_btree_get_rec
 10,37%  [kernel]            [k] xfs_btree_increment
  7,20%  [kernel]            [k] xfs_btree_get_block
  6,13%  [kernel]            [k] xfs_btree_rec_offset
  4,90%  [kernel]            [k] xfs_dialloc_ag
  3,53%  [kernel]            [k] xfs_btree_readahead
  2,87%  [kernel]            [k] xfs_btree_rec_addr
  2,80%  [kernel]            [k] _xfs_buf_find
  1,94%  [kernel]            [k] intel_idle
  1,49%  [kernel]            [k] _raw_spin_lock
  1,13%  [kernel]            [k] copy_pte_range
  1,10%  [kernel]            [k] unmap_single_vma
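
(For reference, a non-interactive capture of a similar listing can be made
with perf record/report; the exact flags below are illustrative, not taken
from this thread:

  # perf record -a -g -- sleep 15
  # perf report --stdio --sort symbol | head -20

perf top shows the same data live on the terminal.)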


Overall there are 350 million files - all in different directories, with
at most 5000 per dir.

Kernel is 3.10.53 and mount options are:
noatime,nodiratime,attr2,inode64,logbufs=8,logbsize=256k,noquota

# xfs_db -r -c freesp /dev/sda1
    from      to extents  blocks    pct
       1       1 29484138 29484138   2,16
       2       3 16930134 39834672   2,92
       4       7 16169985 87877159   6,45
       8      15 78202543 999838327  73,41

With an inode size of 256 bytes, this is going to be your real
problem soon - most of the free space is smaller than an inode
chunk so soon you won't be able to allocate new inodes, even though
there is free space on disk.

Unfortunately, there's not much we can do about this right now - we
need development in both user and kernel space to mitigate this
issue: sparse inode chunk allocation in kernel space, and free space
defragmentation in userspace. Both are on the near term development
list....

Also, the fact that there are almost 80 million 8-15 block extents
indicates that the CPU burn is likely coming from the by-size free
space search. We look up the first extent of the correct size, and
then do a linear search for a nearest extent of that size to the
target. Hence we could be searching millions of extents to find the
"nearest"....

      16      31 3562456 83746085   6,15
      32      63 2370812 102124143   7,50
      64     127  280885 18929867   1,39
     256     511       2     827   0,00
     512    1023      65   35092   0,00
    2048    4095       2    6561   0,00
   16384   32767       1   23951   0,00
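
(For a closer look at how those extents are distributed, xfs_db's freesp
command can also print a summary and be restricted to a single AG; the AG
number below is just an example:

  # xfs_db -r -c "freesp -s" /dev/sda1
  # xfs_db -r -c "freesp -s -a 0" /dev/sda1

The -s summary includes the total number of free extents and the average
free extent size for the AGs queried.)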

Is there anything I can optimize? Or is it just a bad idea to do this
with XFS?

No, it's not a bad idea. In fact, if you have this sort of use case,
XFS is really your only choice. In terms of optimisation, the only
thing that will really help performance is the new finobt structure.
That's a mkfs option and not an in-place change, though, so it's
unlikely to help.

I've no problem with reformatting the array. I have more backups.

FWIW, it may also help the aging characteristics of this sort of
workload by improving inode allocation layout. That would be
a side effect of being able to search the entire free inode tree
extremely quickly, rather than allocating new chunks to keep down the
CPU time spent searching the allocated inode tree for free inodes. Hence
it would tend to more tightly pack inode chunks when they are allocated
on disk, as it will fill existing chunks before allocating new ones
elsewhere.
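
(Whether an existing filesystem already has finobt enabled can be checked
with a new enough xfsprogs - the mount point here is a placeholder:

  # xfs_info /backup | grep finobt

mkfs.xfs enables it at format time with -m finobt=1, which also requires
crc=1.)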

Any other options? Maybe rsync options like --inplace /
--no-whole-file?

For 350M files? I doubt there's much you can really do. Any sort of
large scale re-organisation is going to take a long, long time and
require lots of IO. If you are going to take that route, you'd do
better to upgrade kernel and xfsprogs, then dump/mkfs.xfs -m
crc=1,finobt=1/restore. And you'd probably want to use a
multi-stream dump/restore so it can run operations concurrently and
hence at storage speed rather than being CPU bound....
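
(A rough sketch of that route, assuming the data is copied onto a second,
freshly made filesystem - /dev/sdb1, /backup and /mnt/new are placeholder
names:

  # mkfs.xfs -m crc=1,finobt=1 /dev/sdb1
  # mount /dev/sdb1 /mnt/new
  # xfsdump -J - /backup | xfsrestore -J - /mnt/new

The piped form is a single stream; a multi-stream dump as suggested above
needs multiple -f destinations on the xfsdump command line instead of a
pipe.)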

I don't need a backup, so reformatting is possible, but I really would like to stay on 3.10. Is there anything I can backport, or do I really need to upgrade? Which version at least?

Also, if the problem really is the number of identically sized free
space fragments in the freespace btrees, then the initial solution
is, again, a mkfs one. i.e. remake the filesystem with more, smaller
AGs to keep the number of extents the btrees need to index down to a
reasonable level. Say a couple of hundred AGs rather than 21?
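
(Something along these lines - agcount=256 is just one way of getting "a
couple of hundred" AGs, and -f is needed to overwrite the existing
filesystem:

  # mkfs.xfs -f -d agcount=256 -m crc=1,finobt=1 /dev/sda1

Combining the larger agcount with finobt would cover both suggestions in
one reformat.)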

mkfs chose 21 automagically - it's nothing I've set. Is this a bug, or do I just need more because of my special use case?

Thanks!

Stefan

If so, I wonder if something like the
following commit introduced in 3.12 would help:

133eeb17 xfs: don't use speculative prealloc for small files

Looks interesting.

Probably won't make any difference because backups via rsync do
open/write/close and don't touch the file data again, so the close
will be removing speculative preallocation before the data is
written and extents are allocated by background writeback....

Cheers,

Dave.


_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs



