On Fri, Sep 05, 2014 at 08:07:38PM +0200, Stefan Priebe wrote:
> Hi,
>
> On 05.09.2014 15:48, Brian Foster wrote:
> >On Fri, Sep 05, 2014 at 02:40:32PM +0200, Stefan Priebe - Profihost AG wrote:
> >>
> >>On 05.09.2014 at 14:30, Brian Foster wrote:
> >>>On Fri, Sep 05, 2014 at 11:47:29AM +0200, Stefan Priebe - Profihost AG wrote:
> >>>>Hi,
> >>>>
> >>>>I have a backup system running 20TB of storage holding 350 million files.
> >>>>This was working fine for months.
> >>>>
> >>>>But now the free space is so heavily fragmented that I only see kworker
> >>>>at 4x 100% CPU and the write speed is very slow. 15TB of the 20TB are
> >>>>in use.
> >>>>
> >>>>Overall there are 350 million files - all in different directories, max
> >>>>5000 per dir.
> >>>>
> >>>>Kernel is 3.10.53 and mount options are:
> >>>>noatime,nodiratime,attr2,inode64,logbufs=8,logbsize=256k,noquota
> >>>>
> >>>># xfs_db -r -c freesp /dev/sda1
> >>>>   from      to  extents      blocks    pct
> >>>>      1       1 29484138    29484138   2,16
> >>>>      2       3 16930134    39834672   2,92
> >>>>      4       7 16169985    87877159   6,45
> >>>>      8      15 78202543   999838327  73,41
> >>>>     16      31  3562456    83746085   6,15
> >>>>     32      63  2370812   102124143   7,50
> >>>>     64     127   280885    18929867   1,39
> >>>>    256     511        2         827   0,00
> >>>>    512    1023       65       35092   0,00
> >>>>   2048    4095        2        6561   0,00
> >>>>  16384   32767        1       23951   0,00
> >>>>
> >>>>Is there anything I can optimize? Or is it just a bad idea to do this
> >>>>with XFS? Any other options? Maybe rsync options like --inplace /
> >>>>--no-whole-file?
> >>>>
> >>>
> >>>It's probably a good idea to include more information about your fs:
> >>>
> >>>http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
> >>
> >>Generally sure, but the problem itself is clear. If you look at the free
> >>space allocation you can see that free space is heavily fragmented.
> >>
> >>But here you go:
> >>- 3.10.53 vanilla
> >>- xfs_repair version 3.1.11
> >>- 16 cores
> >>- /dev/sda1 /backup xfs
> >>  rw,noatime,nodiratime,attr2,inode64,logbufs=8,logbsize=256k,noquota 0 0
> >>- Raid 10 with 1GB controller cache running in write back mode using 24
> >>  spinners
> >>- no lvm
> >>- no io waits
> >>- xfs_info /serverbackup/
> >>meta-data=/dev/sda1              isize=256    agcount=21, agsize=268435455 blks
> >>         =                       sectsz=512   attr=2
> >>data     =                       bsize=4096   blocks=5369232896, imaxpct=5
> >>         =                       sunit=0      swidth=0 blks
> >>naming   =version 2              bsize=4096   ascii-ci=0
> >>log      =internal               bsize=4096   blocks=521728, version=2
> >>         =                       sectsz=512   sunit=0 blks, lazy-count=1
> >>realtime =none                   extsz=4096   blocks=0, rtextents=0
> >>
> >>anything missing?
> >>
> >
> >What's the workload to the fs? Is it repeated rsyncs from a constantly
> >changing dataset? Do the files change frequently or are they only ever
> >added/removed?
>
> Yes, it is a repeated rsync with constantly changing files. About 10-20% of
> all files change every week - a mixture of changing, removing and adding.
>

Ok.

> >Also, what is the characterization of writes being "slow?" An rsync is
> >slower than normal? Sustained writes to a single file? How significant a
> >degradation?
>
> kworker is using all CPU while writing data to this xfs partition. rsync can
> only write at a rate of 32-128kb/s.
>

Do you have a baseline? This seems highly subjective. By that I mean this
could be slower for copying a lot of little files, faster if you happen to
copy a single large file, etc.
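For a rough baseline (just a sketch - the file name under /backup and the size
are arbitrary, adjust to whatever fits your setup), a plain sequential write
to one large file gives a number to compare that 32-128kb/s against:

  # ~4GB sequential write, flushed to disk before dd reports a rate
  dd if=/dev/zero of=/backup/ddtest bs=1M count=4096 conv=fsync
  rm /backup/ddtest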
> >Something like the following might be interesting as well:
> >for i in $(seq 0 20); do xfs_db -c "agi $i" -c "p freecount" <dev>; done
> freecount = 3189417
> freecount = 1975726
> freecount = 1309903
> freecount = 1726846
> freecount = 1271047
> freecount = 1281956
> freecount = 1571285
> freecount = 1365473
> freecount = 1238118
> freecount = 1697011
> freecount = 1000832
> freecount = 1369791
> freecount = 1706360
> freecount = 1439165
> freecount = 1656404
> freecount = 1881762
> freecount = 1593432
> freecount = 1555909
> freecount = 1197091
> freecount = 1667467
> freecount = 63
>

Interesting, that seems like a lot of free inodes. That's 1-2 million in each
AG that we have to look around for each time we want to allocate an inode. I
can't say for sure that's the source of the slowdown, but this certainly looks
like the kind of workload that inspired the addition of the free inode btree
(finobt) to more recent kernels.

It appears that you still have quite a bit of space available in general.
Could you run some local tests on this filesystem to try and quantify how much
of this degradation manifests on sustained writes vs. file creation? For
example, how is throughput when writing a few GB to a local test file? How
about with that same amount of data broken up across a few thousand files?
(Rough sketches of both are at the bottom of this mail.)

Brian

P.S. Alternatively, if you wanted to grab a metadump of this filesystem and
compress/upload it somewhere, I'd be interested to take a look at it.

> Thanks!
>
> Stefan
>
> >
> >Brian
> >
> >>>... as well as what your typical workflow/dataset is for this fs. It
> >>>seems like you have relatively small files (15TB used across 350m files
> >>>is around 46k per file), yes?
> >>
> >>Yes - most of them are even smaller. And some files are > 5GB.
> >>
> >>>If so, I wonder if something like the
> >>>following commit introduced in 3.12 would help:
> >>>
> >>>133eeb17 xfs: don't use speculative prealloc for small files
> >>
> >>Looks interesting.
> >>
> >>Stefan
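Re the local tests suggested above: the single-large-file case is the same dd
as further up in this mail; for the file-creation side, a loop along these
lines would do (only a sketch - directory name, file size and count are
placeholders):

  # spread ~4GB across a few thousand 1MB files and time the whole run
  mkdir /backup/smallfiletest
  time sh -c 'for i in $(seq 1 4096); do
      dd if=/dev/zero of=/backup/smallfiletest/f$i bs=1M count=1 2>/dev/null
  done; sync'
  rm -r /backup/smallfiletest

And for the metadump from the P.S., something like:

  xfs_metadump -g /dev/sda1 - | bzip2 > backup.metadump.bz2

should produce a reasonably small file to upload (-g just prints progress;
the output file name is only an example).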