Re: [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 7 Nov 2019 08:20:41 +1100

On Wed, Nov 06, 2019 at 02:00:58PM +0800, Shaokun Zhang wrote:
> Hi Dave,
> 
> On 2019/11/5 12:03, Dave Chinner wrote:
> > On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
> >> Hi Dave,
> >>
> >> On 2019/11/5 4:49, Dave Chinner wrote:
> >>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
> >>>> From: Yang Guo <guoyang2@xxxxxxxxxx>
> >>>>
> >>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
> >>>> whether the counter is less than 0, and it is an expensive function.
> >>>> Let's check it only when delta < 0; that will be good for xfs's performance.
> >>>
> >>> Hmmm. I don't recall this as being expensive.
> >>>
> >>
> >> Sorry about the misleading information in the commit message.
> >>
> >>> How did you find this? Can you please always document how you found
> >>
> >> If a user creates millions of files and then deletes them, we found that
> >> __percpu_counter_compare cost 5.78% of CPU usage. You are right that it is
> >> not expensive itself, but it calls __percpu_counter_sum, which takes a
> >> spin_lock and reads the other CPUs' counts. perf record -g was used to profile it:
> >>
> >> - 5.88%     0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
> >>    - 5.86% xfs_mod_ifree
> >>       - 5.78% __percpu_counter_compare
> >>            5.61% __percpu_counter_sum
> > 
> > Interesting. Your workload is hitting the slow path, which I most
> > certainly do not see when creating lots of files. What's your
> > workload?
> > 
> 
> The hardware has 128 cpu cores, and the xfs filesystem format config is default,
> while the test is single threaded, as follows:
> ./mdtest -I 10  -z 6 -b 8 -d /mnt/ -t -c 2

What version and where do I get it?

Hmmm - isn't mdtest a MPI benchmark intended for highly concurrent
metadata workload testing? How representative is it of your actual
production workload? Is that single threaded?

> xfs info:
> meta-data=/dev/bcache2           isize=512    agcount=4, agsize=244188661 blks

Only 4 AGs, which explains the lack of free inodes - there isn't
enough concurrency in the filesystem layout to push the free inode
count in all AGs beyond the batchsize * num_online_cpus().

i.e. single threaded workloads typically drain the free inode count
all the way down to zero before new inodes are allocated. Workloads
that are highly concurrent allocate from lots of AGs at once,
leaving free inodes in every AG that is not currently being actively
allocated out of.

As a test, can you remake that test filesystem with "-d agcount=32"
and see if the overhead you are seeing disappears?

> > files and you have lots of idle CPU and hence the inode allocation
> > is not clearing the fast path batch threshold on the ifree counter.
> > And because you have lots of CPUs, the cost of a sum is very
> > expensive compared to running single threaded creates. That's my
> > current hypothesis based on what I see on my workloads: that
> > xfs_mod_ifree overhead goes down as concurrency goes up....
> > 
> 
> Agreed, we added some debug info to xfs_mod_ifree and found that most of the
> time m_ifree.count < batch * num_online_cpus(), because we have 128 online
> cpus and m_ifree.count is around 999.

Ok, the threshold is 32 * 128 = 4096 to get out of the slow
path. 32 AGs may well push the count over this threshold, so it's
definitely worth trying....
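
For reference, here is a minimal userspace sketch of the threshold check
being discussed. It is modelled on the general shape of
__percpu_counter_compare() and is not the kernel code itself. The
model_* names, the struct layout and the "sums" instrumentation field
are hypothetical, and the batch of 32 and the 128 CPUs are taken from
the numbers in this thread. It shows why a rough count of around 999
always falls through to the per-CPU sum, while a count above 4096 does
not.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_NR_CPUS 128  /* assumption: the 128-CPU machine in this thread */
#define MODEL_BATCH   32   /* assumption: the batch size used in this thread */

/* Hypothetical userspace stand-in for a per-CPU counter. */
struct model_counter {
	int64_t count;               /* global, approximate value */
	int32_t pcpu[MODEL_NR_CPUS]; /* per-CPU deltas not yet folded in */
	unsigned long sums;          /* instrumentation: slow path invocations */
};

/* Fast path: read only the global count, which may be stale. */
static int64_t model_read(const struct model_counter *c)
{
	return c->count;
}

/*
 * Slow path: walk every CPU's delta. The kernel version also takes a
 * spinlock here, so the cost grows with the number of CPUs.
 */
static int64_t model_sum(struct model_counter *c)
{
	int64_t sum = c->count;

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += c->pcpu[cpu];
	c->sums++;
	return sum;
}

/* Returns 1, 0 or -1 if the counter is greater than, equal to or less than rhs. */
static int model_compare(struct model_counter *c, int64_t rhs)
{
	int64_t threshold = (int64_t)MODEL_BATCH * MODEL_NR_CPUS;
	int64_t count = model_read(c);

	assert(threshold == 4096); /* 32 * 128 */

	/* Rough count is far enough from rhs that the error cannot matter. */
	if (llabs((long long)(count - rhs)) > threshold)
		return count > rhs ? 1 : -1;

	/* Too close to call: fall back to the expensive precise sum. */
	count = model_sum(c);
	if (count > rhs)
		return 1;
	if (count < rhs)
		return -1;
	return 0;
}
```

In this model, comparing a counter holding 999 against 0 takes the
model_sum() path every time, whereas a counter holding 5000 returns
from the fast path without touching the per-CPU array.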

> > FWIW, the profiles I took came from running this on 16 and 32p
> > machines:
> > 
> > --
> > dirs=""
> > for i in `seq 1 $THREADS`; do
> >         dirs="$dirs -d /mnt/scratch/$i"
> > done
> > 
> > cycles=$((512 / $THREADS))
> > 
> > time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
> > --
> > 
> > With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
> > image:
> > 
> > meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
> >          =                       sectsz=512   attr=2, projid32bit=1
> >          =                       crc=1        finobt=1, sparse=1, rmapbt=0
> >          =                       reflink=1
> > data     =                       bsize=4096   blocks=134217727500, imaxpct=1
> >          =                       sunit=0      swidth=0 blks
> > naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> > log      =internal log           bsize=4096   blocks=521728, version=2
> >          =                       sectsz=512   sunit=0 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > 
> > That's allocating enough inodes to keep the free inode counter
> > entirely out of the slow path...
> 
> percpu_counter_read, which reads the count, incurs a cache synchronization
> cost if another cpu changes the count. Maybe it's better not to call
> percpu_counter_compare where possible.

Depends.  Sometimes we trade off ultimate single threaded
performance and efficiency for substantially better scalability.
i.e. if we lose 5% on single threaded performance but gain 10x on
concurrent workloads, then that is a good tradeoff to make.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx