Re: [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 5 Nov 2019 15:03:25 +1100

On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
> Hi Dave,
> 
> On 2019/11/5 4:49, Dave Chinner wrote:
> > On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
> >> From: Yang Guo <guoyang2@xxxxxxxxxx>
> >>
> >> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
> >> whether the counter less than 0 and it is a expensive function.
> >> let's check it only when delta < 0, it will be good for xfs's performance.
> > 
> > Hmmm. I don't recall this as being expensive.
> > 
> 
> Sorry about the misunderstanding information in commit message.
> 
> > How did you find this? Can you please always document how you found
> 
> If user creates million of files and the delete them, We found that the
> __percpu_counter_compare costed 5.78% CPU usage, you are right that itself
> is not expensive, but it calls __percpu_counter_sum which will use
> spin_lock and read other cpu's count. perf record -g is used to profile it:
> 
> - 5.88%     0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
>    - 5.86% xfs_mod_ifree
>       - 5.78% __percpu_counter_compare
>            5.61% __percpu_counter_sum

Interesting. Your workload is hitting the slow path, which I most
certainly do not see when creating lots of files. What's your
workload?

> > IOWs, we typically measure the overhead of such functions by kernel
> > profile.  Creating ~200,000 inodes a second, so hammering the icount
> > and ifree counters, I see:
> > 
> >       0.16%  [kernel]  [k] percpu_counter_add_batch
> >       0.03%  [kernel]  [k] __percpu_counter_compare
> > 
> 
> 0.03% is just __percpu_counter_compare's usage.

No, that's the total of _all_ the percpu counter functions captured
by the profile - it was the list of all samples filtered by
"percpu". I just re-ran the profile again, and got:

   0.23%  [kernel]  [k] percpu_counter_add_batch
   0.04%  [kernel]  [k] __percpu_counter_compare
   0.00%  [kernel]  [k] collect_percpu_times
   0.00%  [kernel]  [k] __handle_irq_event_percpu
   0.00%  [kernel]  [k] __percpu_counter_sum
   0.00%  [kernel]  [k] handle_irq_event_percpu
   0.00%  [kernel]  [k] fprop_reflect_period_percpu.isra.0
   0.00%  [kernel]  [k] percpu_ref_switch_to_atomic_rcu
   0.00%  [kernel]  [k] free_percpu
   0.00%  [kernel]  [k] percpu_ref_exit

So you can see that there are essentially no samples in
__percpu_counter_sum at all - my tests are not hitting the slow path
at all, despite allocating inodes continuously.

IOWs, your workload is hitting the slow path repeatedly, and so the
question that needs to be answered is "why is the slow path actually
being exercised?". To answer that, we need to know what your workload is, what
the filesystem config is, what hardware (cpus, storage, etc) you are
running on, etc. There must be some reason for the slow path being
used, and that's what we need to understand first before deciding
what the best fix might be...

I suspect that you are only running one or two threads creating
files and you have lots of idle CPU and hence the inode allocation
is not clearing the fast path batch threshold on the ifree counter.
And because you have lots of CPUs, a sum is very expensive
compared to running single threaded creates. That's my
current hypothesis, based on what I see on my workloads, where
xfs_mod_ifree overhead goes down as concurrency goes up....
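
To illustrate the hypothesis, here is a rough userspace model in Python
of how a batched per-cpu counter chooses between the fast path and the
full sum. It is a sketch, not the kernel code: the class and function
names are made up for illustration, the batch size of 32 and the CPU
counts are assumed values, and it only mimics the
"abs(count - rhs) > batch * ncpus" check that __percpu_counter_compare
uses to avoid calling __percpu_counter_sum.

```python
class PerCpuCounterModel:
    """Toy userspace model of a batched per-CPU counter (not kernel code)."""

    def __init__(self, ncpus, batch):
        self.ncpus = ncpus
        self.batch = batch
        self.count = 0              # shared, approximate global count
        self.local = [0] * ncpus    # per-CPU deltas not yet folded in
        self.slow_sums = 0          # number of full sums compare() needed

    def add(self, cpu, amount):
        # Accumulate locally; fold into the shared count once the local
        # delta reaches the batch size.
        pending = self.local[cpu] + amount
        if abs(pending) >= self.batch:
            self.count += pending
            self.local[cpu] = 0
        else:
            self.local[cpu] = pending

    def sum(self):
        # Slow path: walk every CPU's local delta.
        self.slow_sums += 1
        return self.count + sum(self.local)

    def compare(self, rhs):
        count = self.count
        # Fast path: the rough count is far enough from rhs to be trusted.
        if abs(count - rhs) > self.batch * self.ncpus:
            return 1 if count > rhs else -1
        # Otherwise a precise (expensive) sum is required.
        count = self.sum()
        return (count > rhs) - (count < rhs)


def count_slow_compares(ncpus, batch, start, ops):
    """Single-threaded decrement-then-compare loop; returns slow-path hits."""
    c = PerCpuCounterModel(ncpus, batch)
    c.count = start                 # assumed starting value of the counter
    for _ in range(ops):
        c.add(0, -1)                # all updates come from CPU 0
        c.compare(0)                # like checking the counter against 0
    return c.slow_sums


if __name__ == "__main__":
    for ncpus, start in ((64, 1_000_000), (64, 1000), (4, 1000)):
        slow = count_slow_compares(ncpus, 32, start, 500)
        print(f"ncpus={ncpus} start={start}: {slow} of 500 compares summed")
```

In this model, 64 CPUs give a threshold of 32 * 64 = 2048, so a counter
sitting near 1000 takes the full sum on every compare, while the same
counter on 4 CPUs (threshold 128) never does.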

FWIW, the profiles I took came from running this on 16 and 32p
machines:

--
dirs=""
for i in `seq 1 $THREADS`; do
        dirs="$dirs -d /mnt/scratch/$i"
done

cycles=$((512 / $THREADS))

time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
--

With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
image:

meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=134217727500, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

That's allocating enough inodes to keep the free inode counter
entirely out of the slow path...
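
For a rough sense of scale, the arithmetic behind the script and the
mkfs output above can be sketched in Python. The helper names are
hypothetical, and the file total assumes fs_mark creates NFILES files
per directory per loop, which is an assumption about the -n and -L
options rather than something stated above.

```python
def fs_mark_totals(threads, nfiles, total_cycles=512):
    # Mirrors cycles=$((512 / $THREADS)) from the script (integer division).
    cycles = total_cycles // threads
    # Assumption: each of the `threads` directories gets nfiles files per loop.
    return cycles, threads * nfiles * cycles


def fs_size_bytes(blocks, bsize):
    # Data section size from the mkfs output: blocks * bsize.
    return blocks * bsize


if __name__ == "__main__":
    for threads in (16, 32):
        cycles, total = fs_mark_totals(threads, 100000)
        print(f"THREADS={threads}: cycles={cycles}, files created={total}")

    size = fs_size_bytes(134217727500, 4096)
    print(f"filesystem size: {size} bytes, about {size / 2**40:.1f} TiB")
```

Under that assumption, both thread counts create 51.2 million files in
total, on a filesystem of just under 500 TiB.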

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx