On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
> Hi Dave,
>
> On 2019/11/5 4:49, Dave Chinner wrote:
> > On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
> >> From: Yang Guo <guoyang2@xxxxxxxxxx>
> >>
> >> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
> >> whether the counter less than 0 and it is a expensive function.
> >> let's check it only when delta < 0, it will be good for xfs's performance.
> >
> > Hmmm. I don't recall this as being expensive.
> >
>
> Sorry about the misunderstanding information in commit message.
>
> > How did you find this? Can you please always document how you found
>
> If user creates million of files and the delete them, We found that the
> __percpu_counter_compare costed 5.78% CPU usage, you are right that itself
> is not expensive, but it calls __percpu_counter_sum which will use
> spin_lock and read other cpu's count. perf record -g is used to profile it:
>
> - 5.88% 0.02% rm [kernel.vmlinux] [k] xfs_mod_ifree
>    - 5.86% xfs_mod_ifree
>       - 5.78% __percpu_counter_compare
>            5.61% __percpu_counter_sum

Interesting. Your workload is hitting the slow path, which I most
certainly do not see when creating lots of files. What's your
workload?

> > IOWs, we typically measure the overhead of such functions by kernel
> > profile. Creating ~200,000 inodes a second, so hammering the icount
> > and ifree counters, I see:
> >
> > 0.16% [kernel] [k] percpu_counter_add_batch
> > 0.03% [kernel] [k] __percpu_counter_compare
> >
>
> 0.03% is just __percpu_counter_compare's usage.

No, that's the total of _all_ the percpu counter functions captured
by the profile - it was the list of all samples filtered by
"percpu".
I just re-ran the profile and got:

   0.23%  [kernel]  [k] percpu_counter_add_batch
   0.04%  [kernel]  [k] __percpu_counter_compare
   0.00%  [kernel]  [k] collect_percpu_times
   0.00%  [kernel]  [k] __handle_irq_event_percpu
   0.00%  [kernel]  [k] __percpu_counter_sum
   0.00%  [kernel]  [k] handle_irq_event_percpu
   0.00%  [kernel]  [k] fprop_reflect_period_percpu.isra.0
   0.00%  [kernel]  [k] percpu_ref_switch_to_atomic_rcu
   0.00%  [kernel]  [k] free_percpu
   0.00%  [kernel]  [k] percpu_ref_exit

So you can see that there are essentially no samples in
__percpu_counter_sum at all - my tests are not hitting the slow
path, despite allocating inodes continuously.

IOWs, your workload is hitting the slow path repeatedly, and so the
question that needs to be answered is "why is the slow path actually
being exercised?". That means we need to know what your workload is,
what the filesystem config is, what hardware (cpus, storage, etc)
you are running on, etc. There must be some reason for the slow path
being used, and that's what we need to understand first before
deciding what the best fix might be...

I suspect that you are only running one or two threads creating
files and you have lots of idle CPU, and hence the inode allocation
is not clearing the fast path batch threshold on the ifree counter.
And because you have lots of CPUs, a sum is very expensive compared
to running single threaded creates. That's my current hypothesis,
based on what I see on my workloads: xfs_mod_ifree overhead goes
down as concurrency goes up....
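To make the fast path vs slow path distinction concrete, here is a
minimal userspace sketch of how a batched per-cpu counter comparison
behaves. It is a simplified model and not the kernel source: there is
no locking, and the names model_counter, model_add, model_sum,
model_compare, MODEL_NR_CPUS and MODEL_BATCH are made up for
illustration. The batch value of 32 is also just an assumption.

```c
#include <stdint.h>
#include <stdlib.h>

#define MODEL_NR_CPUS 32  /* assumption: a 32p machine */
#define MODEL_BATCH   32  /* assumption: illustrative batch size */

/* Hypothetical stand-in for a per-cpu counter; no locking is modelled. */
struct model_counter {
    int64_t count;                /* approximate global count */
    int32_t pcpu[MODEL_NR_CPUS];  /* per-cpu deltas not yet folded in */
    unsigned long slow_sums;      /* number of times the slow path ran */
};

/* Add to one cpu's local delta; fold into the global count at the batch. */
static void model_add(struct model_counter *c, int cpu, int64_t amount)
{
    int64_t v = c->pcpu[cpu] + amount;

    if (v >= MODEL_BATCH || v <= -MODEL_BATCH) {
        c->count += v;  /* the real code would take a lock here */
        c->pcpu[cpu] = 0;
    } else {
        c->pcpu[cpu] = (int32_t)v;
    }
}

/* Slow path: walk every cpu's delta to get a precise value. */
static int64_t model_sum(struct model_counter *c)
{
    int64_t sum = c->count;

    for (int i = 0; i < MODEL_NR_CPUS; i++)
        sum += c->pcpu[i];
    c->slow_sums++;
    return sum;
}

/* Returns 1, 0 or -1 for counter >, == or < rhs. */
static int model_compare(struct model_counter *c, int64_t rhs)
{
    int64_t count = c->count;

    /* Fast path: the rough count is far enough from rhs to decide. */
    if (llabs(count - rhs) > (int64_t)MODEL_BATCH * MODEL_NR_CPUS)
        return count > rhs ? 1 : -1;

    /* Too close to call: fall back to the precise sum. */
    count = model_sum(c);
    if (count > rhs)
        return 1;
    if (count < rhs)
        return -1;
    return 0;
}
```

In this model, while the counter stays within MODEL_BATCH *
MODEL_NR_CPUS of the value it is compared against (e.g. a free inode
count near zero), every model_compare() call falls through to
model_sum() and walks all the per-cpu slots. Once the count is further
away than that, the comparison returns from the approximate count
alone, and the more cpus there are, the wider the window in which the
expensive sum is needed.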
FWIW, the profiles I took came from running this on 16 and 32p
machines:

--
dirs=""
for i in `seq 1 $THREADS`; do
	dirs="$dirs -d /mnt/scratch/$i"
done

cycles=$((512 / $THREADS))
time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
--

With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
image:

meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=134217727500, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

That's allocating enough inodes to keep the free inode counter
entirely out of the slow path...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx