Hi Dave,

On 2019/11/5 12:03, Dave Chinner wrote:
> On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
>> Hi Dave,
>>
>> On 2019/11/5 4:49, Dave Chinner wrote:
>>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>>>> From: Yang Guo <guoyang2@xxxxxxxxxx>
>>>>
>>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>>>> whether the counter less than 0 and it is a expensive function.
>>>> let's check it only when delta < 0, it will be good for xfs's performance.
>>>
>>> Hmmm. I don't recall this as being expensive.
>>>
>>
>> Sorry about the misunderstanding information in commit message.
>>
>>> How did you find this? Can you please always document how you found
>>
>> If user creates million of files and the delete them, We found that the
>> __percpu_counter_compare costed 5.78% CPU usage, you are right that itself
>> is not expensive, but it calls __percpu_counter_sum which will use
>> spin_lock and read other cpu's count. perf record -g is used to profile it:
>>
>> -    5.88%     0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
>>    - 5.86% xfs_mod_ifree
>>       - 5.78% __percpu_counter_compare
>>            5.61% __percpu_counter_sum
>
> Interesting. Your workload is hitting the slow path, which I most
> certainly do no see when creating lots of files. What's your
> workload?
>

The hardware has 128 CPU cores, the xfs filesystem uses the default
format config, and the test runs a single thread, as follows:

./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2

xfs info:
meta-data=/dev/bcache2     isize=512    agcount=4, agsize=244188661 blks
         =                 sectsz=512   attr=2, projid32bit=1
         =                 crc=1        finobt=1 spinodes=1 rmapbt=0
         =                 reflink=0
data     =                 bsize=4096   blocks=976754644, imaxpct=5
         =                 sunit=0      swidth=0 blks
naming   =version 2        bsize=4096   ascii-ci=0 ftype=1
log      =internal         bsize=4096   blocks=476930, version=2
         =                 sectsz=512   sunit=0 blks, lazy-count=1
realtime =none             extsz=4096   blocks=0, rtextents=0

disk info:
Disk /dev/bcache2: 4000.8 GB, 4000787021824 bytes, 7814037152 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

>>> IOWs, we typically measure the overhead of such functions by kernel
>>> profile. Creating ~200,000 inodes a second, so hammering the icount
>>> and ifree counters, I see:
>>>
>>>   0.16%  [kernel]  [k] percpu_counter_add_batch
>>>   0.03%  [kernel]  [k] __percpu_counter_compare
>>>
>>
>> 0.03% is just __percpu_counter_compare's usage.
>
> No, that's the total of _all_ the percpu counter functions captured
> by the profile - it was the list of all samples filtered by
> "percpu". I just re-ran the profile again, and got:
>
>   0.23%  [kernel]  [k] percpu_counter_add_batch
>   0.04%  [kernel]  [k] __percpu_counter_compare
>   0.00%  [kernel]  [k] collect_percpu_times
>   0.00%  [kernel]  [k] __handle_irq_event_percpu
>   0.00%  [kernel]  [k] __percpu_counter_sum
>   0.00%  [kernel]  [k] handle_irq_event_percpu
>   0.00%  [kernel]  [k] fprop_reflect_period_percpu.isra.0
>   0.00%  [kernel]  [k] percpu_ref_switch_to_atomic_rcu
>   0.00%  [kernel]  [k] free_percpu
>   0.00%  [kernel]  [k] percpu_ref_exit
>
> So you can see that this essentially no samples in
> __percpu_counter_sum at all - my tests are not hitting the slow path
> at all, despite allocating inodes continuously.

Got it.

> IOWs, your workload is hitting the slow path repeatedly, and so the
> question that needs to be answered is "why is the slow path actually
> being exercised?". IOWs, we need to know what your workload is, what
> the filesystem config is, what hardware (cpus, storage, etc) you are
> running on, etc. There must be some reason for the slow path being
> used, and that's what we need to understand first before deciding
> what the best fix might be...
>
> I suspect that you are only running one or two threads creating

Yeah, we run just one thread in the test.

> files and you have lots of idle CPU and hence the inode allocation
> is not clearing the fast path batch threshold on the ifree counter.
> And because you have lots of CPUs, the cost of a sum is very
> expensive compared to running single threaded creates. That's my
> current hypothesis based what I see on my workloads that
> xfs_mod_ifree overhead goes down as concurrency goes up....
>

Agree. We added some debug info in xfs_mod_ifree and found that most of
the time m_ifree.count < batch * num_online_cpus(), because we have 128
online CPUs and m_ifree.count stays around 999.
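To make the arithmetic behind that concrete, here is a minimal userspace
sketch of the fast-path test in __percpu_counter_compare(); the batch
formula mirrors what compute_batch_value() in lib/percpu_counter.c does,
and both it and the exact check are paraphrased assumptions worth
re-checking against the kernel version in question:

--
/* Sketch of the __percpu_counter_compare() fast-path decision. */
#include <stdio.h>
#include <stdlib.h>

static long batch_for(long nr_cpus)
{
	/* Assumed default: percpu_counter_batch = max(32, nr_cpus * 2). */
	long b = nr_cpus * 2;

	return b > 32 ? b : 32;
}

/*
 * The approximate (unsummed) counter value can only be trusted when it
 * is further from rhs than the worst-case per-cpu drift; otherwise the
 * precise, spinlocked __percpu_counter_sum() has to be taken.
 */
static int needs_precise_sum(long approx, long rhs, long nr_cpus)
{
	return labs(approx - rhs) <= batch_for(nr_cpus) * nr_cpus;
}

int main(void)
{
	long nr_cpus = 128;		/* numbers from this thread */
	long ifree = 999, rhs = 0;	/* xfs_mod_ifree() compares against 0 */

	printf("threshold=%ld ifree=%ld -> %s\n",
	       batch_for(nr_cpus) * nr_cpus, ifree,
	       needs_precise_sum(ifree, rhs, nr_cpus) ?
			"slow path (__percpu_counter_sum)" : "fast path");
	return 0;
}
--

With 128 CPUs the assumed batch is max(32, 2 * 128) = 256, so the precise
sum is taken whenever the counter sits within 256 * 128 = 32768 of the
comparison value, which an ifree count of ~999 compared against 0 always
does.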
> FWIW, the profiles I took came from running this on 16 and 32p
> machines:
>
> --
> dirs=""
> for i in `seq 1 $THREADS`; do
>         dirs="$dirs -d /mnt/scratch/$i"
> done
>
> cycles=$((512 / $THREADS))
>
> time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
> --
>
> With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
> image:
>
> meta-data=/dev/vdc        isize=512    agcount=500, agsize=268435455 blks
>          =                sectsz=512   attr=2, projid32bit=1
>          =                crc=1        finobt=1, sparse=1, rmapbt=0
>          =                reflink=1
> data     =                bsize=4096   blocks=134217727500, imaxpct=1
>          =                sunit=0      swidth=0 blks
> naming   =version 2       bsize=4096   ascii-ci=0, ftype=1
> log      =internal log    bsize=4096   blocks=521728, version=2
>          =                sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none            extsz=4096   blocks=0, rtextents=0
>
> That's allocating enough inodes to keep the free inode counter
> entirely out of the slow path...

percpu_counter_read, which reads the shared count, will incur cache
synchronization cost whenever another CPU has changed the count, so
maybe it's better not to call percpu_counter_compare at all when it can
be avoided.

Thanks,
Shaokun

>
> Cheers,
>
> Dave.
>
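For reference, the guard the original patch proposes boils down to only
doing the comparison when the delta can actually drive the counter
negative. A rough sketch of the idea against the xfs_mod_ifree() of that
era (the surrounding code is paraphrased from fs/xfs/xfs_mount.c, not
taken from the posted patch, so treat the details as approximate):

--
int
xfs_mod_ifree(
	struct xfs_mount	*mp,
	int64_t			delta)
{
	percpu_counter_add(&mp->m_ifree, delta);

	/*
	 * The counter can only have gone below zero if delta was
	 * negative, so skip the potentially slow comparison for all
	 * other updates.
	 */
	if (delta < 0 && percpu_counter_compare(&mp->m_ifree, 0) < 0) {
		ASSERT(0);
		percpu_counter_add(&mp->m_ifree, -delta);
		return -EINVAL;
	}
	return 0;
}
--

xfs_mod_icount() would get the same treatment; whether that guard is the
right fix is exactly what the thread is trying to work out.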