On Fri, Nov 08, 2019 at 01:58:56PM +0800, Shaokun Zhang wrote: > Hi Dave, > > On 2019/11/7 5:20, Dave Chinner wrote: > > On Wed, Nov 06, 2019 at 02:00:58PM +0800, Shaokun Zhang wrote: > >> Hi Dave, > >> > >> On 2019/11/5 12:03, Dave Chinner wrote: > >>> On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote: > >>>> Hi Dave, > >>>> > >>>> On 2019/11/5 4:49, Dave Chinner wrote: > >>>>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote: > >>>>>> From: Yang Guo <guoyang2@xxxxxxxxxx> > >>>>>> > >>>>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check > >>>>>> whether the counter less than 0 and it is a expensive function. > >>>>>> let's check it only when delta < 0, it will be good for xfs's performance. > >>>>> > >>>>> Hmmm. I don't recall this as being expensive. > >>>>> > >>>> > >>>> Sorry about the misunderstanding information in commit message. > >>>> > >>>>> How did you find this? Can you please always document how you found > >>>> > >>>> If user creates million of files and the delete them, We found that the > >>>> __percpu_counter_compare costed 5.78% CPU usage, you are right that itself > >>>> is not expensive, but it calls __percpu_counter_sum which will use > >>>> spin_lock and read other cpu's count. perf record -g is used to profile it: > >>>> > >>>> - 5.88% 0.02% rm [kernel.vmlinux] [k] xfs_mod_ifree > >>>> - 5.86% xfs_mod_ifree > >>>> - 5.78% __percpu_counter_compare > >>>> 5.61% __percpu_counter_sum > >>> > >>> Interesting. Your workload is hitting the slow path, which I most > >>> certainly do no see when creating lots of files. What's your > >>> workload? > >>> > >> > >> The hardware has 128 cpu cores, and the xfs filesystem format config is default, > >> while the test is a single thread, as follow: > >> ./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2 > > > > What version and where do I get it? > > You can get the mdtest from github: https://github.com/LLNL/mdtest. Oh what a pain. $ MPI_CC=mpicc make mpicc -DLinux -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D__USE_LARGEFILE64=1 -g -o mdtest mdtest.c -lm mdtest.c:32:10: fatal error: mpi.h: No such file or directory 32 | #include "mpi.h" | ^~~~~~~ compilation terminated. make: *** [Makefile:59: mdtest] Error 1 It doesn't build with a modern linux distro openmpi installation. /me looks at the readme. Oh, it's an unmaintained historical archive. Hmmm, it got forked to https://github.com/MDTEST-LANL/mdtest. That's an unmaintained archive, too. Oh, there's a version at https://github.com/hpc/ior Ngggh. $./configure .... checking for mpicc... mpicc checking for gcc... (cached) mpicc checking whether the C compiler works... no configure: error: in `/home/dave/src/ior': configure: error: C compiler cannot create executables See `config.log' for more details $ $ less config.log .... configure:3805: checking whether the C compiler works configure:3827: mpicc conftest.c >&5 /usr/bin/ld: cannot find -lopen-rte /usr/bin/ld: cannot find -lopen-pal /usr/bin/ld: cannot find -lhwloc /usr/bin/ld: cannot find -levent /usr/bin/ld: cannot find -levent_pthreads collect2: error: ld returned 1 exit status configure:3831: $? = 1 ..... So, the mpicc compiler wrapper uses a bunch of libraries that don't get installed with the compiler wrapper. Yay? > > Hmmm - isn't mdtest a MPI benchmark intended for highly concurrent > > metadata workload testing? How representative is it of your actual > > production workload? Is that single threaded? > > > > We just use mdtest to test the performance of a file system, it can't representative > the actual workload and it's single threaded. But we also find that it goes to slow > path when we remove a dir with many files. The cmd is below: > rm -rf xxx. The benchmark doesn't reproduce that. It will only occur when you do sequential inode remove so you have no free inodes in the filesytem and every set of 64 inodes that is remove land in the same chunk and so are immediately freed, leaving zero inodes in the filesyste, i.e. this is a result of sequential inode create, which typically implies "empty filesystem". Aged filesystems typically don't behave like this - they have free indoes distributed all through the inode chunks on disk. And if you do random remove, you won't see it for the same reason, either - the remove phase of mdtest doesn't show any CPU usage in percpu_counter_sum() at all. Please understand taht I'm not saying that it isn't a problem, that it doesn't happen, or that we don't need to change the bahviour. What I'm trying to do is understand the conditions in which this problem occurs and whether they are real world or just micro-benchmark related problems... > Because percpu_counter_batch was initialized to 256 when there are 128 cpu cores. > Then we change the agcount=1024, and it also goes to slow path frequently because > mostly there are no 32768 free inodes. Hmmm. I had forgotten the batch size increased with CPU count like that - I had the thought it was log(ncpus), not linear(ncpus). That kinda points out that scaling the batch size by CPU count is somewhat silly, as concurrency on the counter is not defined by the number of CPUs in the system. INode counter concurrency is defined by the number of inode allocation/free operations that can be performed at any given time. As such, having a counter threshold of 256 * num_cpus doesn't make a whole lot of sense for a 4 AG filesystem because you can't have 128 CPUs all banging on it at once. And even if you have hundreds of AGs, the CPUs are going to be somewhat serialised through the transaction commit path by the CIL locks that are taken to insert objects into the CIL. Hence the unbound concurrency of transaction commit is going to be soaked up by the CIL list insert spin lock, and it's mostly going to be just a dribble of CPUs at a time through the transaction commit accounting code. Ok, so maybe we just need a small batch size here, like a value of 4 or 8 just to avoid every inode alloc/free transaction having to pick up a global spin lock every time... Let me do some testing here... > >> percpu_counter_read that reads the count will cause cache synchronization > >> cost if other cpu changes the count, Maybe it's better not to call > >> percpu_counter_compare if possible. > > > > Depends. Sometimes we trade off ultimate single threaded > > performance and efficiency for substantially better scalability. > > i.e. if we lose 5% on single threaded performance but gain 10x on > > concurrent workloads, then that is a good tradeoff to make. > > > > Agree, I mean that when delta > 0, there is no need to call percpu_counter_compare in > xfs_mod_ifree/icount. It also doesn't totally avoid the issue, either, because of the way we dynamically alloc/free inode clusters and so the counters go both up and down during conmtinuous allocation or continuous freeing. i.e. The pattern we see on inode allocation is: icount += 64 ifree += 64 ifree-- .... ifree-- icount += 64 ifree += 64 ifree-- .... ifree-- And on inode freeing, we see the opposite: ifree++ .... ifree++ icount -= 64 ifree -= 64 ifree++ .... ifree++ icount -= 64 ifree -= 64 So the the values of ifree will still decrement during both free and allocation operations, hence it doesn't avoid the issue totally. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx