On Fri, Nov 15, 2019 at 09:08:24AM +0800, Ming Lei wrote: > Hi Dave, > > On Fri, Nov 15, 2019 at 10:54:15AM +1100, Dave Chinner wrote: > > On Thu, Nov 14, 2019 at 07:31:53PM +0800, Ming Lei wrote: > > > Hi Guys, > > > > > > It is found that single AIO thread is migrated crazely by scheduler, and > > > the migrate period can be < 10ms. Follows the test a): > > > > > > - run single job fio[1] for 30 seconds: > > > ./xfs_complete 512 > > > > > > - observe fio io thread migration via bcc trace[2], and the migration > > > times can reach 5k ~ 10K in above test. In this test, CPU utilization > > > is 30~40% on the CPU running fio IO thread. > > > > Using the default scheduler tunings: > > > > kernel.sched_wakeup_granularity_ns = 4000000 > > kernel.sched_min_granularity_ns = 3000000 > > > > I'm not seeing any migrations at all on a 16p x86-64 box. Even with > > the tunings you suggest: > > > > sysctl kernel.sched_min_granularity_ns=10000000 > > sysctl kernel.sched_wakeup_granularity_ns=15000000 > > > > There are no migrations at all. > > Looks I forget to pass $BS to the fio command line in the script posted, > please try the following script again and run './xfs_complete 512' first. So I ran 4kB IOs instead of 512 byte IOs. Shouldn't make any difference, really - it'll still be CPU bound... <snip script> > In my test just done, the migration count is 12K in 30s fio running. > Sometimes the number can be quite less, but most of times, the number > is very big(> 5k). With my iomap-dio-overwrite patch and 512 byte IOs: $ sudo trace-cmd show |grep sched_migrate_task |wc -l 112 $ sudo trace-cmd show |grep sched_migrate_task |grep fio |wc -l 22 Without the iomap-dio-overwrite patch: $ sudo trace-cmd show |grep sched_migrate_task |wc -l 99 $ sudo trace-cmd show |grep sched_migrate_task |grep fio |wc -l 9 $ There are -less- migrations when using the workqueue for everything. But it's so low in either case that it's just noise. Performance is identical for the two patches... > > > BTW, the tests are run on latest linus tree(5.4-rc7) in KVM guest, and the > > > fio test is created for simulating one real performance report which is > > > proved to be caused by frequent aio submission thread migration. > > > > What is the underlying hardware? I'm running in a 16p KVM guest on a > > 16p/32t x86-64 using 5.4-rc7, and I don't observe any significant > > CPU migration occurring at all from your test workload. > > It is a KVM guest, which is running on my Lenova T460p Fedora 29 laptop, > and the host kernel is 5.2.18-100.fc29.x86_64, follows the guest info: Ok, so what are all the custom distro kernel tunings that userspace does for the kernel? > [root@ktest-01 ~]# lscpu > Architecture: x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > CPU(s): 8 > On-line CPU(s) list: 0-7 > Thread(s) per core: 1 > Core(s) per socket: 4 > Socket(s): 2 > NUMA node(s): 2 Curious. You've configured it as two CPU sockets. If you make it a single socket, do your delay problems go away? The snippet of trace output you showed indicated it bouncing around CPUs on a single node (cpus 0-3), so maybe it has something to do with way the scheduler is interacting with non-zero NUMA distances... > Vendor ID: GenuineIntel > CPU family: 6 > Model: 94 > Model name: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz > Stepping: 3 > CPU MHz: 2712.000 > BogoMIPS: 5424.00 > Virtualization: VT-x > Hypervisor vendor: KVM > Virtualization type: full > L1d cache: 32K > L1i cache: 32K > L2 cache: 4096K > L3 cache: 16384K > NUMA node0 CPU(s): 0-3 > NUMA node1 CPU(s): 4-7 > Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmxp That seems like a very minimal set of CPU flags - looks like you are not actually passing the actual host CPU capabilities through to the guest. That means it will be doing the slowest, most generic spectre/meltdown mitigations, right? Also, shouldn't lscpu be telling us all the CPU bug mitigations in place? >From my test system: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 40 bits physical, 48 bits virtual CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 1 Core(s) per socket: 1 Socket(s): 16 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 45 Model name: Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz Stepping: 7 CPU MHz: 2199.998 BogoMIPS: 4399.99 Virtualization: VT-x Hypervisor vendor: KVM Virtualization type: full L1d cache: 512 KiB L1i cache: 512 KiB L2 cache: 64 MiB L3 cache: 256 MiB NUMA node0 CPU(s): 0-15 Vulnerability L1tf: Mitigation; PTE Inversion; VMX flush not necessary, SMT disabled Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown Vulnerability Meltdown: Vulnerable Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp l m constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm cpuid_fault ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpi d tsc_adjust xsaveopt arat umip md_clear arch_capabilities So, to rule out that it has something to do with kernel config, I just ran up a kernel built with your config.gz, and the problem does not manifest. The only difference was a few drivers I needed to boot my test VMs, and I was previously not using paravirt spinlocks. So, I still can't reproduce the problem. Indeed, the workload gets nowhere near single CPU bound with your config - it's using half the CPU for the same performance: %Cpu2 : 19.8 us, 28.2 sy, 0.0 ni, 0.0 id, 52.0 wa, 0.0 hi, 0.0 %si, 0.0 st Basically, it's spending half it's time waiting on IO. If I wind the delay down to 1000ns: %Cpu1 : 42.2 us, 42.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 15.6 %si, 0.0 st it spends an awful lot of time in soft-interrupt, but is back to being CPU bound. Despite this, I still don't see any significant amount of task migration. In fact, I see a lot less with your kernel config that I do with my original kernel config, because the CPU load was far lower. > Just run a quick test several times after applying the above patch, and looks it > does make a big difference in test './xfs_complete 512' wrt. fio io thread migration. There's something very different about your system, and it doesn't appear to be a result of the kernel code itself. I think you're going to have to do all the testing at the moment, Ming, because it's clear that my test systems do not show up the problems even when using the same kernel config as you do... If you reconfig you kvm setup to pass all the native host side cpu flags through to the guest, does the problem go away? I think adding "-cpu host" to your qemu command line will do that... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx