On 10/14/20 2:31 PM, Hao_Xu wrote:
> Hi Jens,
> I've done some tests for the new fix code with readahead disabled from
> userspace. Here are some results.
> For the perf reports, since I'm new to kernel stuff, I'm still
> investigating them.
> I'll keep looking into the issue that causes the difference among the
> four perf reports (in which copy_user_enhanced_fast_string() catches
> my eye).
>
> my environment is:
> server: physical server
> kernel: mainline 5.9.0-rc8+ latest commit 6f2f486d57c4d562cdf4
> fs: ext4
> device: nvme ssd
> fio: 3.20
>
> I did the tests by setting and commenting out the line:
>     filp->f_mode |= FMODE_BUF_RASYNC;
> in fs/ext4/file.c ext4_file_open()

You don't have to modify the kernel; if you use a newer fio, then you
can essentially just add:

--force_async=1

after setting the engine to io_uring to get the same effect. Just a
heads up, as that might make it easier for you. There's a rough example
invocation at the end of this mail.

> the IOPS with readahead disabled from userspace are below:
>
> with the new fix code (force readahead)
> QD/Test   FMODE_BUF_RASYNC set   FMODE_BUF_RASYNC not set
> 1         10.8k                  10.3k
> 2         21.2k                  20.1k
> 4         41.1k                  39.1k
> 8         76.1k                  72.2k
> 16        133k                   126k
> 32        169k                   147k
> 64        176k                   160k
> 128       (1) 187k               (2) 156k
>
> now the async buffered reads feature looks better in terms of IOPS,
> but it still looks similar to the async buffered reads feature in the
> mainline code.

I'd say it looks better all around. And what you're completely
forgetting here is that when FMODE_BUF_RASYNC isn't set, then you're
using QD number of async workers to achieve that result. Hence you have
1..128 threads potentially running on that one, vs having a _single_
process running with FMODE_BUF_RASYNC.

> with mainline code (the fix code in commit c8d317aa1887 ("io_uring: fix
> async buffered reads when readahead is disabled"))
> QD/Test   FMODE_BUF_RASYNC set   FMODE_BUF_RASYNC not set
> 1         10.9k                  10.2k
> 2         21.6k                  20.2k
> 4         41.0k                  39.9k
> 8         79.7k                  75.9k
> 16        141k                   138k
> 32        169k                   237k
> 64        190k                   316k
> 128       (3) 195k               (4) 315k
>
> Considering the numbers at (1)(2)(3)(4), the new fix doesn't seem to
> fix the slowdown, but rather turns number (4) into number (2).

Not sure why there would be a difference between 2 and 4, that does
seem odd. I'll see if I can reproduce that. More questions below.

> the perf reports of the (1)(2)(3)(4) situations are:
> (1)
>  9 # Overhead  Command  Shared Object      Symbol
> 10 # ........  .......  .................  ..............................................
> 11 #
> 12    10.19%  fio  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
> 13     8.53%  fio  fio               [.] clock_thread_fn
> 14     4.67%  fio  [kernel.vmlinux]  [k] xas_load
> 15     2.18%  fio  [kernel.vmlinux]  [k] clear_page_erms
> 16     2.02%  fio  libc-2.24.so      [.] __memset_avx2_erms
> 17     1.55%  fio  [kernel.vmlinux]  [k] mutex_unlock
> 18     1.51%  fio  [kernel.vmlinux]  [k] shmem_getpage_gfp
> 19     1.48%  fio  [kernel.vmlinux]  [k] native_irq_return_iret
> 20     1.48%  fio  [kernel.vmlinux]  [k] get_page_from_freelist
> 21     1.46%  fio  [kernel.vmlinux]  [k] generic_file_buffered_read
> 22     1.45%  fio  [nvme]            [k] nvme_irq
> 23     1.25%  fio  [kernel.vmlinux]  [k] __list_del_entry_valid
> 24     1.22%  fio  [kernel.vmlinux]  [k] free_pcppages_bulk
> 25     1.15%  fio  [kernel.vmlinux]  [k] _raw_spin_lock
> 26     1.12%  fio  fio               [.] get_io_u
> 27     0.81%  fio  [ext4]            [k] ext4_mpage_readpages
> 28     0.78%  fio  fio               [.] fio_gettime
> 29     0.76%  fio  [kernel.vmlinux]  [k] find_get_entries
> 30     0.75%  fio  [vdso]            [.] __vdso_clock_gettime
> 31     0.73%  fio  [kernel.vmlinux]  [k] release_pages
> 32     0.68%  fio  [kernel.vmlinux]  [k] find_get_entry
> 33     0.68%  fio  fio               [.] io_u_queued_complete
> 34     0.67%  fio  [kernel.vmlinux]  [k] io_async_buf_func
> 35     0.65%  fio  [kernel.vmlinux]  [k] io_submit_sqes

These profiles are of marginal use, as you're only profiling fio
itself, not all of the async workers that are running for
!FMODE_BUF_RASYNC. How long does the test run? It looks suspect that
clock_thread_fn shows up in the profiles at all. And is it actually
doing IO, or are you using shm/tmpfs for this test? Isn't ext4 hosting
the file? I see a lot of shmem_getpage_gfp(), which makes me a little
confused.
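To make the profiles more useful, I'd capture the whole system rather
than just the fio task, so the io-wq workers show up in the same
report. Untested sketch; the output file name and the 30 second window
are just placeholders, size the window to your actual fio runtime:

  # run this while fio is running in another shell
  perf record -a -g -o perf.data.bufread -- sleep 30
  perf report -i perf.data.bufread --sort comm,dso,symbol

Sorting by comm first makes it easy to see how much time goes to the
worker threads vs the fio process itself in the !FMODE_BUF_RASYNC case.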
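And for reference, here's roughly the kind of invocation I meant with
the --force_async suggestion above. The file name, size, and job
parameters are made up, so substitute whatever you've been running:

  # buffered random reads through io_uring, asking for every request
  # to be executed async instead of inline
  fio --name=bufread --filename=/mnt/ext4/testfile --size=4G \
      --ioengine=io_uring --buffered=1 --force_async=1 \
      --rw=randread --bs=4k --iodepth=128 \
      --runtime=30 --time_based

Running the same job with and without --force_async=1 should get you
roughly the same comparison as your two kernel configurations, without
having to rebuild anything.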
-- 
Jens Axboe