On 10/15/20 4:57 AM, Jens Axboe wrote:
On 10/14/20 2:31 PM, Hao_Xu wrote:
Hi Jens,
I've done some tests of the new fix code with readahead disabled from
userspace. Here are some results.
As for the perf reports, since I'm new to kernel internals, I'm still
investigating them. I'll keep digging into what causes the difference
among the four perf reports (copy_user_enhanced_fast_string() in
particular caught my eye).
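For context, one common way to disable readahead from userspace is to
zero the block device's readahead setting (just one possible way; the
device name below is an example, and POSIX_FADV_RANDOM on the file has
a similar effect):

    echo 0 > /sys/block/nvme0n1/queue/read_ahead_kb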
my environment is:
server: physical server
kernel: mainline 5.9.0-rc8+ latest commit 6f2f486d57c4d562cdf4
fs: ext4
device: nvme ssd
fio: 3.20
I did the tests by alternately keeping and commenting out the line:
filp->f_mode |= FMODE_BUF_RASYNC;
in ext4_file_open() in fs/ext4/file.c.
You don't have to modify the kernel; if you use a newer fio, you can
essentially just add:
--force_async=1
after setting the engine to io_uring to get the same effect. Just a
heads up, as that might make it easier for you.
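The full command line would then look roughly like this (a sketch; the
filename, size, block size, and queue depth are just examples taken
from the test described elsewhere in this thread):

    fio --name=bufread --ioengine=io_uring --force_async=1 \
        --rw=randread --bs=4k --size=4G --iodepth=128 \
        --filename=/mnt/ext4/testfile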
The IOPS numbers with readahead disabled from userspace are below.
With the new fix code (which forces readahead):
QD      FMODE_BUF_RASYNC set    FMODE_BUF_RASYNC not set
1       10.8k                   10.3k
2       21.2k                   20.1k
4       41.1k                   39.1k
8       76.1k                   72.2k
16      133k                    126k
32      169k                    147k
64      176k                    160k
128     (1) 187k                (2) 156k
Now the async buffered reads feature looks better in terms of IOPS,
but it still looks similar to the async buffered reads feature in the
mainline code.
I'd say it looks better all around. And what you're completely
forgetting here is that when FMODE_BUF_RASYNC isn't set, then you're
using QD number of async workers to achieve that result. Hence you have
1..128 threads potentially running on that one, vs having a _single_
process running with FMODE_BUF_RASYNC.
I totally agree with this; the server I use has many CPUs, which lets
the multiple async workers run fully in parallel.
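One way to see this while a !FMODE_BUF_RASYNC run is active (a sketch,
assuming the io-wq workers show up as io_wqe_worker kernel threads on
this kernel) is to count them:

    ps -eL -o tid,comm | grep io_wqe_worker | wc -l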
With the mainline code (the fix in commit c8d317aa1887 ("io_uring: fix
async buffered reads when readahead is disabled")):
QD      FMODE_BUF_RASYNC set    FMODE_BUF_RASYNC not set
1       10.9k                   10.2k
2       21.6k                   20.2k
4       41.0k                   39.9k
8       79.7k                   75.9k
16      141k                    138k
32      169k                    237k
64      190k                    316k
128     (3) 195k                (4) 315k
Comparing the numbers at (1)(2)(3)(4), the new fix doesn't seem to fix
the slowdown; it just makes number (4) drop to number (2).
Not sure why there would be a difference between 2 and 4, that does seem
odd. I'll see if I can reproduce that. More questions below.
The perf reports for situations (1)(2)(3)(4) are:
(1)
# Overhead  Command  Shared Object     Symbol
# ........  .......  ................  ..............................
#
    10.19%  fio      [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
     8.53%  fio      fio               [.] clock_thread_fn
     4.67%  fio      [kernel.vmlinux]  [k] xas_load
     2.18%  fio      [kernel.vmlinux]  [k] clear_page_erms
     2.02%  fio      libc-2.24.so      [.] __memset_avx2_erms
     1.55%  fio      [kernel.vmlinux]  [k] mutex_unlock
     1.51%  fio      [kernel.vmlinux]  [k] shmem_getpage_gfp
     1.48%  fio      [kernel.vmlinux]  [k] native_irq_return_iret
     1.48%  fio      [kernel.vmlinux]  [k] get_page_from_freelist
     1.46%  fio      [kernel.vmlinux]  [k] generic_file_buffered_read
     1.45%  fio      [nvme]            [k] nvme_irq
     1.25%  fio      [kernel.vmlinux]  [k] __list_del_entry_valid
     1.22%  fio      [kernel.vmlinux]  [k] free_pcppages_bulk
     1.15%  fio      [kernel.vmlinux]  [k] _raw_spin_lock
     1.12%  fio      fio               [.] get_io_u
     0.81%  fio      [ext4]            [k] ext4_mpage_readpages
     0.78%  fio      fio               [.] fio_gettime
     0.76%  fio      [kernel.vmlinux]  [k] find_get_entries
     0.75%  fio      [vdso]            [.] __vdso_clock_gettime
     0.73%  fio      [kernel.vmlinux]  [k] release_pages
     0.68%  fio      [kernel.vmlinux]  [k] find_get_entry
     0.68%  fio      fio               [.] io_u_queued_complete
     0.67%  fio      [kernel.vmlinux]  [k] io_async_buf_func
     0.65%  fio      [kernel.vmlinux]  [k] io_submit_sqes
These profiles are of marginal use, as you're only profiling fio itself,
not all of the async workers that are running for !FMODE_BUF_RASYNC.
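A system-wide profile should capture those workers as well (a sketch;
run it alongside the fio job and size the sleep to cover the run):

    perf record -a -g -- sleep 10
    perf report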
Ah, I got it. Thanks.
How long does the test run? It looks suspect that clock_thread_fn shows
up in the profiles at all.
It runs for about 5 seconds: a randread of a 4G file with bs=4k.
And is it actually doing IO, or are you using shm/tmpfs for this test?
Isn't ext4 hosting the file? I see a lot of shmem_getpage_gfp(), which
makes me a little confused.
I'm using ext4 on a real nvme ssd device. From the call stack, the
shmem_getpage_gfp() comes from __memset_avx2_erms in libc.
There are ext4-related functions in all four reports.
I'm doing more checking to see whether it is my test process that
causes the high IOPS in case (4).