Re: [linus:master] [readahead] ab4443fe3c: vm-scalability.throughput -21.4% regression

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Mar 07, 2024 at 06:19:46PM +0000, Matthew Wilcox wrote:
> On Thu, Mar 07, 2024 at 10:23:08AM +0100, Jan Kara wrote:
> > Thanks for testing! This is an interesting result and certainly unexpected
> > for me. The readahead code allocates naturally aligned pages so based on
> > the distribution of allocations it seems that before commit ab4443fe3ca6
> > readahead window was at least 32 pages (128KB) aligned and so we allocated
> > order 5 pages. After the commit, the readahead window somehow ended up only
> > aligned to 20 modulo 32. To follow natural alignment and fill 128KB
> > readahead window we allocated order 2 page (got us to offset 24 modulo 32),
> > then order 3 page (got us to offset 0 modulo 32), order 4 page (larger
> > would not fit in 128KB readahead window now), and order 2 page to finish
> > filling the readahead window.
> > 
> > Now I'm not 100% sure why the readahead window alignment changed with
> > different rounding when placing readahead mark - probably that's some
> > artifact when readahead window is tiny in the beginning before we scale it
> > up (I'll verify by tracing whether everything ends up looking correctly
> > with the current code). So I don't expect this is a problem in ab4443fe3ca6
> > as such but it exposes the issue that readahead page insertion code should
> > perhaps strive to achieve better readahead window alignment with logical
> > file offset even at the cost of occasionally performing somewhat shorter
> > readahead. I'll look into this once I dig out of the huge heap of email
> > after vacation...
> 
> I was surprised by what you said here, so I went and re-read the code
> and it doesn't work the way I thought it did.  So I had a good long think
> about how it _should_ work, and I looked for some more corner conditions,
> and this is what I came up with.
> 
> The first thing I've done is separate out the two limits.  The EOF is
> a hard limit; we will not allocate pages beyond EOF.  The ra->size is
> a soft limit; we will allocate pages beyond ra->size, but not too far.
> 
> The second thing I noticed is that index + ra_size could wrap.  So add
> a check for that, and set it to ULONG_MAX.  index + ra_size - async_size
> could also wrap, but this is harmless.  We certainly don't want to kick
> off any more readahead in this circumstance, so leaving 'mark' outside
> the range [index..ULONG_MAX] is just fine.
> 
> The third thing is that we could allocate a folio which contains a page
> at ULONG_MAX.  We don't really want that in the page cache; it makes
> filesystems more complicated if they have to check for that, and we
> don't allow an order-0 folio at ULONG_MAX, so there's no need for it.
> This _should_ already be prohibited by the "Don't allocate pages past EOF"
> check, but let's explicitly prohibit it.
> 
> Compile tested only.

We applied the diff on top of commit ab4443fe3ca6 but got a kernel panic
when running the dd test:

[ 109.259674][ C46] watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [ dd:8616]
[ 109.268946][ C46] Modules linked in: xfs loop intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp btrfs blake2b_generic kvm_intel xor kvm irqbypass crct10dif_pclmul crc32_pclmul sd_mod raid6_pq ghash_clmulni_intel libcrc32c crc32c_intel sg sha512_ssse3 i915 nvme rapl drm_buddy nvme_core intel_gtt ahci t10_pi drm_display_helper ast intel_cstate libahci ipmi_ssif ttm drm_shmem_helper mei_me i2c_i801 crc64_rocksoft_generic video crc64_rocksoft acpi_ipmi intel_uncore megaraid_sas mei drm_kms_helper joydev libata i2c_ismt i2c_smbus dax_hmem crc64 wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter drm fuse ip_tables
[ 109.336216][ C46] CPU: 46 PID: 8616 Comm: dd Tainted: G          I        6.8.0-rc1-00005-g6c6de6e42e46 #1
[ 109.347892][ C46] Hardware name: NULL NULL/NULL, BIOS 05.02.01 05/12/2023
[ 109.356324][ C46] RIP: 0010:page_cache_ra_order (mm/readahead.c:521)
[ 109.363394][ C46] Code: cf 48 89 e8 4c 89 fa 48 d3 e0 48 01 c2 75 09 83 e9 01 48 89 e8 48 d3 e0 49 8d 77 ff 48 01 f0 49 39 c6 73 11 83 e9 01 48 89 e8 <48> d3 e0 48 01 f0 49 39 c6 72 ef 31 c0 83 f9 01 8b 3c 24 0f 44 c8
All code
========
   0:   cf                      iret
   1:   48 89 e8                mov    %rbp,%rax
   4:   4c 89 fa                mov    %r15,%rdx
   7:   48 d3 e0                shl    %cl,%rax
   a:   48 01 c2                add    %rax,%rdx
   d:   75 09                   jne    0x18
   f:   83 e9 01                sub    $0x1,%ecx
  12:   48 89 e8                mov    %rbp,%rax
  15:   48 d3 e0                shl    %cl,%rax
  18:   49 8d 77 ff             lea    -0x1(%r15),%rsi
  1c:   48 01 f0                add    %rsi,%rax
  1f:   49 39 c6                cmp    %rax,%r14
  22:   73 11                   jae    0x35
  24:   83 e9 01                sub    $0x1,%ecx
  27:   48 89 e8                mov    %rbp,%rax
  2a:*  48 d3 e0                shl    %cl,%rax         <-- trapping instruction
  2d:   48 01 f0                add    %rsi,%rax
  30:   49 39 c6                cmp    %rax,%r14
  33:   72 ef                   jb     0x24
  35:   31 c0                   xor    %eax,%eax
  37:   83 f9 01                cmp    $0x1,%ecx
  3a:   8b 3c 24                mov    (%rsp),%edi
  3d:   0f 44 c8                cmove  %eax,%ecx

Code starting with the faulting instruction
===========================================
   0:   48 d3 e0                shl    %cl,%rax
   3:   48 01 f0                add    %rsi,%rax
   6:   49 39 c6                cmp    %rax,%r14
   9:   72 ef                   jb     0xfffffffffffffffa
   b:   31 c0                   xor    %eax,%eax
   d:   83 f9 01                cmp    $0x1,%ecx
  10:   8b 3c 24                mov    (%rsp),%edi
  13:   0f 44 c8                cmove  %eax,%ecx
[ 109.385897][ C46] RSP: 0018:ffa0000012837c00 EFLAGS: 00000206
[ 109.393176][ C46] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000020159674
[ 109.402607][ C46] RDX: 000000000004924c RSI: 0000000000049249 RDI: ff11003f6f3ae7c0
[ 109.412038][ C46] RBP: 0000000000000001 R08: 0000000000038700 R09: 0000000000000013
[ 109.421447][ C46] R10: 0000000000022c04 R11: 0000000000000001 R12: ffa0000012837cb0
[ 109.430868][ C46] R13: ffd400004fee4b40 R14: 0000000000049249 R15: 000000000004924a
[ 109.440270][ C46] FS:  00007f777e884640(0000) GS:ff11003f6f380000(0000) knlGS:0000000000000000
[ 109.450756][ C46] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 109.458603][ C46] CR2: 00007f2d4d425020 CR3: 00000001b4f84005 CR4: 0000000000f71ef0
[ 109.468003][ C46] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 109.477392][ C46] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 109.486794][ C46] PKRU: 55555554
[ 109.491197][ C46] Call Trace:
[ 109.495300][ C46]  <IRQ>
[ 109.498922][ C46] ? watchdog_timer_fn (kernel/watchdog.c:548)
[ 109.505074][ C46] ? __pfx_watchdog_timer_fn (kernel/watchdog.c:466)
[ 109.511620][ C46] ? __hrtimer_run_queues (kernel/time/hrtimer.c:1688 kernel/time/hrtimer.c:1752)
[ 109.518059][ C46] ? hrtimer_interrupt (kernel/time/hrtimer.c:1817)
[ 109.524088][ C46] ? __sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1065 arch/x86/kernel/apic/apic.c:1082)
[ 109.531286][ C46] ? sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1076 (discriminator 14))
[ 109.538190][ C46]  </IRQ>
[ 109.541867][ C46]  <TASK>
[ 109.545545][ C46] ? asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:649)
[ 109.552832][ C46] ? page_cache_ra_order (mm/readahead.c:521)
[ 109.559122][ C46] filemap_get_pages (mm/filemap.c:2500)
[ 109.564935][ C46] filemap_read (mm/filemap.c:2594)
[ 109.570241][ C46] xfs_file_buffered_read (fs/xfs/xfs_file.c:315) xfs
[ 109.577202][ C46] xfs_file_read_iter (fs/xfs/xfs_file.c:341) xfs
[ 109.583749][ C46] vfs_read (include/linux/fs.h:2079 fs/read_write.c:395 fs/read_write.c:476)
[ 109.588762][ C46] ksys_read (fs/read_write.c:619)
[ 109.593660][ C46] do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
[ 109.599038][ C46] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
[ 109.605982][ C46] RIP: 0033:0x7f777e78d3ce
[ 109.611255][ C46] Code: c0 e9 b6 fe ff ff 50 48 8d 3d 6e 08 0b 00 e8 69 01 02 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28
All code
========
   0:   c0 e9 b6                shr    $0xb6,%cl
   3:   fe                      (bad)
   4:   ff                      (bad)
   5:   ff 50 48                call   *0x48(%rax)
   8:   8d 3d 6e 08 0b 00       lea    0xb086e(%rip),%edi        # 0xb087c
   e:   e8 69 01 02 00          call   0x2017c
  13:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  1a:   00 00
  1c:   64 8b 04 25 18 00 00    mov    %fs:0x18,%eax
  23:   00
  24:   85 c0                   test   %eax,%eax
  26:   75 14                   jne    0x3c
  28:   0f 05                   syscall
  2a:*  48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax         <-- trapping instruction
  30:   77 5a                   ja     0x8c
  32:   c3                      ret
  33:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  3a:   00 00
  3c:   48 83 ec 28             sub    $0x28,%rsp

Code starting with the faulting instruction
===========================================
   0:   48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
   6:   77 5a                   ja     0x62
   8:   c3                      ret
   9:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  10:   00 00
  12:   48 83 ec 28             sub    $0x28,%rsp
[ 109.633619][ C46] RSP: 002b:00007ffc78ab2778 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 109.643392][ C46] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f777e78d3ce
[ 109.652686][ C46] RDX: 0000000000001000 RSI: 00005629c0f7c000 RDI: 0000000000000000
[ 109.661976][ C46] RBP: 00005629c0f7c000 R08: 00005629c0f7bd30 R09: 00007f777e870be0
[ 109.671251][ C46] R10: 00005629c0f7c000 R11: 0000000000000246 R12: 0000000000000000
[ 109.680528][ C46] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffffffffffff
[ 109.689808][ C46]  </TASK>
[ 109.693512][ C46] Kernel panic - not syncing: softlockup: hung tasks


# mm/readahead.c

486 void page_cache_ra_order(struct readahead_control *ractl,
487                 struct file_ra_state *ra, unsigned int new_order)
488 {
489         struct address_space *mapping = ractl->mapping;
490         pgoff_t index = readahead_index(ractl);
491         pgoff_t last = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
492         pgoff_t limit = index + ra->size;
493         pgoff_t mark = index + ra->size - ra->async_size;
494         int err = 0;
495         gfp_t gfp = readahead_gfp_mask(mapping);
496
497         if (!mapping_large_folio_support(mapping) || ra->size < 4)
498                 goto fallback;
499
500         if (new_order < MAX_PAGECACHE_ORDER) {
501                 new_order += 2;
502                 if (new_order > MAX_PAGECACHE_ORDER)
503                         new_order = MAX_PAGECACHE_ORDER;
504                 while ((1 << new_order) > ra->size)
505                         new_order--;
506         }
507
508         if (limit < index)
509                 limit = ULONG_MAX;
510
511         filemap_invalidate_lock_shared(mapping);
512         while (index < limit) {
513                 unsigned int order = new_order;
514
515                 /* Align with smaller pages if needed */
516                 if (index & ((1UL << order) - 1))
517                         order = __ffs(index);
518                 if (index + (1UL << order) == 0)
519                         order--;
520                 /* Don't allocate pages past EOF */
521                 while (index + (1UL << order) - 1 > last)
522                         order--;
523                 /* THP machinery does not support order-1 */
524                 if (order == 1)
525                         order = 0;
526                 err = ra_alloc_folio(ractl, index, mark, order, gfp);
527                 if (err)
528                         break;
529                 index += 1UL << order;
530         }
531
532         if (index > limit) {
533                 ra->size += index - limit - 1;
534                 ra->async_size += index - limit - 1;
535         }
536
537         read_pages(ractl);
538         filemap_invalidate_unlock_shared(mapping);
539
540         /*
541          * If there were already pages in the page cache, then we may have
542          * left some gaps.  Let the regular readahead code take care of this
543          * situation.
544          */
545         if (!err)
546                 return;
547 fallback:
548         do_page_cache_ra(ractl, ra->size, ra->async_size);
549 }


Regards,
Yujie

> diff --git a/mm/readahead.c b/mm/readahead.c
> index 130c0e7df99f..742e1f39035b 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -488,7 +488,8 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  {
>  	struct address_space *mapping = ractl->mapping;
>  	pgoff_t index = readahead_index(ractl);
> -	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
> +	pgoff_t last = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
> +	pgoff_t limit = index + ra->size;
>  	pgoff_t mark = index + ra->size - ra->async_size;
>  	int err = 0;
>  	gfp_t gfp = readahead_gfp_mask(mapping);
> @@ -496,23 +497,26 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  	if (!mapping_large_folio_support(mapping) || ra->size < 4)
>  		goto fallback;
>  
> -	limit = min(limit, index + ra->size - 1);
> -
>  	if (new_order < MAX_PAGECACHE_ORDER) {
>  		new_order += 2;
>  		new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
>  		new_order = min_t(unsigned int, new_order, ilog2(ra->size));
>  	}
>  
> +	if (limit < index)
> +		limit = ULONG_MAX;
>  	filemap_invalidate_lock_shared(mapping);
> -	while (index <= limit) {
> +	while (index < limit) {
>  		unsigned int order = new_order;
>  
>  		/* Align with smaller pages if needed */
>  		if (index & ((1UL << order) - 1))
>  			order = __ffs(index);
> +		/* Avoid wrap */
> +		if (index + (1UL << order) == 0)
> +			order--;
>  		/* Don't allocate pages past EOF */
> -		while (index + (1UL << order) - 1 > limit)
> +		while (index + (1UL << order) - 1 > last)
>  			order--;
>  		err = ra_alloc_folio(ractl, index, mark, order, gfp);
>  		if (err)




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux