On Thu, Mar 07, 2024 at 06:19:46PM +0000, Matthew Wilcox wrote: > On Thu, Mar 07, 2024 at 10:23:08AM +0100, Jan Kara wrote: > > Thanks for testing! This is an interesting result and certainly unexpected > > for me. The readahead code allocates naturally aligned pages so based on > > the distribution of allocations it seems that before commit ab4443fe3ca6 > > readahead window was at least 32 pages (128KB) aligned and so we allocated > > order 5 pages. After the commit, the readahead window somehow ended up only > > aligned to 20 modulo 32. To follow natural alignment and fill 128KB > > readahead window we allocated order 2 page (got us to offset 24 modulo 32), > > then order 3 page (got us to offset 0 modulo 32), order 4 page (larger > > would not fit in 128KB readahead window now), and order 2 page to finish > > filling the readahead window. > > > > Now I'm not 100% sure why the readahead window alignment changed with > > different rounding when placing readahead mark - probably that's some > > artifact when readahead window is tiny in the beginning before we scale it > > up (I'll verify by tracing whether everything ends up looking correctly > > with the current code). So I don't expect this is a problem in ab4443fe3ca6 > > as such but it exposes the issue that readahead page insertion code should > > perhaps strive to achieve better readahead window alignment with logical > > file offset even at the cost of occasionally performing somewhat shorter > > readahead. I'll look into this once I dig out of the huge heap of email > > after vacation... > > I was surprised by what you said here, so I went and re-read the code > and it doesn't work the way I thought it did. So I had a good long think > about how it _should_ work, and I looked for some more corner conditions, > and this is what I came up with. > > The first thing I've done is separate out the two limits. The EOF is > a hard limit; we will not allocate pages beyond EOF. The ra->size is > a soft limit; we will allocate pages beyond ra->size, but not too far. > > The second thing I noticed is that index + ra_size could wrap. So add > a check for that, and set it to ULONG_MAX. index + ra_size - async_size > could also wrap, but this is harmless. We certainly don't want to kick > off any more readahead in this circumstance, so leaving 'mark' outside > the range [index..ULONG_MAX] is just fine. > > The third thing is that we could allocate a folio which contains a page > at ULONG_MAX. We don't really want that in the page cache; it makes > filesystems more complicated if they have to check for that, and we > don't allow an order-0 folio at ULONG_MAX, so there's no need for it. > This _should_ already be prohibited by the "Don't allocate pages past EOF" > check, but let's explicitly prohibit it. > > Compile tested only. We applied the diff on top of commit ab4443fe3ca6 but got a kernel panic when running the dd test: [ 109.259674][ C46] watchdog: BUG: soft lockup - CPU#46 stuck for 22s! [ dd:8616] [ 109.268946][ C46] Modules linked in: xfs loop intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp btrfs blake2b_generic kvm_intel xor kvm irqbypass crct10dif_pclmul crc32_pclmul sd_mod raid6_pq ghash_clmulni_intel libcrc32c crc32c_intel sg sha512_ssse3 i915 nvme rapl drm_buddy nvme_core intel_gtt ahci t10_pi drm_display_helper ast intel_cstate libahci ipmi_ssif ttm drm_shmem_helper mei_me i2c_i801 crc64_rocksoft_generic video crc64_rocksoft acpi_ipmi intel_uncore megaraid_sas mei drm_kms_helper joydev libata i2c_ismt i2c_smbus dax_hmem crc64 wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter drm fuse ip_tables [ 109.336216][ C46] CPU: 46 PID: 8616 Comm: dd Tainted: G I 6.8.0-rc1-00005-g6c6de6e42e46 #1 [ 109.347892][ C46] Hardware name: NULL NULL/NULL, BIOS 05.02.01 05/12/2023 [ 109.356324][ C46] RIP: 0010:page_cache_ra_order (mm/readahead.c:521) [ 109.363394][ C46] Code: cf 48 89 e8 4c 89 fa 48 d3 e0 48 01 c2 75 09 83 e9 01 48 89 e8 48 d3 e0 49 8d 77 ff 48 01 f0 49 39 c6 73 11 83 e9 01 48 89 e8 <48> d3 e0 48 01 f0 49 39 c6 72 ef 31 c0 83 f9 01 8b 3c 24 0f 44 c8 All code ======== 0: cf iret 1: 48 89 e8 mov %rbp,%rax 4: 4c 89 fa mov %r15,%rdx 7: 48 d3 e0 shl %cl,%rax a: 48 01 c2 add %rax,%rdx d: 75 09 jne 0x18 f: 83 e9 01 sub $0x1,%ecx 12: 48 89 e8 mov %rbp,%rax 15: 48 d3 e0 shl %cl,%rax 18: 49 8d 77 ff lea -0x1(%r15),%rsi 1c: 48 01 f0 add %rsi,%rax 1f: 49 39 c6 cmp %rax,%r14 22: 73 11 jae 0x35 24: 83 e9 01 sub $0x1,%ecx 27: 48 89 e8 mov %rbp,%rax 2a:* 48 d3 e0 shl %cl,%rax <-- trapping instruction 2d: 48 01 f0 add %rsi,%rax 30: 49 39 c6 cmp %rax,%r14 33: 72 ef jb 0x24 35: 31 c0 xor %eax,%eax 37: 83 f9 01 cmp $0x1,%ecx 3a: 8b 3c 24 mov (%rsp),%edi 3d: 0f 44 c8 cmove %eax,%ecx Code starting with the faulting instruction =========================================== 0: 48 d3 e0 shl %cl,%rax 3: 48 01 f0 add %rsi,%rax 6: 49 39 c6 cmp %rax,%r14 9: 72 ef jb 0xfffffffffffffffa b: 31 c0 xor %eax,%eax d: 83 f9 01 cmp $0x1,%ecx 10: 8b 3c 24 mov (%rsp),%edi 13: 0f 44 c8 cmove %eax,%ecx [ 109.385897][ C46] RSP: 0018:ffa0000012837c00 EFLAGS: 00000206 [ 109.393176][ C46] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000020159674 [ 109.402607][ C46] RDX: 000000000004924c RSI: 0000000000049249 RDI: ff11003f6f3ae7c0 [ 109.412038][ C46] RBP: 0000000000000001 R08: 0000000000038700 R09: 0000000000000013 [ 109.421447][ C46] R10: 0000000000022c04 R11: 0000000000000001 R12: ffa0000012837cb0 [ 109.430868][ C46] R13: ffd400004fee4b40 R14: 0000000000049249 R15: 000000000004924a [ 109.440270][ C46] FS: 00007f777e884640(0000) GS:ff11003f6f380000(0000) knlGS:0000000000000000 [ 109.450756][ C46] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 109.458603][ C46] CR2: 00007f2d4d425020 CR3: 00000001b4f84005 CR4: 0000000000f71ef0 [ 109.468003][ C46] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 109.477392][ C46] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 109.486794][ C46] PKRU: 55555554 [ 109.491197][ C46] Call Trace: [ 109.495300][ C46] <IRQ> [ 109.498922][ C46] ? watchdog_timer_fn (kernel/watchdog.c:548) [ 109.505074][ C46] ? __pfx_watchdog_timer_fn (kernel/watchdog.c:466) [ 109.511620][ C46] ? __hrtimer_run_queues (kernel/time/hrtimer.c:1688 kernel/time/hrtimer.c:1752) [ 109.518059][ C46] ? hrtimer_interrupt (kernel/time/hrtimer.c:1817) [ 109.524088][ C46] ? __sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1065 arch/x86/kernel/apic/apic.c:1082) [ 109.531286][ C46] ? sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1076 (discriminator 14)) [ 109.538190][ C46] </IRQ> [ 109.541867][ C46] <TASK> [ 109.545545][ C46] ? asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:649) [ 109.552832][ C46] ? page_cache_ra_order (mm/readahead.c:521) [ 109.559122][ C46] filemap_get_pages (mm/filemap.c:2500) [ 109.564935][ C46] filemap_read (mm/filemap.c:2594) [ 109.570241][ C46] xfs_file_buffered_read (fs/xfs/xfs_file.c:315) xfs [ 109.577202][ C46] xfs_file_read_iter (fs/xfs/xfs_file.c:341) xfs [ 109.583749][ C46] vfs_read (include/linux/fs.h:2079 fs/read_write.c:395 fs/read_write.c:476) [ 109.588762][ C46] ksys_read (fs/read_write.c:619) [ 109.593660][ C46] do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) [ 109.599038][ C46] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129) [ 109.605982][ C46] RIP: 0033:0x7f777e78d3ce [ 109.611255][ C46] Code: c0 e9 b6 fe ff ff 50 48 8d 3d 6e 08 0b 00 e8 69 01 02 00 66 0f 1f 84 00 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a c3 66 0f 1f 84 00 00 00 00 00 48 83 ec 28 All code ======== 0: c0 e9 b6 shr $0xb6,%cl 3: fe (bad) 4: ff (bad) 5: ff 50 48 call *0x48(%rax) 8: 8d 3d 6e 08 0b 00 lea 0xb086e(%rip),%edi # 0xb087c e: e8 69 01 02 00 call 0x2017c 13: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 1a: 00 00 1c: 64 8b 04 25 18 00 00 mov %fs:0x18,%eax 23: 00 24: 85 c0 test %eax,%eax 26: 75 14 jne 0x3c 28: 0f 05 syscall 2a:* 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax <-- trapping instruction 30: 77 5a ja 0x8c 32: c3 ret 33: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 3a: 00 00 3c: 48 83 ec 28 sub $0x28,%rsp Code starting with the faulting instruction =========================================== 0: 48 3d 00 f0 ff ff cmp $0xfffffffffffff000,%rax 6: 77 5a ja 0x62 8: c3 ret 9: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1) 10: 00 00 12: 48 83 ec 28 sub $0x28,%rsp [ 109.633619][ C46] RSP: 002b:00007ffc78ab2778 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 109.643392][ C46] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f777e78d3ce [ 109.652686][ C46] RDX: 0000000000001000 RSI: 00005629c0f7c000 RDI: 0000000000000000 [ 109.661976][ C46] RBP: 00005629c0f7c000 R08: 00005629c0f7bd30 R09: 00007f777e870be0 [ 109.671251][ C46] R10: 00005629c0f7c000 R11: 0000000000000246 R12: 0000000000000000 [ 109.680528][ C46] R13: 0000000000000000 R14: 0000000000000000 R15: ffffffffffffffff [ 109.689808][ C46] </TASK> [ 109.693512][ C46] Kernel panic - not syncing: softlockup: hung tasks # mm/readahead.c 486 void page_cache_ra_order(struct readahead_control *ractl, 487 struct file_ra_state *ra, unsigned int new_order) 488 { 489 struct address_space *mapping = ractl->mapping; 490 pgoff_t index = readahead_index(ractl); 491 pgoff_t last = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT; 492 pgoff_t limit = index + ra->size; 493 pgoff_t mark = index + ra->size - ra->async_size; 494 int err = 0; 495 gfp_t gfp = readahead_gfp_mask(mapping); 496 497 if (!mapping_large_folio_support(mapping) || ra->size < 4) 498 goto fallback; 499 500 if (new_order < MAX_PAGECACHE_ORDER) { 501 new_order += 2; 502 if (new_order > MAX_PAGECACHE_ORDER) 503 new_order = MAX_PAGECACHE_ORDER; 504 while ((1 << new_order) > ra->size) 505 new_order--; 506 } 507 508 if (limit < index) 509 limit = ULONG_MAX; 510 511 filemap_invalidate_lock_shared(mapping); 512 while (index < limit) { 513 unsigned int order = new_order; 514 515 /* Align with smaller pages if needed */ 516 if (index & ((1UL << order) - 1)) 517 order = __ffs(index); 518 if (index + (1UL << order) == 0) 519 order--; 520 /* Don't allocate pages past EOF */ 521 while (index + (1UL << order) - 1 > last) 522 order--; 523 /* THP machinery does not support order-1 */ 524 if (order == 1) 525 order = 0; 526 err = ra_alloc_folio(ractl, index, mark, order, gfp); 527 if (err) 528 break; 529 index += 1UL << order; 530 } 531 532 if (index > limit) { 533 ra->size += index - limit - 1; 534 ra->async_size += index - limit - 1; 535 } 536 537 read_pages(ractl); 538 filemap_invalidate_unlock_shared(mapping); 539 540 /* 541 * If there were already pages in the page cache, then we may have 542 * left some gaps. Let the regular readahead code take care of this 543 * situation. 544 */ 545 if (!err) 546 return; 547 fallback: 548 do_page_cache_ra(ractl, ra->size, ra->async_size); 549 } Regards, Yujie > diff --git a/mm/readahead.c b/mm/readahead.c > index 130c0e7df99f..742e1f39035b 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -488,7 +488,8 @@ void page_cache_ra_order(struct readahead_control *ractl, > { > struct address_space *mapping = ractl->mapping; > pgoff_t index = readahead_index(ractl); > - pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT; > + pgoff_t last = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT; > + pgoff_t limit = index + ra->size; > pgoff_t mark = index + ra->size - ra->async_size; > int err = 0; > gfp_t gfp = readahead_gfp_mask(mapping); > @@ -496,23 +497,26 @@ void page_cache_ra_order(struct readahead_control *ractl, > if (!mapping_large_folio_support(mapping) || ra->size < 4) > goto fallback; > > - limit = min(limit, index + ra->size - 1); > - > if (new_order < MAX_PAGECACHE_ORDER) { > new_order += 2; > new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order); > new_order = min_t(unsigned int, new_order, ilog2(ra->size)); > } > > + if (limit < index) > + limit = ULONG_MAX; > filemap_invalidate_lock_shared(mapping); > - while (index <= limit) { > + while (index < limit) { > unsigned int order = new_order; > > /* Align with smaller pages if needed */ > if (index & ((1UL << order) - 1)) > order = __ffs(index); > + /* Avoid wrap */ > + if (index + (1UL << order) == 0) > + order--; > /* Don't allocate pages past EOF */ > - while (index + (1UL << order) - 1 > limit) > + while (index + (1UL << order) - 1 > last) > order--; > err = ra_alloc_folio(ractl, index, mark, order, gfp); > if (err)