On Tue, Jul 2, 2024 at 5:17 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote: > > > > > On Jul 2, 2024, at 4:59 PM, David Wysochanski <dwysocha@xxxxxxxxxx> wrote: > > > > Sorry for the late reply, more below. > > > > On Wed, Jun 19, 2024 at 4:46 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote: > >> > >> Hi Dave- > >> > >> I'm testing pNFS SCSI layouts with tracing enabled on the > >> client, and I hit this BUG twice today: > >> > >> BUG: kernel NULL pointer dereference, address: 0000000000000000 > >> #PF: supervisor read access in kernel mode > >> #PF: error_code(0x0000) - not-present page > >> PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP PTI > >> CPU: 0 PID: 52230 Comm: fio Not tainted 6.10.0-rc4-g9f1e0e495093 #5 > >> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014 > >> RIP: 0010:nfs_folio_length+0x29/0x170 [nfs] > >> Code: 90 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b 07 a9 00 00 04 00 74 0c 48 8b 07 f6 c4 08 0f 85 89 00 00 00 48 8b 43 18 <48> 8b 00 48 8b 68 50 48 85 ed 7e 66 48 8b 03 a8 40 74 7c 4c 8b 6b > >> RSP: 0018:ffffb0800c2abb28 EFLAGS: 00010246 > >> RAX: 0000000000000000 RBX: ffffe5d8c44f4a00 RCX: ffffffffa8525000 > >> RDX: 0017ffffd0000028 RSI: 000000000000004e RDI: ffffe5d8c44f4a00 > >> RBP: ffff9871911501b8 R08: ffff987191150020 R09: 0000000000000003 > >> R10: 0000000000000006 R11: 0000000000000000 R12: ffffe5d8c44f4a00 > >> R13: ffffb0800c2abb60 R14: 0000000000000000 R15: ffffe5d8c44f4a00 > >> FS: 00007f1d25d6f080(0000) GS:ffff987277c00000(0000) knlGS:0000000000000000 > >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >> CR2: 0000000000000000 CR3: 000000013cce0005 CR4: 0000000000170ef0 > >> Call Trace: > >> <TASK> > >> ? __die_body.cold+0x19/0x27 > >> ? page_fault_oops+0xca/0x2a0 > >> ? search_module_extables+0x19/0x60 > >> ? search_bpf_extables+0x5f/0x80 > >> ? exc_page_fault+0x7e/0x180 > >> ? asm_exc_page_fault+0x26/0x30 > >> ? nfs_folio_length+0x29/0x170 [nfs] > >> trace_event_raw_event_nfs_folio_event_done+0xdc/0x170 [nfs] > >> nfs_read_folio+0x16d/0x260 [nfs] > >> ? __pfx_nfs_read_folio+0x10/0x10 [nfs] > >> filemap_read_folio+0x43/0xe0 > >> filemap_fault+0x70d/0xd00 > >> __do_fault+0x35/0x120 > >> do_fault+0xbb/0x470 > >> __handle_mm_fault+0x7e9/0x1060 > >> ? mt_find+0x21c/0x570 > >> handle_mm_fault+0xf0/0x300 > >> do_user_addr_fault+0x217/0x620 > >> exc_page_fault+0x7e/0x180 > >> asm_exc_page_fault+0x26/0x30 > >> RIP: 0033:0x7f1d25ecedb7 > >> Code: 7e 6f 44 16 e0 48 29 fe 48 83 e1 e0 48 01 ce 0f 1f 40 00 c5 fe 6f 4e 60 c5 fe 6f 56 40 c5 fe 6f 5e 20 c5 fe 6f 26 48 83 c6 80 <c5> fd 7f 49 60 c5 fd 7f 51 40 c5 fd 7f 59 20 c5 fd 7f 21 48 83 c1 > >> RSP: 002b:00007ffd92df7f58 EFLAGS: 00010203 > >> RAX: 00007f1d1d0e6000 RBX: 000056211867f6c0 RCX: 00007f1d1d0e6f60 > >> RDX: 0000000000001000 RSI: 0000562118699ee0 RDI: 00007f1d1d0e6000 > >> RBP: 00007ffd92df7f70 R08: 0000000000000000 R09: 0000000000000000 > >> R10: 0000562118688f30 R11: 0000000000000001 R12: 00007f1d1d16aac0 > >> R13: 00007ffd92df8030 R14: 0000000000000001 R15: 0000000000001000 > >> </TASK> > >> Modules linked in: blocklayoutdriver rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_c> > >> CR2: 0000000000000000 > >> ---[ end trace 0000000000000000 ]--- > >> RIP: 0010:nfs_folio_length+0x29/0x170 [nfs] > >> Code: 90 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b 07 a9 00 00 04 00 74 0c 48 8b 07 f6 c4 08 0f 85 89 00 00 00 48 8b 43 18 <48> 8b 00 48 8b 68 50 48 85 ed 7e 66 48 8b 03 a8 40 74 7c 4c 8b 6b > >> RSP: 0018:ffffb0800c2abb28 EFLAGS: 00010246 > >> RAX: 0000000000000000 RBX: ffffe5d8c44f4a00 RCX: ffffffffa8525000 > >> RDX: 0017ffffd0000028 RSI: 000000000000004e RDI: ffffe5d8c44f4a00 > >> RBP: ffff9871911501b8 R08: ffff987191150020 R09: 0000000000000003 > >> R10: 0000000000000006 R11: 0000000000000000 R12: ffffe5d8c44f4a00 > >> R13: ffffb0800c2abb60 R14: 0000000000000000 R15: ffffe5d8c44f4a00 > >> FS: 00007f1d25d6f080(0000) GS:ffff987277c00000(0000) knlGS:0000000000000000 > >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >> CR2: 0000000000000000 CR3: 000000013cce0005 CR4: 0000000000170ef0 > >> > >> cel@renoir:linux$ scripts/faddr2line fs/nfs/nfs.ko nfs_folio_length+0x29 > >> nfs_folio_length+0x29/0x170: > >> nfs_folio_length at /home/cel/src/kdevops/cel/linux/fs/nfs/internal.h:824 (discriminator 1) > >> > >> Which in my version of the source code is: > >> > >> 822 static inline size_t nfs_folio_length(struct folio *folio) > >> 823 { > >> 824 loff_t i_size = i_size_read(folio_file_mapping(folio)->host); > >> > > > > Ok so looks like we need to go deeper to know what the NULL pointer > > represented in the BUG statement. > > Decoding the oops: > > $ ~/git/kernel/scripts/decodecode < ./oops.txt > > Code: 90 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b 07 a9 00 > > 00 04 00 74 0c 48 8b 07 f6 c4 08 0f 85 89 00 00 00 48 8b 43 18 <48> 8b > > 00 48 8b 68 50 48 85 ed 7e 66 48 8b 03 a8 40 74 7c 4c 8b 6b > > All code > > ======== > > 0: 90 nop > > 1: 41 56 push %r14 > > 3: 41 55 push %r13 > > 5: 41 54 push %r12 > > 7: 55 push %rbp > > 8: 53 push %rbx > > 9: 48 89 fb mov %rdi,%rbx > > c: 48 83 ec 08 sub $0x8,%rsp > > 10: 48 8b 07 mov (%rdi),%rax > > 13: a9 00 00 04 00 test $0x40000,%eax > > 18: 74 0c je 0x26 > > 1a: 48 8b 07 mov (%rdi),%rax > > 1d: f6 c4 08 test $0x8,%ah > > 20: 0f 85 89 00 00 00 jne 0xaf > > 26: 48 8b 43 18 mov 0x18(%rbx),%rax > > 2a:* 48 8b 00 mov (%rax),%rax <-- trapping instruction > > 2d: 48 8b 68 50 mov 0x50(%rax),%rbp > > 31: 48 85 ed test %rbp,%rbp > > 34: 7e 66 jle 0x9c > > 36: 48 8b 03 mov (%rbx),%rax > > 39: a8 40 test $0x40,%al > > 3b: 74 7c je 0xb9 > > 3d: 4c rex.WR > > 3e: 8b .byte 0x8b > > 3f: 6b .byte 0x6b > > > > Code starting with the faulting instruction > > =========================================== > > 0: 48 8b 00 mov (%rax),%rax > > 3: 48 8b 68 50 mov 0x50(%rax),%rbp > > 7: 48 85 ed test %rbp,%rbp > > a: 7e 66 jle 0x72 > > c: 48 8b 03 mov (%rbx),%rax > > f: a8 40 test $0x40,%al > > 11: 74 7c je 0x8f > > 13: 4c rex.WR > > 14: 8b .byte 0x8b > > 15: 6b .byte 0x6b > > > > > > Snip out the essential instructions, which shows we just have read an > > address at an offset of 0x18, then that must have been NULL and we > > tried to dereference NULL which led to the BUG: > > 9: 48 89 fb mov %rdi,%rbx > > ... > > 26: 48 8b 43 18 mov 0x18(%rbx),%rax > > 2a:* 48 8b 00 mov (%rax),%rax <-- trapping instruction > > > > So the question is what is at offset 0x18 in the above code in > > nfs_folio_length()? > > Now folio_file_mapping() returns an address_space pointer ,and > > address_space->host is at offset 0. > > So if 'mapping' is at offset 0x18 of folio, then mapping is NULL. And > > from the code, this seems to fit. > > > > 310 struct folio { > > 311 /* private: don't document the anon union */ > > 312 union { > > 313 struct { > > 314 /* public: */ > > 315 unsigned long flags; > > 316 union { > > 317 struct list_head lru; > > 318 /* private: avoid cluttering the output */ > > 319 struct { > > 320 void *__filler; > > 321 /* public: */ > > 322 unsigned int mlock_count; > > 323 /* private: */ > > 324 }; > > 325 /* public: */ > > 326 }; > > 327 struct address_space *mapping; > > 328 pgoff_t index; > > > > > > 445 static inline struct address_space *folio_file_mapping(struct folio *folio) > > 446 { > > 447 if (unlikely(folio_test_swapcache(folio))) > > 448 return swapcache_mapping(folio); > > 449 > > 450 return folio->mapping; > > 451 } > > > > So it seems folio->mapping == NULL and that is why we BUG. > > > > > >> I'm pinging you first about this because you changed the > >> code around the trace_nfs_aop_read_done() call site in > >> commit 000dbe0bec05 ("NFS: Convert buffered read paths to > >> use netfs when fscache is enabled"). > >> > > > > If I have the above analysis right, then maybe this patch would fix it? > > Also I wonder if nfs_page_length() might have the same issue? > > > > diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h > > index 9f0f4534744b..832f07994b88 100644 > > --- a/fs/nfs/internal.h > > +++ b/fs/nfs/internal.h > > @@ -821,7 +821,12 @@ unsigned int nfs_page_length(struct page *page) > > */ > > static inline size_t nfs_folio_length(struct folio *folio) > > { > > - loff_t i_size = i_size_read(folio_file_mapping(folio)->host); > > + loff_t i_size; > > + > > + if (!folio->mapping) > > + return 0; > > + > > + i_size = i_size_read(folio_file_mapping(folio)->host); > > > > if (i_size > 0) { > > pgoff_t index = folio_index(folio) >> folio_order(folio); > > Execution passed trace_nfs_aop_readpage() without incident, > so folio->mapping was a valid pointer as we entered > nfs_read_folio(). > > What I don't understand is whether you expect folio->mapping > to become NULL while nfs_read_folio() is running. > Was your test running both reads and truncates on the same file in parallel? Once the read completes I believe the page is unlocked. Then if truncates are running in parallel, then I think this can happen (folio->mapping = NULL as a result of truncate). > > -- > Chuck Lever > >