Kernel Oops related to NFS client code?

Etienne Lessard <elessard97@xxxxxxxxx> · Tue, 9 Jul 2019 22:11:58 -0400

Hi linux-nfs,

I’m looking for a bit of help on an issue. We’ve had 6 kernel Oopses 
on 6 similar servers running Linux 4.14.81 (CoreOS 1911.4.0), a few 
days apart, and we’re scratching our heads over what could be the 
cause, because it all started suddenly. All the Oopses were the same:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
IP: _raw_spin_lock+0xc/0x20
PGD 800000180059b067 P4D 800000180059b067 PUD 16acf66067 PMD 0
Oops: 0002 [#1] SMP PTI
Modules linked in: tcp_diag udp_diag inet_diag ipt_REJECT nf_reject_ipv4 
xt_limit xt_set ip_set_hash_net ip_set ipt_MASQUERADE 
nf_nat_masquerade_ipv4 xt_comment xt_mark iptable_nat nf_conntrack_ipv4 
nf_defrag_ipv4 nf_nat_ipv4 veth nf_conntrack_netlink nfnetlink xfrm_user 
xfrm_algo iptable_filter xt_conntrack nf_nat nf_conntrack libcrc32c 
nfsv3 nfs_acl nfs lockd grace sunrpc fscache overlay 8021q garp mrp 
coretemp x86_pkg_temp_thermal ipmi_ssif kvm_intel iTCO_wdt 
iTCO_vendor_support kvm dcdbas irqbypass ipmi_si mei_me mousedev evdev 
i2c_i801 mei ipmi_devintf bridge stp llc ipmi_msghandler bonding 
pcc_cpufreq button sch_fq_codel nls_ascii nls_cp437 vfat fat dm_verity 
dm_bufio ext4 crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic 
usbhid hid sd_mod crc32c_intel aesni_intel igb ixgbe ahci i2c_algo_bit 
xhci_pci aes_x86_64 libahci i2c_core crypto_simd xhci_hcd cryptd hwmon 
libata glue_helper ptp usbcore pps_core mdio scsi_mod usb_common 
dm_mirror dm_region_hash dm_log dm_mod dax
CPU: 34 PID: 40199 Comm: asterisk Not tainted 4.14.81-coreos #1
Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.3.7 02/08/2018
task: ffff9516d79a0000 task.stack: ffffad77a1b4c000
RIP: 0010:_raw_spin_lock+0xc/0x20
RSP: 0018:ffffad77a1b4fd08 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000080
RBP: 0000000000000030 R08: ffff950d91064800 R09: ffff950782a0b080
R10: ffff950d2cd174a0 R11: 0000000000000001 R12: ffff9507585f9720
R13: ffffd91fa16533c0 R14: ffff950782a0b0c0 R15: ffff950782a0b080
FS:  00007fdb38c25700(0000) GS:ffff950d91040000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000080 CR3: 0000001553a86005 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 nfs_updatepage+0x6da/0x8e0 [nfs]
 ? nfs_flush_incompatible+0xf6/0x150 [nfs]
 nfs_lock+0x9de/0xd00 [nfs]
 ? iov_iter_copy_from_user_atomic+0xdf/0x2f0
 generic_perform_write+0xfc/0x1b0
 nfs_file_write+0xeb/0x340 [nfs]
 __vfs_write+0x101/0x160
 vfs_write+0xad/0x1a0
 SyS_write+0x52/0xc0
 do_syscall_64+0x67/0x120
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7fdbd7196c1d
RSP: 002b:00007fdb38c22e10 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fdbd7196c1d
RDX: 0000000000000004 RSI: 00007fd92800dd50 RDI: 0000000000000072
RBP: 00007fd92800dd50 R08: 00007fd928031810 R09: 00007fdb38c22f8c
R10: 0000000000000038 R11: 0000000000000293 R12: 00007fd928031730
R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000000
Code: 66 0f 1f 44 00 00 31 c0 ba ff 00 00 00 f0 0f b1 17 85 c0 74 05 e8 
b5 23 aa ff 48 89 d8 5b c3 0f 1f 44 00 00 31 c0 ba 01 00 00 00 <f0> 0f 
b1 17 85 c0 75 02 f3 c3 89 c6 e8 b3 0b aa ff 66 90 c3 0f
RIP: _raw_spin_lock+0xc/0x20 RSP: ffffad77a1b4fd08
CR2: 0000000000000080
---[ end trace b0d4430dcc4c0aa0 ]---
Kernel panic - not syncing: Fatal exception
Kernel Offset: 0xb000000 from 0xffffffff81000000 (relocation range: 
0xffffffff80000000-0xffffffffbfffffff)

The “nfs_lock+0x9de” entry is a bit misleading: it is actually the 
nfs_write_end function being called here.
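
As a first reading of the registers (a sketch, not a confirmed 
diagnosis): CR2 and RDI are both 0x80, and on x86-64 RDI carries the 
first function argument, so the lock pointer handed to _raw_spin_lock 
appears to have been 0x80 itself. That would be consistent with a lock 
field at offset 0x80 of a structure whose pointer was NULL, but we have 
not identified which structure. The small Python script below only 
re-checks that equality from an excerpt of the Oops; the excerpt 
variable and the helper name read_register are just for illustration.

```python
import re

# Excerpt of the Oops above: only the lines this sketch inspects.
OOPS_EXCERPT = """\
BUG: unable to handle kernel NULL pointer dereference at 0000000000000080
IP: _raw_spin_lock+0xc/0x20
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000080
CR2: 0000000000000080 CR3: 0000001553a86005 CR4: 00000000007606e0
"""


def read_register(oops_text, name):
    """Return the value of the first 'NAME: <16 hex digits>' field (hypothetical helper)."""
    match = re.search(r"\b%s: ([0-9a-f]{16})\b" % re.escape(name), oops_text)
    if match is None:
        raise ValueError("register %s not found" % name)
    return int(match.group(1), 16)


fault_address = read_register(OOPS_EXCERPT, "CR2")  # address that faulted
lock_pointer = read_register(OOPS_EXCERPT, "RDI")   # first argument on x86-64

# True means the faulting address is the lock pointer itself,
# i.e. the caller passed (NULL + 0x80) as the spinlock address.
lock_is_faulting_address = fault_address == lock_pointer
print(hex(fault_address), hex(lock_pointer), lock_is_faulting_address)
```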

We are still at the beginning of the troubleshooting and haven’t been 
able to reproduce the issue synthetically yet. I just wanted to ask 
whether anyone has an idea of what could be wrong here, or has already 
seen something similar, to help us troubleshoot in the right 
direction.

The NFS mount options used were: 
rw,nosuid,noatime,vers=3,rsize=32768,wsize=32768,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.1.1.1,mountvers=3,mountport=42471,mountproto=tcp,local_lock=none,addr=10.1.1.1.
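
In case it is easier to read or to compare across the 6 servers, here 
is a minimal sketch that splits the option string above into key/value 
pairs. The function name parse_mount_options is hypothetical; the only 
outside fact used is that nfs(5) documents timeo in tenths of a second.

```python
# Option string copied from the mount above (without the trailing period).
MOUNT_OPTIONS = (
    "rw,nosuid,noatime,vers=3,rsize=32768,wsize=32768,namlen=255,"
    "soft,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.1.1.1,"
    "mountvers=3,mountport=42471,mountproto=tcp,local_lock=none,"
    "addr=10.1.1.1"
)


def parse_mount_options(option_string):
    """Map 'key=value' items to strings and bare flags (e.g. 'soft') to True."""
    options = {}
    for item in option_string.split(","):
        key, separator, value = item.partition("=")
        options[key] = value if separator else True
    return options


options = parse_mount_options(MOUNT_OPTIONS)

# nfs(5): timeo is expressed in tenths of a second.
timeo_seconds = int(options["timeo"]) / 10
print(options["vers"], options["soft"], options["retrans"], timeo_seconds)
```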

Thanks