Re: [regression] nfsstat/nfsd crash system "general protection fault, probably for non-canonical address ..." after 6.8.9->6.8.10 update

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[CCing Greg and stable and regressions list]

Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
for once, to make this easily accessible to everyone.

Please correct me if I'm wrong, but since 6.8.10 we afaics have below
regression report about a general protection fault as well as two
reports about NFSd related NULL pointer dereferences[1] that got some
attention[2]. From a quick look I saw no fixes for those queued for the
next 6.8.y release. And the series might be EOL soon.

Hmmm. That sounds not really good. Is there some easy way to fix this?

Chuck, you earlier mentioned (see quote below) that Greg pulled in three
changes as dep into 6.8.10 that might be unneeded. Might reverting all
three be the best way forward?

Ciao, Thorsten

[1]
https://lore.kernel.org/all/CAK2bqVJoT3yy2m0OmTnqH9EAKkj6O1iTk42EyyMtvvxKh6YDKg@xxxxxxxxxxxxxx/
and https://lore.kernel.org/all/A8DQDS.ZXN0FMYZ3DIM1@xxxxxxxxx/

[2] https://x.com/spendergrsec/status/1793489498443252143

On 24.05.24 20:21, Chuck Lever III wrote:
>> On May 24, 2024, at 4:59 AM, Jaroslav Pulchart <jaroslav.pulchart@xxxxxxxxxxxx> wrote:
>>
>>>
>>>>
>>>> On Wed, May 22, 2024 at 04:36:57AM -0400, Jaroslav Pulchart wrote:
>>>>> Hello,
>>>>>
>>>>> I would like to report some issue causing a "general protection fault"
>>>>> crash (constantly) after we updated the kernel from 6.8.9 to 6.8.10.
>>>>> This is triggered when monitoring is using nfsstat on a server where
>>>>> nfsd is running.
>>>>>
>>>>> [ 3049.260633] general protection fault, probably for non-canonical
>>>>> address 0x66fb103e19e9cc89: 0000 [#1] PREEMPT SMP NOPTI
>>>>> [ 3049.261628] CPU: 22 PID: 74991 Comm: nfsstat Tainted: G
>>>>> E      6.8.10-1.gdc.el9.x86_64 #1
>>>>> [ 3049.262336] Hardware name: RDO OpenStack Compute/RHEL, BIOS
>>>>> edk2-20240214-2.el9 02/14/2024
>>>>> [ 3049.263003] RIP: 0010:_raw_spin_lock_irqsave+0x19/0x40
>>>>> [ 3049.263487] Code: cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90
>>>>> 90 0f 1f 44 00 00 41 54 9c 41 5c fa 65 ff 05 a6 92 f5 42 31 c0 ba 01
>>>>> 00 00 00 <f0> 0f b1 17 75 0a 4c 89 e0 41 5c c3 cc cc cc cc 89 c6 e8 d0
>>>>> 07 00
>>>>> [ 3049.264882] RSP: 0018:ffffb1bca6b9bd00 EFLAGS: 00010046
>>>>> [ 3049.265365] RAX: 0000000000000000 RBX: 66fb103e19e9c989 RCX: 0000000000000001
>>>>> [ 3049.265953] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 66fb103e19e9cc89
>>>>> [ 3049.266542] RBP: ffffffffc15df280 R08: 0000000000000001 R09: ffffa049a1785cb8
>>>>> [ 3049.267112] R10: ffffb1bca6b9bd70 R11: ffffa04964e49000 R12: 0000000000000246
>>>>> [ 3049.267702] R13: 66fb103e19e9cc89 R14: ffffa048445590a0 R15: 0000000000000001
>>>>> [ 3049.268278] FS:  00007fa3ddf03740(0000) GS:ffffa05703d00000(0000)
>>>>> knlGS:0000000000000000
>>>>> [ 3049.268928] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [ 3049.269443] CR2: 00007fa3dddfca50 CR3: 0000000342d1e004 CR4: 0000000000770ef0
>>>>> [ 3049.270025] PKRU: 55555554
>>>>> [ 3049.270371] Call Trace:
>>>>> [ 3049.270723]  <TASK>
>>>>> [ 3049.271035]  ? die_addr+0x33/0x90
>>>>> [ 3049.271423]  ? exc_general_protection+0x1ea/0x450
>>>>> [ 3049.271879]  ? asm_exc_general_protection+0x22/0x30
>>>>> [ 3049.272344]  ? _raw_spin_lock_irqsave+0x19/0x40
>>>>> [ 3049.272803]  __percpu_counter_sum+0xd/0x70
>>>>> [ 3049.273219]  nfsd_show+0x4f/0x1d0 [nfsd]
>>>>> [ 3049.273666]  seq_read_iter+0x11d/0x4d0
>>>>> [ 3049.274073]  ? avc_has_perm+0x42/0xc0
>>>>> [ 3049.274489]  seq_read+0xfe/0x140
>>>>> [ 3049.274866]  proc_reg_read+0x56/0xa0
>>>>> [ 3049.275257]  vfs_read+0xa7/0x340
>>>>> [ 3049.275647]  ? __do_sys_newfstat+0x57/0x60
>>>>> [ 3049.276059]  ksys_read+0x5f/0xe0
>>>>> [ 3049.276439]  do_syscall_64+0x5e/0x170
>>>>> [ 3049.276836]  entry_SYSCALL_64_after_hwframe+0x78/0x80
>>>>> [ 3049.277296] RIP: 0033:0x7fa3ddcfd9b2
>>>>> [ 3049.277719] Code: c0 e9 b2 fe ff ff 50 48 8d 3d ea 1d 0c 00 e8 c5
>>>>> fd 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75
>>>>> 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89
>>>>> 54 24
>>>>> [ 3049.279139] RSP: 002b:00007ffd930672e8 EFLAGS: 00000246 ORIG_RAX:
>>>>> 0000000000000000
>>>>> [ 3049.279788] RAX: ffffffffffffffda RBX: 0000555ded47c2a0 RCX: 00007fa3ddcfd9b2
>>>>> [ 3049.280402] RDX: 0000000000000400 RSI: 0000555ded47c480 RDI: 0000000000000003
>>>>> [ 3049.281046] RBP: 00007fa3dddf75e0 R08: 0000000000000003 R09: 0000000000000077
>>>>> [ 3049.281673] R10: 000000000000005d R11: 0000000000000246 R12: 0000555ded47c2a0
>>>>> [ 3049.282307] R13: 0000000000000d68 R14: 00007fa3dddf69e0 R15: 0000000000000d68
>>>>> [ 3049.282928]  </TASK>
>>>>> [ 3049.283310] Modules linked in: mptcp_diag(E) xsk_diag(E)
>>>>> raw_diag(E) unix_diag(E) af_packet_diag(E) netlink_diag(E) udp_diag(E)
>>>>> tcp_diag(E) inet_diag(E) tun(E) br_netfilter(E) bridge(E) stp(E)
>>>>> llc(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E)
>>>>> nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) binfmt_misc(E)
>>>>> zram(E) tls(E) isofs(E) vfat(E) fat(E) intel_rapl_msr(E)
>>>>> intel_rapl_common(E) kvm_amd(E) ccp(E) kvm(E) irqbypass(E)
>>>>> virtio_net(E) i2c_i801(E) virtio_gpu(E) i2c_smbus(E) net_failover(E)
>>>>> virtio_balloon(E) failover(E) virtio_dma_buf(E) fuse(E) ext4(E)
>>>>> mbcache(E) jbd2(E) sr_mod(E) cdrom(E) sg(E) ahci(E) libahci(E)
>>>>> crct10dif_pclmul(E) crc32_pclmul(Ea) polyval_clmulni(E)
>>>>> polyval_generic(E) libata(E) ghash_clmulni_intel(E) sha512_ssse3(E)
>>>>> virtio_blk(E) serio_raw(E) btrfs(E) xor(E) zstd_compress(E)
>>>>> raid6_pq(E) libcrc32c(E) crc32c_intel(E) dm_mirror(E)
>>>>> dm_region_hash(E) dm_log(E) dm_mod(E)
>>>>> [ 3049.283345] Unloaded tainted modules: edac_mce_amd(E):1 padlock_aes(E)
>>>>>
>>>>> Any suggestion on how to fix it is appreciated.
>>>>
>>>> Bisect between v6.8.9 and v6.8.10 would give us the exact point
>>>> where the failures were introduced.
>>>>
>>>> I see that GregKH pulled in:
>>>>
>>>> 26a0ddb04230 ("nfsd: rename NFSD_NET_* to NFSD_STATS_*")
>>>> b7b05f98f3f0 ("nfsd: expose /proc/net/sunrpc/nfsd in net namespaces")
>>>> abf5fb593c90 ("nfsd: make all of the nfsd stats per-network namespace")
>>>>
>>>> for v6.8.10 as a Stable-Dep-of: 18180a4550d0 ("NFSD: Fix nfsd4_encode_fattr4() crasher")
>>>>
>>>> Which is a little baffling, I don't see how those two change sets
>>>> are mechanically related to each other. But I suspect the culprit is
>>>> one of those three stat-related patches.
>>>>
>>>>
>>>> --
>>>> Chuck Lever
>>>
>>>
>>> Hello,
>>>
>>> I run bisecting. It was easy to reproduce, simple execution of
>>> "nfsstat" from terminal stuck the server:
>>>
>>> abf5fb593c90d3ab55d6cf1dea7bec8ee0bf3566 is the first bad commit
>>>
>>>
>>> $ git bisect bad
>>> abf5fb593c90d3ab55d6cf1dea7bec8ee0bf3566 is the first bad commit
>>> commit abf5fb593c90d3ab55d6cf1dea7bec8ee0bf3566 (HEAD)
>>> Author: Josef Bacik <josef@xxxxxxxxxxxxxx>
>>> Date:   Fri Jan 26 10:39:47 2024 -0500
>>>
>>>    nfsd: make all of the nfsd stats per-network namespace
>>>
>>>    [ Upstream commit 4b14885411f74b2b0ce0eb2b39d0fffe54e5ca0d ]
>>>
>>>    We have a global set of counters that we modify for all of the nfsd
>>>    operations, but now that we're exposing these stats across all network
>>>    namespaces we need to make the stats also be per-network namespace.  We
>>>    already have some caching stats that are per-network namespace, so move
>>>    these definitions into the same counter and then adjust all the helpers
>>>    and users of these stats to provide the appropriate nfsd_net struct so
>>>    that the stats are maintained for the per-network namespace objects.
>>>
>>>    Signed-off-by: Josef Bacik <josef@xxxxxxxxxxxxxx>
>>>    Reviewed-by: Jeff Layton <jlayton@xxxxxxxxxx>
>>>    Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>>>    Stable-dep-of: 18180a4550d0 ("NFSD: Fix nfsd4_encode_fattr4() crasher")
>>>    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
>>>
>>> fs/nfsd/cache.h     |  2 --
>>> fs/nfsd/netns.h     | 17 +++++++++++++++--
>>> fs/nfsd/nfs4proc.c  |  6 +++---
>>> fs/nfsd/nfs4state.c |  3 ++-
>>> fs/nfsd/nfscache.c  | 36 +++++++-----------------------------
>>> fs/nfsd/nfsctl.c    | 12 +++---------
>>> fs/nfsd/nfsfh.c     |  3 ++-
>>> fs/nfsd/stats.c     | 26 ++++++++++++++------------
>>> fs/nfsd/stats.h     | 54 +++++++++++++++++++-----------------------------------
>>> fs/nfsd/vfs.c       |  6 ++++--
>>> 10 files changed, 69 insertions(+), 96 deletions(-)
>>>
>>> $ git bisect log
>>> git bisect start
>>> # status: waiting for both good and bad commits
>>> # good: [f3d61438b613b87afb63118bea6fb18c50ba7a6b] Linux 6.8.9
>>> git bisect good f3d61438b613b87afb63118bea6fb18c50ba7a6b
>>> # status: waiting for bad commit, 1 good commit known
>>> # bad: [a0c69a570e420e86c7569b8c052913213eef2b45] Linux 6.8.10
>>> git bisect bad a0c69a570e420e86c7569b8c052913213eef2b45
>>> # bad: [4aaed9dbe8acd2b6114458f0498a617283d6275b] hv_netvsc: Don't
>>> free decrypted memory
>>> git bisect bad 4aaed9dbe8acd2b6114458f0498a617283d6275b
>>> # bad: [ee190d04c2f99c8e557b00e997621c04592baed1] net: gro: add flush
>>> check in udp_gro_receive_segment
>>> git bisect bad ee190d04c2f99c8e557b00e997621c04592baed1
>>> # bad: [781e34b736014188ba9e46a71535237313dcda81] efi/unaccepted:
>>> touch soft lockup during memory accept
>>> git bisect bad 781e34b736014188ba9e46a71535237313dcda81
>>> # bad: [6a7b07689af6e4e023404bf69b1230f43b2a15bc] NFSD: Fix
>>> nfsd4_encode_fattr4() crasher
>>> git bisect bad 6a7b07689af6e4e023404bf69b1230f43b2a15bc
>>> # good: [e05194baae299f2148ab5f6bab659c6ce8d1f6d3] nfs: expose
>>> /proc/net/sunrpc/nfs in net namespaces
>>> git bisect good e05194baae299f2148ab5f6bab659c6ce8d1f6d3
>>> # good: [946ab150335d92f852288c1c6b0f0466b5d6e97f] power: supply:
>>> mt6360_charger: Fix of_match for usb-otg-vbus regulator
>>> git bisect good 946ab150335d92f852288c1c6b0f0466b5d6e97f
>>> # good: [b7b05f98f3f06fea3986b46e5c7fe2928676b02d] nfsd: expose
>>> /proc/net/sunrpc/nfsd in net namespaces
>>> git bisect good b7b05f98f3f06fea3986b46e5c7fe2928676b02d
>>> # bad: [0e8003af77879572dbc1df56860cbe2bfa8498f0] NFSD: add support
>>> for CB_GETATTR callback
>>> git bisect bad 0e8003af77879572dbc1df56860cbe2bfa8498f0
>>> # bad: [abf5fb593c90d3ab55d6cf1dea7bec8ee0bf3566] nfsd: make all of
>>> the nfsd stats per-network namespace
>>> git bisect bad abf5fb593c90d3ab55d6cf1dea7bec8ee0bf3566
>>> # first bad commit: [abf5fb593c90d3ab55d6cf1dea7bec8ee0bf3566] nfsd:
>>> make all of the nfsd stats per-network namespace
>>
>> I built a full 6.8.10 with reverted single commit
>> "abf5fb593c90d3ab55d6cf1dea7bec8ee0bf3566". The server does not get
>> stuck when calling "nfsstat".
> 
> Good to know, but I don't think it's entirely safe to revert
> only that patch -- all three would have to come off.
> 
> I can't seem to get nfsstat to trigger a problem on my
> server.
> 
> 
> --
> Chuck Lever
> 
> 




[Index of Archives]     [Linux Kernel]     [Kernel Development Newbies]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite Hiking]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux