gather-facts triggers a kernel panic with centos stream kernel 4.18.0-358.el8.x86_64

Hi all,

I am fairly sure that this is a kernel issue related to CentOS Stream and
probably to the Dell PowerEdge C6420, but I want to let you know about it
in case someone upgrades CentOS Stream to the latest kernel,
4.18.0-358.el8.x86_64, and hits the same problem.

Yesterday I was investigating a strange issue: after upgrading the OS
(CentOS Stream), the nodes were rebooting every 2 hours.

The cephadm logs showed errors of this type:

cephadm [ERR] Failed to execute command: /usr/bin/python3
/var/lib/ceph/c6e89d30-de52-11eb-a76f-bc97e1e57d70/cephadm.f46dc95b01feeedb28941a48e2f1d0abb51139ca828de11150ea7122a8e3549c
gather-facts

I was able to reproduce the issue just by running the Python program with
gather-facts on one of the upgraded nodes:

[root@c27-35 ~]# /usr/bin/python3 debug.py gather-facts
> /root/debug.py(7004)command_gather_facts()
-> host = HostFacts(ctx)
(Pdb) s


--Call--
> /root/debug.py(6498)__init__()
-> def __init__(self, ctx: CephadmContext):
....
(Pdb) n
> /root/debug.py(6509)__init__()
-> self.arch: str = platform.processor()
(Pdb) n
> /root/debug.py(6510)__init__()
-> self.kernel: str = platform.release()
(Pdb) n
--Return--
> /root/debug.py(6510)__init__()->None
-> self.kernel: str = platform.release()
(Pdb)
> /root/debug.py(7005)command_gather_facts()
-> print(host.dump())
(Pdb)
client_loop: send disconnect: Broken pipe

When it reaches host.dump(), the server hangs with the following
kernel panic:

[  572.332036] BUG: unable to handle kernel paging request at
0000559e8860e740
[  572.415388] PGD 1a62022067 P4D 1a62022067 PUD 1a62023067 PMD 1868b7a067
PTE 80000018e6ae8867
[  572.516372] Oops: 0003 [#1] SMP NOPTI
[  572.560156] CPU: 41 PID: 8408 Comm: sysctl Kdump: loaded Tainted: G
     I      --------- -  - 4.18.0-358.el8.x86_64 #1
[  572.693381] Hardware name: Dell Inc. PowerEdge C6420/0YTVTT, BIOS 2.12.2
07/14/2021
[  572.785007] RIP: 0010:memcpy_erms+0x6/0x10
[  572.833991] Code: 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9
03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1
<f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
[  573.058737] RSP: 0018:ffffb2fd0d1ebe28 EFLAGS: 00010297
[  573.121242] RAX: 0000559e8860e740 RBX: 0000000000000002 RCX:
0000000000000002
[  573.206628] RDX: 0000000000000002 RSI: ffffb2fd0d1ebe37 RDI:
0000559e8860e740
[  573.292013] RBP: ffffb2fd0d1ebf08 R08: 0000000000000000 R09:
0000000000000000
[  573.377397] R10: ffffb2fd0d1ebe80 R11: ffffb2fd0d1ebe38 R12:
ffffb2fd0d1ebe80
[  573.462781] R13: 0000559e8860e740 R14: 0000000000000002 R15:
ffffffffc14d6e00
[  573.548168] FS:  00007fba08251940(0000) GS:ffff8b46df700000(0000)
knlGS:0000000000000000
[  573.644993] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  573.713737] CR2: 0000559e8860e740 CR3: 000000184f3e4002 CR4:
00000000007706e0
[  573.799122] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[  573.884507] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[  573.969892] PKRU: 55555554
[  574.002235] Call Trace:
[  574.031462]  svcrdma_counter_handler+0xc1/0x110 [rpcrdma]
[  574.096045]  proc_sys_call_handler+0x1a5/0x1c0
[  574.149191]  vfs_read+0x91/0x140
[  574.187773]  ksys_read+0x4f/0xb0
[  574.226358]  do_syscall_64+0x5b/0x1a0
[  574.270142]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[  574.330566] RIP: 0033:0x7fba0761b555
[  574.373312] Code: fe ff ff 50 48 8d 3d 22 c9 06 00 e8 25 ed 01 00 0f 1f
44 00 00 f3 0f 1e fa 48 8d 05 45 40 2a 00 8b 00 85 c0 75 0f 31 c0 0f 05
<48> 3d 00 f0 ff ff 77 53 c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89
[  574.598058] RSP: 002b:00007ffdf1b482d8 EFLAGS: 00000246 ORIG_RAX:
0000000000000000
[  574.688641] RAX: ffffffffffffffda RBX: 0000559e8860e190 RCX:
00007fba0761b555
[  574.774027] RDX: 0000000000002000 RSI: 0000559e8860e740 RDI:
0000000000000006
[  574.859412] RBP: 0000000000000d68 R08: 0000559e88610740 R09:
0000000000000003
[  574.944798] R10: 0000000000000001 R11: 0000000000000246 R12:
0000000000002000
[  575.030181] R13: 0000559e88610750 R14: 0000000000000000 R15:
0000000000000000
[  575.115568] Modules linked in: joydev sch_fq binfmt_misc overlay 8021q
garp mrp stp llc rpcrdma intel_rapl_msr intel_rapl_common sunrpc rdma_ucm
ib_srpt ib_isert isst_if_common iscsi_target_mod target_core_mod
ib_iser libiscsi scsi_transport_iscsi bonding skx_edac rdma_cm ib_umad nfit
ib_ipoib iw_cm libnvdimm x86_pkg_temp_thermal intel_powerclamp ib_cm
coretemp kvm_intel kvm dell_smbios irqbypass iTCO_wdt mlx5_ib
crct10dif_pclmul crc32_pclmul iTCO_vendor_support dell_wmi_descriptor wmi_bmof
ib_uverbs dcdbas ghash_clmulni_intel rapl mei_me intel_cstate i2c_i801
lpc_ich pcspkr ib_core mei intel_uncore wmi ipmi_ssif acpi_power_meter
vfat fat ip_vs ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 drm_kms_helper
mlx5_core syscopyarea sysfillrect sysimgblt fb_sys_fops ahci mlxfw libahci
pci_hyperv_intf drm tls megaraid_sas libata psample i2c_algo_bit
openvswitch nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 libcrc32c
crc32c_intel nf_defrag_ipv4 ipmi_si ipmi_devintf ipmi_msghandler fuse
[  576.150586] CR2: 0000559e8860e740
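
Reading the trace above: the faulting process is sysctl (Comm: sysctl), the fault is in memcpy_erms, and the first frame in the call trace is svcrdma_counter_handler in the rpcrdma module, reached through a read on a /proc/sys entry (proc_sys_call_handler, vfs_read). A minimal Python sketch for pulling those fields out of a mail-wrapped oops follows. The function name summarize_oops and its regular expressions are illustrative assumptions, not part of cephadm or of any kernel tooling, and they are only checked against the log format shown here.

```python
import re

# Short excerpt of the oops above, kept with its mail line wrapping.
SAMPLE = """\
[  572.560156] CPU: 41 PID: 8408 Comm: sysctl Kdump: loaded Tainted: G
     I      --------- -  - 4.18.0-358.el8.x86_64 #1
[  572.785007] RIP: 0010:memcpy_erms+0x6/0x10
[  574.002235] Call Trace:
[  574.031462]  svcrdma_counter_handler+0xc1/0x110 [rpcrdma]
[  574.096045]  proc_sys_call_handler+0x1a5/0x1c0
"""


def summarize_oops(text):
    """Hypothetical helper: extract process, faulting symbol, and top frame."""
    # Collapse all whitespace so fields split by mail wrapping become contiguous.
    flat = " ".join(text.split())
    comm = re.search(r"Comm: (\S+)", flat)
    # The kernel-mode RIP line uses code segment 0010 in this log.
    rip = re.search(r"RIP: 0010:(\w+)", flat)
    # First frame after "Call Trace:", with an optional "[module]" suffix.
    frame = re.search(
        r"Call Trace: \[\s*[\d.]+\] (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?",
        flat,
    )
    return {
        "comm": comm.group(1) if comm else None,
        "rip": rip.group(1) if rip else None,
        "frame": frame.group(1) if frame else None,
        "module": frame.group(2) if frame else None,
    }


if __name__ == "__main__":
    print(summarize_oops(SAMPLE))
```

On the excerpt this reports sysctl, memcpy_erms, svcrdma_counter_handler and rpcrdma, which is consistent with the panic being triggered by a sysctl read handled by rpcrdma rather than by the Python code itself.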

This is the affected version:
[    0.000000] Linux version 4.18.0-358.el8.x86_64 (
mockbuild@xxxxxxxxxxxxxxxxxxxxxxxx) (gcc version 8.5.0 20210514 (Red Hat
8.5.0-7) (GCC)) #1 SMP Mon Jan 10 13:11:20 UTC 2022
[    0.000000] Command line: elfcorehdr=0x38000000
BOOT_IMAGE=(hd0,gpt2)/boot/vmlinuz-4.18.0-358.el8.x86_64 ro
resume=UUID=0de0c20d-9d0a-447e-981c-898b54be9f48 console=ttyS0 irqpoll
nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off
udev.children-max=2 panic=10
rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr
novmcoredd hest_disable disable_cpu_apicid=0 iTCO_wdt.pretimeout=0
trace_buf_size=1

Everything returned to normal after going back to the previous kernel
version (there was no need to downgrade any other package).

I also verified that the 348 version of the kernel
(4.18.0-348.2.1.el8_5.x86_64) works fine, so we are staying on that one for
the moment.
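
To check which of the two kernels a node is running before calling gather-facts, something like the following Python sketch could be used. It relies on platform.release(), the same call that appears in the debugger session above. The names AFFECTED, KNOWN_GOOD and classify_kernel are made up for illustration, and the two sets contain only the builds named in this report; any other kernel is reported as unknown rather than guessed at.

```python
import platform

# Only the two builds mentioned in this report; nothing else was tested.
AFFECTED = {"4.18.0-358.el8.x86_64"}
KNOWN_GOOD = {"4.18.0-348.2.1.el8_5.x86_64"}


def classify_kernel(release):
    """Hypothetical helper: classify a kernel release string from this report."""
    if release in AFFECTED:
        return "affected"
    if release in KNOWN_GOOD:
        return "known-good"
    return "unknown"


if __name__ == "__main__":
    running = platform.release()
    print(running, classify_kernel(running))
```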

I hope this is useful to others who might run into the same problem.

Kind regards,
Javier
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


