Re: gather-facts triggers a kernel panic with centos stream kernel 4.18.0-358.el8.x86_64

Paul Cuzner <pcuzner@xxxxxxxxxx> · Wed, 19 Jan 2022 14:39:06 +1300

Hi Javier,
I can't reproduce this locally, so the server and its BIOS could be a factor.

gather-facts grabs data from sysfs (/sys/class/dmi/id), so we could start there.

Can you try running cat against the following entries in that path?

sys_vendor
product_family
product_name
bios_version
bios_date
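
If it's easier than running cat five times, a minimal Python sketch along these lines should read the same entries. The helper name read_dmi_entries is only illustrative (it is not part of cephadm), and it assumes nothing beyond the sysfs path and the entry names listed above:

```python
from pathlib import Path

# sysfs directory and entries taken from the list above
DMI_DIR = Path("/sys/class/dmi/id")
ENTRIES = ["sys_vendor", "product_family", "product_name",
           "bios_version", "bios_date"]


def read_dmi_entries(base=DMI_DIR, names=ENTRIES):
    """Read each named file under base; record the error text if a read fails."""
    results = {}
    for name in names:
        try:
            results[name] = (Path(base) / name).read_text().strip()
        except OSError as exc:
            # Keep going so one missing or unreadable entry doesn't hide the rest
            results[name] = f"<unreadable: {exc.strerror}>"
    return results


if __name__ == "__main__":
    for name, value in read_dmi_entries().items():
        print(f"{name}: {value}")
```

Reading the entries one at a time means that, if one of them is the trigger, the last line printed before the hang should point at the culprit.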

Thanks,
PC

On Tue, Jan 18, 2022 at 11:32 PM Javier Cacheiro
<Javier.Cacheiro@xxxxxxxxx> wrote:
>
> Hi all,
>
> I am pretty sure that this is a kernel issue related to CentOS Stream and
> probably the Dell PowerEdge C6420, but I want to let you know about it
> just in case someone is going to upgrade CentOS Stream to the latest kernel
> 4.18.0-358.el8.x86_64 and finds the same problem.
>
> Yesterday I was investigating a strange issue where, after upgrading the OS
> (CentOS Stream), the nodes were rebooting every 2 hours.
>
> Looking at the cephadm logs we got errors of the type:
>
> cephadm [ERR] Failed to execute command: /usr/bin/python3
> /var/lib/ceph/c6e89d30-de52-11eb-a76f-bc97e1e57d70/cephadm.f46dc95b01feeedb28941a48e2f1d0abb51139ca828de11150ea7122a8e3549c
> gather-facts
>
> And I was able to reproduce the issue just by running the Python program
> with gather-facts on one of the updated nodes:
>
> [root@c27-35 ~]# /usr/bin/python3 debug.py gather-facts
> > /root/debug.py(7004)command_gather_facts()
> -> host = HostFacts(ctx)
> (Pdb) s
>
>
> --Call--
> > /root/debug.py(6498)__init__()
> -> def __init__(self, ctx: CephadmContext):
> ....
> (Pdb) n
> > /root/debug.py(6509)__init__()
> -> self.arch: str = platform.processor()
> (Pdb) n
> > /root/debug.py(6510)__init__()
> -> self.kernel: str = platform.release()
> (Pdb) n
> --Return--
> > /root/debug.py(6510)__init__()->None
> -> self.kernel: str = platform.release()
> (Pdb)
> > /root/debug.py(7005)command_gather_facts()
> -> print(host.dump())
> (Pdb)
> client_loop: send disconnect: Broken pipe
>
> And when it reaches the host.dump() the server hangs with the following
> kernel panic:
>
> [  572.332036] BUG: unable to handle kernel paging request at
> 0000559e8860e740
> [  572.415388] PGD 1a62022067 P4D 1a62022067 PUD 1a62023067 PMD 1868b7a067
> PTE 80000018e6ae8867
> [  572.516372] Oops: 0003 [#1] SMP NOPTI
> [  572.560156] CPU: 41 PID: 8408 Comm: sysctl Kdump: loaded Tainted: G
>      I      --------- -  - 4.18.0-358.el8.x86_64 #1
> [  572.693381] Hardware name: Dell Inc. PowerEdge C6420/0YTVTT, BIOS 2.12.2
> 07/14/2021
> [  572.785007] RIP: 0010:memcpy_erms+0x6/0x10
> [  572.833991] Code: 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9
> 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1
> <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
> [  573.058737] RSP: 0018:ffffb2fd0d1ebe28 EFLAGS: 00010297
> [  573.121242] RAX: 0000559e8860e740 RBX: 0000000000000002 RCX:
> 0000000000000002
> [  573.206628] RDX: 0000000000000002 RSI: ffffb2fd0d1ebe37 RDI:
> 0000559e8860e740
> [  573.292013] RBP: ffffb2fd0d1ebf08 R08: 0000000000000000 R09:
> 0000000000000000
> [  573.377397] R10: ffffb2fd0d1ebe80 R11: ffffb2fd0d1ebe38 R12:
> ffffb2fd0d1ebe80
> [  573.462781] R13: 0000559e8860e740 R14: 0000000000000002 R15:
> ffffffffc14d6e00
> [  573.548168] FS:  00007fba08251940(0000) GS:ffff8b46df700000(0000)
> knlGS:0000000000000000
> [  573.644993] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  573.713737] CR2: 0000559e8860e740 CR3: 000000184f3e4002 CR4:
> 00000000007706e0
> [  573.799122] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [  573.884507] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [  573.969892] PKRU: 55555554
> [  574.002235] Call Trace:
> [  574.031462]  svcrdma_counter_handler+0xc1/0x110 [rpcrdma]
> [  574.096045]  proc_sys_call_handler+0x1a5/0x1c0
> [  574.149191]  vfs_read+0x91/0x140
> [  574.187773]  ksys_read+0x4f/0xb0
> [  574.226358]  do_syscall_64+0x5b/0x1a0
> [  574.270142]  entry_SYSCALL_64_after_hwframe+0x65/0xca
> [  574.330566] RIP: 0033:0x7fba0761b555
> [  574.373312] Code: fe ff ff 50 48 8d 3d 22 c9 06 00 e8 25 ed 01 00 0f 1f
> 44 00 00 f3 0f 1e fa 48 8d 05 45 40 2a 00 8b 00 85 c0 75 0f 31 c0 0f 05
> <48> 3d 00 f0 ff ff 77 53 c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89
>
> [  574.598058] RSP: 002b:00007ffdf1b482d8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000000
> [  574.688641] RAX: ffffffffffffffda RBX: 0000559e8860e190 RCX:
> 00007fba0761b555
> [  574.774027] RDX: 0000000000002000 RSI: 0000559e8860e740 RDI:
> 0000000000000006
> [  574.859412] RBP: 0000000000000d68 R08: 0000559e88610740 R09:
> 0000000000000003
> [  574.944798] R10: 0000000000000001 R11: 0000000000000246 R12:
> 0000000000002000
> [  575.030181] R13: 0000559e88610750 R14: 0000000000000000 R15:
> 0000000000000000
> [  575.115568] Modules linked in: joydev sch_fq binfmt_misc overlay 8021q
> garp mrp stp llc rpcrdma intel_rapl_msr intel_rapl_common sunrpc rdma_ucm
> ib_srpt ib_isert isst_if_common iscsi_target_mod target_core_mod
> ib_iser libiscsi scsi_transport_iscsi bonding skx_edac rdma_cm ib_umad nfit
> ib_ipoib iw_cm libnvdimm x86_pkg_temp_thermal intel_powerclamp ib_cm
> coretemp kvm_intel kvm dell_smbios irqbypass iTCO_wdt mlx5_ib
> crct10dif_pclmul crc32_pclmul iTCO_vendor_support dell_wmi_descriptor wmi_bmof
> ib_uverbs dcdbas ghash_clmulni_intel rapl mei_me intel_cstate i2c_i801
> lpc_ich pcspkr ib_core mei intel_uncore wmi ipmi_ssif acpi_power_meter
> vfat fat ip_vs ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 drm_kms_helper
> mlx5_core syscopyarea sysfillrect sysimgblt fb_sys_fops ahci mlxfw libahci
> pci_hyperv_intf drm tls megaraid_sas libata psample i2c_algo_bit
> openvswitch nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 libcrc32c
> crc32c_intel nf_defrag_ipv4 ipmi_si ipmi_devintf ipmi_msghandler fuse
> [  576.150586] CR2: 0000559e8860e740
>
> This is the affected version:
> [    0.000000] Linux version 4.18.0-358.el8.x86_64 (
> mockbuild@xxxxxxxxxxxxxxxxxxxxxxxx) (gcc version 8.5.0 20210514 (Red Hat
> 8.5.0-7) (GCC)) #1 SMP Mon Jan 10 13:11:20 UTC 2022
> [    0.000000] Command line: elfcorehdr=0x38000000
> BOOT_IMAGE=(hd0,gpt2)/boot/vmlinuz-4.18.0-358.el8.x86_64 ro
> resume=UUID=0de0c20d-9d0a-447e-981c-898b54be9f48 console=ttyS0 irqpoll
> nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off
> udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug
> transparent_hugepage=never nokaslr novmcoredd hest_disable
> disable_cpu_apicid=0 iTCO_wdt.pretimeout=0 trace_buf_size=1
>
> Everything returned to normal after going back to the previous kernel
> version (no need to downgrade any other package).
>
> I also verified that the 348 version of the kernel
> (4.18.0-348.2.1.el8_5.x86_64) works fine, so we are staying on that one
> for the moment.
>
> I hope this is useful to others who might experience the same problem.
>
> Kind regards,
> Javier
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
