Hi Javier,

I can't reproduce this locally, so the server and its BIOS could be a factor.

gather-facts reads data from sysfs (/sys/class/dmi/id), so we could start there. Can you try running cat against the following entries in that path?

sys_vendor
product_family
product_name
bios_version
bios_date

Thanks,
PC

On Tue, Jan 18, 2022 at 11:32 PM Javier Cacheiro <Javier.Cacheiro@xxxxxxxxx> wrote:
>
> Hi all,
>
> I am pretty sure that this is a kernel issue related to CentOS Stream and
> probably the Dell PowerEdge C6420, but I want to let you know about it
> just in case someone upgrades CentOS Stream to the latest kernel,
> 4.18.0-358.el8.x86_64, and runs into the same problem.
>
> Yesterday I was investigating a strange issue where, after upgrading the OS
> (CentOS Stream), the nodes were rebooting every 2 hours.
>
> Looking at the cephadm logs, we found errors of this type:
>
> cephadm [ERR] Failed to execute command: /usr/bin/python3 /var/lib/ceph/c6e89d30-de52-11eb-a76f-bc97e1e57d70/cephadm.f46dc95b01feeedb28941a48e2f1d0abb51139ca828de11150ea7122a8e3549c gather-facts
>
> And I was able to reproduce the issue just by running the Python program
> with gather-facts on one of the updated nodes:
>
> [root@c27-35 ~]# /usr/bin/python3 debug.py gather-facts
> > /root/debug.py(7004)command_gather_facts()
> -> host = HostFacts(ctx)
> (Pdb) s
> --Call--
> > /root/debug.py(6498)__init__()
> -> def __init__(self, ctx: CephadmContext):
> ....
> (Pdb) n
> > /root/debug.py(6509)__init__()
> -> self.arch: str = platform.processor()
> (Pdb) n
> > /root/debug.py(6510)__init__()
> -> self.kernel: str = platform.release()
> (Pdb) n
> --Return--
> > /root/debug.py(6510)__init__()->None
> -> self.kernel: str = platform.release()
> (Pdb)
> > /root/debug.py(7005)command_gather_facts()
> -> print(host.dump())
> (Pdb)
> client_loop: send disconnect: Broken pipe
>
> And when it reaches host.dump(), the server hangs with the following
> kernel panic:
>
> [ 572.332036] BUG: unable to handle kernel paging request at 0000559e8860e740
> [ 572.415388] PGD 1a62022067 P4D 1a62022067 PUD 1a62023067 PMD 1868b7a067 PTE 80000018e6ae8867
> [ 572.516372] Oops: 0003 [#1] SMP NOPTI
> [ 572.560156] CPU: 41 PID: 8408 Comm: sysctl Kdump: loaded Tainted: G I --------- - - 4.18.0-358.el8.x86_64 #1
> [ 572.693381] Hardware name: Dell Inc. PowerEdge C6420/0YTVTT, BIOS 2.12.2 07/14/2021
> [ 572.785007] RIP: 0010:memcpy_erms+0x6/0x10
> [ 572.833991] Code: 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38 fe
> [ 573.058737] RSP: 0018:ffffb2fd0d1ebe28 EFLAGS: 00010297
> [ 573.121242] RAX: 0000559e8860e740 RBX: 0000000000000002 RCX: 0000000000000002
> [ 573.206628] RDX: 0000000000000002 RSI: ffffb2fd0d1ebe37 RDI: 0000559e8860e740
> [ 573.292013] RBP: ffffb2fd0d1ebf08 R08: 0000000000000000 R09: 0000000000000000
> [ 573.377397] R10: ffffb2fd0d1ebe80 R11: ffffb2fd0d1ebe38 R12: ffffb2fd0d1ebe80
> [ 573.462781] R13: 0000559e8860e740 R14: 0000000000000002 R15: ffffffffc14d6e00
> [ 573.548168] FS: 00007fba08251940(0000) GS:ffff8b46df700000(0000) knlGS:0000000000000000
> [ 573.644993] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 573.713737] CR2: 0000559e8860e740 CR3: 000000184f3e4002 CR4: 00000000007706e0
> [ 573.799122] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 573.884507] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 573.969892] PKRU: 55555554
> [ 574.002235] Call Trace:
> [ 574.031462] svcrdma_counter_handler+0xc1/0x110 [rpcrdma]
> [ 574.096045] proc_sys_call_handler+0x1a5/0x1c0
> [ 574.149191] vfs_read+0x91/0x140
> [ 574.187773] ksys_read+0x4f/0xb0
> [ 574.226358] do_syscall_64+0x5b/0x1a0
> [ 574.270142] entry_SYSCALL_64_after_hwframe+0x65/0xca
> [ 574.330566] RIP: 0033:0x7fba0761b555
> [ 574.373312] Code: fe ff ff 50 48 8d 3d 22 c9 06 00 e8 25 ed 01 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 45 40 2a 00 8b 00 85 c0 75 0f 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 53 c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89
> [ 574.598058] RSP: 002b:00007ffdf1b482d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [ 574.688641] RAX: ffffffffffffffda RBX: 0000559e8860e190 RCX: 00007fba0761b555
> [ 574.774027] RDX: 0000000000002000 RSI: 0000559e8860e740 RDI: 0000000000000006
> [ 574.859412] RBP: 0000000000000d68 R08: 0000559e88610740 R09: 0000000000000003
> [ 574.944798] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000002000
> [ 575.030181] R13: 0000559e88610750 R14: 0000000000000000 R15: 0000000000000000
> [ 575.115568] Modules linked in: joydev sch_fq binfmt_misc overlay 8021q garp mrp stp llc rpcrdma intel_rapl_msr intel_rapl_common sunrpc rdma_ucm ib_srpt ib_isert isst_if_common iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi bonding skx_edac rdma_cm ib_umad nfit ib_ipoib iw_cm libnvdimm x86_pkg_temp_thermal intel_powerclamp ib_cm coretemp kvm_intel kvm dell_smbios irqbypass iTCO_wdt mlx5_ib crct10dif_pclmul crc32_pclmul iTCO_vendor_support dell_wmi_descriptor wmi_bmof ib_uverbs dcdbas ghash_clmulni_intel rapl mei_me intel_cstate i2c_i801 lpc_ich pcspkr ib_core mei intel_uncore wmi ipmi_ssif acpi_power_meter vfat fat ip_vs ext4 mbcache jbd2 sd_mod t10_pi sg mgag200 drm_kms_helper mlx5_core syscopyarea sysfillrect sysimgblt fb_sys_fops ahci mlxfw libahci pci_hyperv_intf drm tls megaraid_sas libata psample i2c_algo_bit openvswitch nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 libcrc32c crc32c_intel nf_defrag_ipv4 ipmi_si ipmi_devintf ipmi_msghandler fuse
> [ 576.150586] CR2: 0000559e8860e740
>
> This is the affected version:
> [ 0.000000] Linux version 4.18.0-358.el8.x86_64 (mockbuild@xxxxxxxxxxxxxxxxxxxxxxxx) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-7) (GCC)) #1 SMP Mon Jan 10 13:11:20 UTC 2022
> [ 0.000000] Command line: elfcorehdr=0x38000000 BOOT_IMAGE=(hd0,gpt2)/boot/vmlinuz-4.18.0-358.el8.x86_64 ro resume=UUID=0de0c20d-9d0a-447e-981c-898b54be9f48 console=ttyS0 irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug transparent_hugepage=never nokaslr novmcoredd hest_disable disable_cpu_apicid=0 iTCO_wdt.pretimeout=0 trace_buf_size=1
>
> Everything went back to normal after reverting to the previous kernel
> version (no need to downgrade any other package).
>
> I also verified that the 348 version of the kernel
> (4.18.0-348.2.1.el8_5.x86_64) works fine, so we have left that one in
> place for the moment.
>
> I hope this is useful to others who may run into the same problem.
>
> Kind regards,
> Javier
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx