Re: kernel BUG at include/linux/ceph/decode.h:262

Hi Ilya and Jeff,

it happened again and this time we got a little bit more. I'm not sure it is complete though:

Feb 13 19:27:56 sn311 kernel: ------------[ cut here ]------------
Feb 13 19:27:56 sn311 kernel: kernel BUG at include/linux/ceph/decode.h:262!
Feb 13 19:27:56 sn311 kernel: invalid opcode: 0000 [#1] SMP
Feb 13 19:27:56 sn311 kernel: Modules linked in: squashfs loop overlay(T) 8021q garp mrp stp llc nfsv3 nfs_acl nfs lockd grace fscache beegfs(OE) ceph libceph libcrc32c dns_resolver rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) dcdbas mlx4_core(OE) amd64_edac_mod edac_mce_amd kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sg mgag200 ahci ttm libahci drm_kms_helper libata k10temp syscopyarea sysfillrect ipmi_si sysimgblt ipmi_devintf fb_sys_fops ipmi_msghandler drm drm_panel_orientation_quirks i2c_piix4 ccp acpi_power_meter acpi_cpufreq sunrpc knem(OE) ip_tables mlx5_ib(OE) megaraid_sas ib_uverbs(OE) ib_core(OE) mlx5_core(OE) mlxfw(OE) devlink mlx_compat(OE)
Feb 13 19:27:56 sn311 kernel: igb i2c_algo_bit ixgbe dca ptp pps_core mdio sd_mod crc_t10dif crct10dif_common
Feb 13 19:27:56 sn311 kernel: CPU: 22 PID: 127515 Comm: octave-cli Tainted: G           OE  ------------ T 3.10.0-957.12.2.el7.x86_64 #1
Feb 13 19:27:56 sn311 kernel: Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.8.7 04/02/2019
Feb 13 19:27:56 sn311 kernel: task: ffff9d43adee2080 ti: ffff9d43aeb60000 task.ti: ffff9d43aeb60000
Feb 13 19:27:56 sn311 kernel: RIP: 0010:[<ffffffffc09ddca8>]  [<ffffffffc09ddca8>] ceph_encode_filepath.part.26+0x4/0x6 [ceph]
Feb 13 19:27:56 sn311 kernel: RSP: 0018:ffff9d43aeb63ae8  EFLAGS: 00010293
Feb 13 19:27:56 sn311 kernel: RAX: ffff9d3f91f46748 RBX: ffff9d43ad4ac800 RCX: ffff9d3f91f4673b
Feb 13 19:27:56 sn311 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9d3f91f46725
Feb 13 19:27:56 sn311 kernel: RBP: ffff9d43aeb63ae8 R08: 0000000000000000 R09: 0000000000000000
Feb 13 19:27:56 sn311 kernel: R10: ffff9d34bfc075c0 R11: 000000000000002c R12: ffff9d43adc81c20
Feb 13 19:27:56 sn311 kernel: R13: 0000000000000000 R14: ffff9d4af08c2000 R15: ffff9d3f91f466c0
Feb 13 19:27:56 sn311 kernel: FS:  00002aaab4a24d00(0000) GS:ffff9d43afa80000(0000) knlGS:00000000f7bf7700
Feb 13 19:27:56 sn311 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 13 19:27:56 sn311 kernel: CR2: 00002aaaad488800 CR3: 000000106c85c000 CR4: 00000000003407e0
Feb 13 19:27:56 sn311 kernel: Call Trace:
Feb 13 19:27:56 sn311 kernel: [<ffffffffc09d6fab>] __prepare_send_request+0x7cb/0x830 [ceph]
Feb 13 19:27:56 sn311 kernel: [<ffffffffc09d7352>] __do_request+0x342/0x430 [ceph]
Feb 13 19:27:56 sn311 kernel: [<ffffffffc09d8a8d>] ceph_mdsc_do_request+0x9d/0x280 [ceph]

The log ends here. We don't really have many options with this deployment. These are slim HPC compute nodes and we can't change their configuration easily. I'm afraid we have to wait for an event that pushes more information to the logs.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Jeff Layton <jlayton@xxxxxxxxxx>
Sent: 01 February 2022 13:19:34
To: Frank Schilder; Ilya Dryomov
Cc: ceph-users
Subject: Re:  kernel BUG at include/linux/ceph/decode.h:262

Sounds good. We unfortunately can't tell much from the info below. You
may want to consider turning on kdump. You'll sacrifice a little RAM to
run it, but it would allow you to collect a core dump if the machine
crashes.

Also, fwiw, the RHEL7 ceph client is significantly behind where the
RHEL8 client is. If you're doing any sort of significant work with ceph,
you may want to migrate if you're able.

-- Jeff

On Tue, 2022-02-01 at 11:49 +0000, Frank Schilder wrote:
> Hi Ilya,
>
> I'm afraid this is all we have from the time of the crash:
>
> Jan 25 21:18:42 sn319 kernel: beegfs: enabling unsafe global rkey
> Jan 25 23:33:47 sn319 kernel: CPU: 12 PID: 123399 Comm: octave-cli Tainted: G           OE  ------------ T 3.10.0-957.12.2.el7.x86_64 #1
> Jan 25 23:33:47 sn319 kernel: ------------[ cut here ]------------
> Jan 25 23:33:47 sn319 kernel: Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.15.0 09/11/2020
> Jan 25 23:33:47 sn319 kernel: igb i2c_algo_bit ixgbe dca ptp pps_core mdio sd_mod crc_t10dif crct10dif_common
> Jan 25 23:33:47 sn319 kernel: invalid opcode: 0000 [#1] SMP
> Jan 25 23:33:47 sn319 kernel: kernel BUG at include/linux/ceph/decode.h:262!
> Jan 25 23:33:47 sn319 kernel: Modules linked in: squashfs loop overlay(T) 8021q garp mrp stp llc nfsv3 nfs_acl nfs lockd grace fscache beegfs(OE) ceph libceph libcrc32c dns_resolver rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) dcdbas mlx4_core(OE) amd64_edac_mod edac_mce_amd kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sg ahci libahci mgag200 ttm libata ccp i2c_piix4 drm_kms_helper k10temp syscopyarea sysfillrect sysimgblt fb_sys_fops drm ipmi_si ipmi_devintf ipmi_msghandler drm_panel_orientation_quirks acpi_power_meter acpi_cpufreq sunrpc knem(OE) ip_tables mlx5_ib(OE) megaraid_sas ib_uverbs(OE) ib_core(OE) mlx5_core(OE) mlxfw(OE) devlink mlx_compat(OE)
> Jan 29 15:33:18 sn319.hpc.ait.dtu.dk wwlogger: Running provision script: adhoc-pre
>
> This is a stateless deployment and the node crashed hard. syslog was not able to push more lines to the log server, either because the network went offline or because everything was stopped. Next time we see this problem, we will try to access a crashed node and hope we can pull out more information.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Ilya Dryomov <idryomov@xxxxxxxxx>
> Sent: 31 January 2022 17:16:01
> To: Frank Schilder
> Cc: ceph-users; Jeff Layton
> Subject: Re:  kernel BUG at include/linux/ceph/decode.h:262
>
> On Mon, Jan 31, 2022 at 5:07 PM Frank Schilder <frans@xxxxxx> wrote:
> >
> > Hi all,
> >
> > we observed server crashes, with these possibly related error messages showing up in the log:
> >
> > Jan 26 10:07:53 sn180 kernel: kernel BUG at include/linux/ceph/decode.h:262!
> > Jan 25 23:33:47 sn319 kernel: kernel BUG at include/linux/ceph/decode.h:262!
> > Jan 25 16:32:37 sn323 kernel: kernel BUG at include/linux/ceph/decode.h:262!
> > Jan 25 14:05:07 sn328 kernel: kernel BUG at include/linux/ceph/decode.h:262!
> > Jan 26 18:47:40 sn369 kernel: kernel BUG at include/linux/ceph/decode.h:262!
> > Jan 27 21:43:25 sn376 kernel: kernel BUG at include/linux/ceph/decode.h:262!
> > Jan 28 09:11:00 sn424 kernel: kernel BUG at include/linux/ceph/decode.h:262!
>
> The BUG appears to be
>
>     BUG_ON(*p + 1 + sizeof(ino) + sizeof(len) + len > end);
>
> in ceph_encode_filepath().
>
> >
> > The crash report says:
> >
> > Jan 25 23:33:47 sn319 kernel: ------------[ cut here ]------------
> > Jan 25 23:33:47 sn319 kernel: kernel BUG at include/linux/ceph/decode.h:262!
> > Jan 25 23:33:47 sn319 kernel: invalid opcode: 0000 [#1] SMP
> > Jan 25 23:33:47 sn319 kernel: Modules linked in: squashfs loop overlay(T) 8021q garp mrp stp llc nfsv3 nfs_acl nfs lockd grace fscache beegfs(OE) ceph libceph libcrc32c dns_resolver rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) dcdbas mlx4_core(OE) amd64_edac_mod edac_mce_amd kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sg ahci libahci mgag200 ttm libata ccp i2c_piix4 drm_kms_helper k10temp syscopyarea sysfillrect sysimgblt fb_sys_fops drm ipmi_si ipmi_devintf ipmi_msghandler drm_panel_orientation_quirks acpi_power_meter acpi_cpufreq sunrpc knem(OE) ip_tables mlx5_ib(OE) megaraid_sas ib_uverbs(OE) ib_core(OE) mlx5_core(OE) mlxfw(OE) devlink mlx_compat(OE)
> > Jan 25 23:33:47 sn319 kernel: igb i2c_algo_bit ixgbe dca ptp pps_core mdio sd_mod crc_t10dif crct10dif_common
> > Jan 25 23:33:47 sn319 kernel: CPU: 12 PID: 123399 Comm: octave-cli Tainted: G           OE  ------------ T 3.10.0-957.12.2.el7.x86_64 #1
> > Jan 25 23:33:47 sn319 kernel: Hardware name: Dell Inc. PowerEdge R7425/08V001, BIOS 1.15.0 09/11/2020
>
> What about a stack trace that should follow here?
>
> Thanks,
>
>                 Ilya
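
For readers following the thread, the check Ilya quotes can be illustrated with a small userspace sketch. It assumes the encoded filepath is laid out as the BUG_ON expression suggests: one version byte, the 8-byte ino, a 4-byte length, then the path bytes. Little-endian byte order is also an assumption here. The names filepath_encoded_size() and encode_filepath_checked() are hypothetical and are not the kernel's code; the sketch returns -1 where the kernel would hit BUG().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes needed for one encoded filepath, mirroring the BUG_ON expression:
 * 1 (version byte) + sizeof(ino) + sizeof(len) + len. */
static size_t filepath_encoded_size(const char *path)
{
    size_t len = path ? strlen(path) : 0;
    return 1 + sizeof(uint64_t) + sizeof(uint32_t) + len;
}

/* Hypothetical userspace stand-in for ceph_encode_filepath().
 * Returns 0 and advances *p on success; returns -1 and leaves *p
 * untouched when the buffer is too small (the BUG_ON case). */
static int encode_filepath_checked(unsigned char **p, const unsigned char *end,
                                   uint64_t ino, const char *path)
{
    size_t len = path ? strlen(path) : 0;
    size_t need = filepath_encoded_size(path);
    int i;

    assert(*p <= end);                 /* caller must pass a sane window */
    if (need > (size_t)(end - *p))
        return -1;                     /* the kernel would BUG() here */

    *(*p)++ = 1;                       /* version byte */
    for (i = 0; i < 8; i++)            /* ino, little-endian (assumed) */
        *(*p)++ = (unsigned char)(ino >> (8 * i));
    for (i = 0; i < 4; i++)            /* length, little-endian (assumed) */
        *(*p)++ = (unsigned char)((uint32_t)len >> (8 * i));
    if (len)
        memcpy(*p, path, len);         /* path bytes, no trailing NUL */
    *p += len;
    return 0;
}
```

With a 16-byte buffer, a 3-character path fits exactly (1 + 8 + 4 + 3 bytes); with one byte less, the sketch refuses to encode. In other words, the BUG fires when the space left in the request buffer is smaller than what the path needs at encode time.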

--
Jeff Layton <jlayton@xxxxxxxxxx>



