ceph kernel client RIP when quota exceeded

Hi,

We experienced massive node failures when a user who had exceeded their CephFS quota submitted many jobs to a Slurm cluster; the home directories are on CephFS. The nodes keep working for some time, but they eventually freeze because too many CPUs get stuck.

Is this a bug in the kernel Ceph client? The nodes run kernel 5.10.123; the Ceph cluster is 16.2.9.
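
For anyone wanting to spot over-quota directories before jobs are submitted, here is a minimal sketch. It assumes the quota is set through the CephFS virtual xattr `ceph.quota.max_bytes` and that recursive usage can be read from `ceph.dir.rbytes`. The function name `quota_status` and the injectable `getxattr` parameter are hypothetical, chosen for illustration only. It inspects just the directory it is given, not ancestors that may carry the effective quota.

```python
import os


def quota_status(path, getxattr=None):
    """Return (used_bytes, max_bytes, exceeded) for one CephFS directory.

    max_bytes == 0 means no byte quota is set on this directory itself.
    `getxattr` is injectable so the logic can be exercised off-cluster.
    """
    if getxattr is None:
        getxattr = os.getxattr  # available on Linux only

    def read_int(name):
        try:
            raw = getxattr(path, name)
        except OSError:
            # Attribute missing: no quota set, or not a CephFS mount.
            return 0
        return int(raw.decode().strip() or 0)

    used = read_int("ceph.dir.rbytes")
    limit = read_int("ceph.quota.max_bytes")
    return used, limit, limit > 0 and used >= limit
```

A Slurm prolog or submit wrapper could call something like `quota_status("/home/someuser")` and refuse the job when the third element is true; whether that avoids the warnings below is untested.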

Best regards,
Andrej

2022-08-15T20:08:01+02:00 cn0539 kernel: ------------[ cut here ]------------
2022-08-15T20:08:01+02:00 cn0539 kernel: Attempt to access reserved inode number 0x101
2022-08-15T20:08:01+02:00 cn0539 kernel: WARNING: CPU: 172 PID: 4185848 at fs/ceph/super.h:547 __lookup_inode+0x161/0x180 [ceph]
2022-08-15T20:08:14+02:00 cn0539 kernel: Modules linked in: squashfs loop overlay fuse ceph libceph mgc(O) lustre(O) lmv(O) mdc(O) fid(O) lov(O) fld(O) osc(O) ko2iblnd(O) ptlrpc(O) obdclass(O) lnet(O) libcfs(O) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace nfs_ssc fscache rfkill ipmi_ssif nft_limit amd64_edac_mod edac_mce_amd amd_energy nft_ct kvm_amd nf_conntrack nf_defrag_ipv6 kvm nf_defrag_ipv4 irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl pcspkr nf_tables libcrc32c nfnetlink sp5100_tco ccp acpi_ipmi k10temp i2c_piix4 ipmi_si rdma_ucm(O) rdma_cm(O) iw_cm(O) acpi_cpufreq ib_ipoib(O) ib_cm(O) ib_umad(O) sunrpc vfat fat ext4 mbcache jbd2 mlx5_ib(O) ib_uverbs(O) ib_core(O) mlx5_core(O) mlxfw(O) pci_hyperv_intf crc32c_intel tls ahci nvme psample igb libahci mlxdevm(O) auxiliary(O) nvme_core i2c_algo_bit libata t10_pi dca mlx_compat(O) pinctrl_amd xpmem(O) ipmi_devintf ipmi_msghandler
2022-08-15T20:08:14+02:00 cn0539 kernel: CPU: 172 PID: 4185848 Comm: slurm_script Tainted: G        W  O      5.10.123-2.el8.x86_64 #1
2022-08-15T20:08:16+02:00 cn0539 kernel: Hardware name: To be filled by O.E.M. To be filled by O.E.M./CER, BIOS BIOS_RME090.22.37.001 10/05/2021
2022-08-15T20:08:17+02:00 cn0539 kernel: RIP: 0010:__lookup_inode+0x161/0x180 [ceph]
2022-08-15T20:08:18+02:00 cn0539 kernel: Code: dd 48 85 db 0f 85 27 ff ff ff 45 85 e4 0f 89 5d ff ff ff 49 63 ec e9 16 ff ff ff 48 89 de 48 c7 c7 58 bb 40 c1 e8 1e 21 d8 d0 <0f> 0b e9 3f ff ff ff e8 53 3d 01 00 eb c6 be 03 00 00 00 e8 97 a2
2022-08-15T20:08:21+02:00 cn0539 kernel: RSP: 0018:ffffb6d8de33fc18 EFLAGS: 00010286
2022-08-15T20:08:22+02:00 cn0539 kernel: RAX: 0000000000000000 RBX: 0000000000000101 RCX: 0000000000000027
2022-08-15T20:08:23+02:00 cn0539 kernel: RDX: 0000000000000027 RSI: ffff95f2afd207e0 RDI: ffff95f2afd207e8
2022-08-15T20:08:24+02:00 cn0539 kernel: RBP: ffff965345e568a0 R08: 0000000000000000 R09: c0000000fffeffff
2022-08-15T20:08:25+02:00 cn0539 kernel: R10: 0000000000000001 R11: ffffb6d8de33fa20 R12: ffff959e55081aa8
2022-08-15T20:08:27+02:00 cn0539 kernel: R13: ffff965345e568a8 R14: ffff9593ea333e00 R15: ffff959e55081a80
2022-08-15T20:08:28+02:00 cn0539 kernel: FS:  00007fbf7c8ba740(0000) GS:ffff95f2afd00000(0000) knlGS:0000000000000000
2022-08-15T20:08:29+02:00 cn0539 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2022-08-15T20:08:30+02:00 cn0539 kernel: CR2: 0000564324b8a588 CR3: 0000004d51150000 CR4: 0000000000150ee0
2022-08-15T20:08:31+02:00 cn0539 kernel: Call Trace:
2022-08-15T20:08:31+02:00 cn0539 kernel: ? __do_request+0x3f0/0x450 [ceph]
2022-08-15T20:08:32+02:00 cn0539 kernel: ceph_lookup_inode+0xa/0x30 [ceph]
2022-08-15T20:08:34+02:00 cn0539 kernel: lookup_quotarealm_inode.isra.9+0x188/0x210 [ceph]
2022-08-15T20:08:34+02:00 cn0539 kernel: check_quota_exceeded+0x1bc/0x220 [ceph]
2022-08-15T20:08:34+02:00 cn0539 kernel: ceph_write_iter+0x1bf/0xc90 [ceph]
2022-08-15T20:08:35+02:00 cn0539 kernel: ? path_openat+0x666/0x1050
2022-08-15T20:08:36+02:00 cn0539 kernel: ? __touch_cap+0x1f/0xd0 [ceph]
2022-08-15T20:08:36+02:00 cn0539 kernel: ? ptep_set_access_flags+0x23/0x30
2022-08-15T20:08:37+02:00 cn0539 kernel: ? wp_page_reuse+0x5f/0x70
2022-08-15T20:08:38+02:00 cn0539 kernel: ? new_sync_write+0x11f/0x1b0
2022-08-15T20:08:38+02:00 cn0539 kernel: new_sync_write+0x11f/0x1b0
2022-08-15T20:08:39+02:00 cn0539 kernel: vfs_write+0x1bd/0x270
2022-08-15T20:08:40+02:00 cn0539 kernel: ksys_write+0x59/0xd0
2022-08-15T20:08:40+02:00 cn0539 kernel: do_syscall_64+0x33/0x40
2022-08-15T20:08:41+02:00 cn0539 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
2022-08-15T20:08:41+02:00 cn0539 kernel: RIP: 0033:0x7fbf7bfc65a8
2022-08-15T20:08:42+02:00 cn0539 kernel: Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 f5 3f 2a 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
2022-08-15T20:08:45+02:00 cn0539 kernel: RSP: 002b:00007ffcc4ad6dd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
2022-08-15T20:08:46+02:00 cn0539 kernel: RAX: ffffffffffffffda RBX: 0000000000000417 RCX: 00007fbf7bfc65a8
2022-08-15T20:08:47+02:00 cn0539 kernel: RDX: 0000000000000417 RSI: 0000564324baa470 RDI: 0000000000000004
2022-08-15T20:08:48+02:00 cn0539 kernel: RBP: 0000564324baa470 R08: 0000000000000008 R09: 00224b5341545f52
2022-08-15T20:08:49+02:00 cn0539 kernel: R10: 0000000000000025 R11: 0000000000000246 R12: 0000564324b9cf50
2022-08-15T20:08:51+02:00 cn0539 kernel: R13: 0000000000000000 R14: 0000564324ba6200 R15: 0000564324b9cf50
2022-08-15T20:08:52+02:00 cn0539 kernel: ---[ end trace a655820d09b78154 ]---
2022-08-15T20:09:58+02:00 cn0539 kernel: mlx5_core 0000:61:00.0: mlx5_cmd_out_err:800:(pid 4155261): MAD_IFC(0x50d) op_mod(0x0) failed, status bad packet (discarded)(0x30), syndrome (0xea9eb5), err(-22)
2022-08-15T20:09:58+02:00 cn0539 kernel: mlx5_core 0000:61:00.0: mlx5_cmd_out_err:800:(pid 4155261): MAD_IFC(0x50d) op_mod(0x0) failed, status bad packet (discarded)(0x30), syndrome (0xea9eb5), err(-22)
2022-08-15T20:10:12+02:00 cn0539 kernel: ------------[ cut here ]------------
2022-08-15T20:10:12+02:00 cn0539 kernel: Attempt to access reserved inode number 0x101
2022-08-15T20:10:12+02:00 cn0539 kernel: WARNING: CPU: 78 PID: 14675 at fs/ceph/super.h:547 __lookup_inode+0x161/0x180 [ceph]
2022-08-15T20:10:26+02:00 cn0539 kernel: Modules linked in: squashfs loop overlay fuse ceph libceph mgc(O) lustre(O) lmv(O) mdc(O) fid(O) lov(O) fld(O) osc(O) ko2iblnd(O) ptlrpc(O) obdclass(O) lnet(O) libcfs(O) rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace nfs_ssc fscache rfkill ipmi_ssif nft_limit amd64_edac_mod edac_mce_amd amd_energy nft_ct kvm_amd nf_conntrack nf_defrag_ipv6 kvm nf_defrag_ipv4 irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel rapl pcspkr nf_tables libcrc32c nfnetlink sp5100_tco ccp acpi_ipmi k10temp i2c_piix4 ipmi_si rdma_ucm(O) rdma_cm(O) iw_cm(O) acpi_cpufreq ib_ipoib(O) ib_cm(O) ib_umad(O) sunrpc vfat fat ext4 mbcache jbd2 mlx5_ib(O) ib_uverbs(O) ib_core(O) mlx5_core(O) mlxfw(O) pci_hyperv_intf crc32c_intel tls ahci nvme psample igb libahci mlxdevm(O) auxiliary(O) nvme_core i2c_algo_bit libata t10_pi dca mlx_compat(O) pinctrl_amd xpmem(O) ipmi_devintf ipmi_msghandler
2022-08-15T20:10:26+02:00 cn0539 kernel: CPU: 78 PID: 14675 Comm: slurm_script Tainted: G        W  O      5.10.123-2.el8.x86_64 #1
2022-08-15T20:10:27+02:00 cn0539 kernel: Hardware name: To be filled by O.E.M. To be filled by O.E.M./CER, BIOS BIOS_RME090.22.37.001 10/05/2021
2022-08-15T20:10:29+02:00 cn0539 kernel: RIP: 0010:__lookup_inode+0x161/0x180 [ceph]
2022-08-15T20:10:30+02:00 cn0539 kernel: Code: dd 48 85 db 0f 85 27 ff ff ff 45 85 e4 0f 89 5d ff ff ff 49 63 ec e9 16 ff ff ff 48 89 de 48 c7 c7 58 bb 40 c1 e8 1e 21 d8 d0 <0f> 0b e9 3f ff ff ff e8 53 3d 01 00 eb c6 be 03 00 00 00 e8 97 a2
2022-08-15T20:10:33+02:00 cn0539 kernel: RSP: 0018:ffffb6d8d2ab7c18 EFLAGS: 00010286
2022-08-15T20:10:33+02:00 cn0539 kernel: RAX: 0000000000000000 RBX: 0000000000000101 RCX: 0000000000000027
2022-08-15T20:10:35+02:00 cn0539 kernel: RDX: 0000000000000027 RSI: ffff9632af9a07e0 RDI: ffff9632af9a07e8
2022-08-15T20:10:36+02:00 cn0539 kernel: RBP: ffff965345e568a0 R08: 0000000000000000 R09: c0000000fffeffff
2022-08-15T20:10:37+02:00 cn0539 kernel: R10: 0000000000000001 R11: ffffb6d8d2ab7a20 R12: ffff959e55081aa8
2022-08-15T20:10:38+02:00 cn0539 kernel: R13: ffff965345e568a8 R14: ffff9593f4994600 R15: ffff959e55081a80
2022-08-15T20:10:39+02:00 cn0539 kernel: FS:  00007f660e249740(0000) GS:ffff9632af980000(0000) knlGS:0000000000000000
2022-08-15T20:10:40+02:00 cn0539 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2022-08-15T20:10:41+02:00 cn0539 kernel: CR2: 000055d6b3db5588 CR3: 0000008a75ce8000 CR4: 0000000000150ee0
2022-08-15T20:10:42+02:00 cn0539 kernel: Call Trace:
2022-08-15T20:10:43+02:00 cn0539 kernel: ? __do_request+0x3f0/0x450 [ceph]
2022-08-15T20:10:43+02:00 cn0539 kernel: ceph_lookup_inode+0xa/0x30 [ceph]
2022-08-15T20:10:44+02:00 cn0539 kernel: lookup_quotarealm_inode.isra.9+0x188/0x210 [ceph]
2022-08-15T20:10:45+02:00 cn0539 kernel: check_quota_exceeded+0x1bc/0x220 [ceph]
2022-08-15T20:10:46+02:00 cn0539 kernel: ceph_write_iter+0x1bf/0xc90 [ceph]
2022-08-15T20:10:47+02:00 cn0539 kernel: ? path_openat+0x666/0x1050
2022-08-15T20:10:47+02:00 cn0539 kernel: ? __do_request+0x3f0/0x450 [ceph]
2022-08-15T20:10:48+02:00 cn0539 kernel: ? __ceph_put_cap_refs+0x30/0x380 [ceph]
2022-08-15T20:10:49+02:00 cn0539 kernel: ? ptep_set_access_flags+0x23/0x30
2022-08-15T20:10:49+02:00 cn0539 kernel: ? wp_page_reuse+0x5f/0x70
2022-08-15T20:10:50+02:00 cn0539 kernel: ? new_sync_write+0x11f/0x1b0
2022-08-15T20:10:51+02:00 cn0539 kernel: new_sync_write+0x11f/0x1b0
2022-08-15T20:10:51+02:00 cn0539 kernel: vfs_write+0x1bd/0x270
2022-08-15T20:10:52+02:00 cn0539 kernel: ksys_write+0x59/0xd0
2022-08-15T20:10:52+02:00 cn0539 kernel: do_syscall_64+0x33/0x40
2022-08-15T20:10:53+02:00 cn0539 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
2022-08-15T20:10:54+02:00 cn0539 kernel: RIP: 0033:0x7f660d9555a8
2022-08-15T20:10:54+02:00 cn0539 kernel: Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 f5 3f 2a 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
2022-08-15T20:10:57+02:00 cn0539 kernel: RSP: 002b:00007ffe2286c368 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
2022-08-15T20:10:58+02:00 cn0539 kernel: RAX: ffffffffffffffda RBX: 0000000000000417 RCX: 00007f660d9555a8
2022-08-15T20:10:59+02:00 cn0539 kernel: RDX: 0000000000000417 RSI: 000055d6b3dd5470 RDI: 0000000000000004
2022-08-15T20:11:01+02:00 cn0539 kernel: RBP: 000055d6b3dd5470 R08: 0000000000000008 R09: 00224b5341545f52
2022-08-15T20:11:02+02:00 cn0539 kernel: R10: 0000000000000025 R11: 0000000000000246 R12: 000055d6b3dc7f50
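
Both traces above report the same warning site. To see how often it fires across a node's log, a rough sketch like the following could tally the `WARNING:` lines by source location and symbol. The name `summarize_warnings` is hypothetical, and the regular expression assumes only the `WARNING: CPU: … PID: … at <file:line> <symbol>` layout visible in this log.

```python
import re
from collections import Counter

# Matches e.g. "WARNING: CPU: 172 PID: 4185848 at fs/ceph/super.h:547 __lookup_inode+0x161/0x180 [ceph]"
WARN_RE = re.compile(r"WARNING: CPU: (\d+) PID: (\d+) at (\S+) (\S+)")


def summarize_warnings(log_text):
    """Count kernel WARNING entries, keyed by (source location, symbol+offset)."""
    counts = Counter()
    for match in WARN_RE.finditer(log_text):
        _cpu, _pid, location, symbol = match.groups()
        counts[(location, symbol)] += 1
    return counts
```

Because it uses `finditer` on the whole text, it works whether the log has one entry per line or several entries run together.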


--
_____________________________________________________________
   prof. dr. Andrej Filipcic,   E-mail:Andrej.Filipcic@xxxxxx
   Department of Experimental High Energy Physics - F9
   Jozef Stefan Institute, Jamova 39, P.o.Box 3000
   SI-1001 Ljubljana, Slovenia
   Tel.: +386-1-477-3674    Fax: +386-1-477-3166
-------------------------------------------------------------
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



