Hi,
we experienced massive node failures after a user who had exceeded their
CephFS quota submitted many jobs to a Slurm cluster whose home
directories are on CephFS. The nodes keep working for some time, but
they eventually freeze because too many CPUs get stuck.
Is this a bug in the kernel Ceph client? The nodes run kernel 5.10.123,
and the Ceph cluster is 16.2.9.
Best regards,
Andrej
2022-08-15T20:08:01+02:00 cn0539 kernel: ------------[ cut here ]------------
2022-08-15T20:08:01+02:00 cn0539 kernel: Attempt to access reserved inode number 0x101
2022-08-15T20:08:01+02:00 cn0539 kernel: WARNING: CPU: 172 PID: 4185848 at fs/ceph/super.h:547 __lookup_inode+0x161/0x180 [ceph]
2022-08-15T20:08:14+02:00 cn0539 kernel: Modules linked in: squashfs
loop overlay fuse ceph libceph mgc(O) lustre(O) lmv(O) mdc(O) fid(O)
lov(O) fld(O) osc(O) ko2iblnd(O) ptlrpc(O) obdclass(O) lnet(O) libcfs(O)
rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace nfs_ssc
fscache rfkill ipmi_ssif nft_limit amd64_edac_mod edac_mce_amd
amd_energy nft_ct kvm_amd nf_conntrack
nf_defrag_ipv6 kvm nf_defrag_ipv4 irqbypass crct10dif_pclmul
crc32_pclmul ghash_clmulni_intel rapl pcspkr nf_tables libcrc32c
nfnetlink sp5100_tco ccp acpi_ipmi k10temp i2c_piix4 ipmi_si rdma_ucm(O)
rdma_cm(O) iw_cm(O) acpi_cpufreq ib_ipoib(O) ib_cm(O) ib_umad(O) sunrpc
vfat fat ext4 mbcache jbd2 mlx5_ib(O) ib_uverbs(O) ib_core(O)
mlx5_core(O) mlxfw(O) pci_hyperv_intf crc32c_intel
tls ahci nvme psample igb libahci mlxdevm(O) auxiliary(O) nvme_core
i2c_algo_bit libata t10_pi dca mlx_compat(O) pinctrl_amd xpmem(O)
ipmi_devintf ipmi_msghandler
2022-08-15T20:08:14+02:00 cn0539 kernel: CPU: 172 PID: 4185848 Comm:
slurm_script Tainted: G W O 5.10.123-2.el8.x86_64 #1
2022-08-15T20:08:16+02:00 cn0539 kernel: Hardware name: To be filled by
O.E.M. To be filled by O.E.M./CER, BIOS BIOS_RME090.22.37.001 10/05/2021
2022-08-15T20:08:17+02:00 cn0539 kernel: RIP:
0010:__lookup_inode+0x161/0x180 [ceph]
2022-08-15T20:08:18+02:00 cn0539 kernel: Code: dd 48 85 db 0f 85 27 ff
ff ff 45 85 e4 0f 89 5d ff ff ff 49 63 ec e9 16 ff ff ff 48 89 de 48 c7
c7 58 bb 40 c1 e8 1e 21 d8 d0 <0f> 0b e9 3f ff ff ff e8 53 3d 01 00 eb
c6 be 03 00 00 00 e8 97 a2
2022-08-15T20:08:21+02:00 cn0539 kernel: RSP: 0018:ffffb6d8de33fc18
EFLAGS: 00010286
2022-08-15T20:08:22+02:00 cn0539 kernel: RAX: 0000000000000000 RBX:
0000000000000101 RCX: 0000000000000027
2022-08-15T20:08:23+02:00 cn0539 kernel: RDX: 0000000000000027 RSI:
ffff95f2afd207e0 RDI: ffff95f2afd207e8
2022-08-15T20:08:24+02:00 cn0539 kernel: RBP: ffff965345e568a0 R08:
0000000000000000 R09: c0000000fffeffff
2022-08-15T20:08:25+02:00 cn0539 kernel: R10: 0000000000000001 R11:
ffffb6d8de33fa20 R12: ffff959e55081aa8
2022-08-15T20:08:27+02:00 cn0539 kernel: R13: ffff965345e568a8 R14:
ffff9593ea333e00 R15: ffff959e55081a80
2022-08-15T20:08:28+02:00 cn0539 kernel: FS: 00007fbf7c8ba740(0000)
GS:ffff95f2afd00000(0000) knlGS:0000000000000000
2022-08-15T20:08:29+02:00 cn0539 kernel: CS: 0010 DS: 0000 ES: 0000
CR0: 0000000080050033
2022-08-15T20:08:30+02:00 cn0539 kernel: CR2: 0000564324b8a588 CR3:
0000004d51150000 CR4: 0000000000150ee0
2022-08-15T20:08:31+02:00 cn0539 kernel: Call Trace:
2022-08-15T20:08:31+02:00 cn0539 kernel: ? __do_request+0x3f0/0x450 [ceph]
2022-08-15T20:08:32+02:00 cn0539 kernel: ceph_lookup_inode+0xa/0x30 [ceph]
2022-08-15T20:08:34+02:00 cn0539 kernel: lookup_quotarealm_inode.isra.9+0x188/0x210 [ceph]
2022-08-15T20:08:34+02:00 cn0539 kernel: check_quota_exceeded+0x1bc/0x220 [ceph]
2022-08-15T20:08:34+02:00 cn0539 kernel: ceph_write_iter+0x1bf/0xc90 [ceph]
2022-08-15T20:08:35+02:00 cn0539 kernel: ? path_openat+0x666/0x1050
2022-08-15T20:08:36+02:00 cn0539 kernel: ? __touch_cap+0x1f/0xd0 [ceph]
2022-08-15T20:08:36+02:00 cn0539 kernel: ? ptep_set_access_flags+0x23/0x30
2022-08-15T20:08:37+02:00 cn0539 kernel: ? wp_page_reuse+0x5f/0x70
2022-08-15T20:08:38+02:00 cn0539 kernel: ? new_sync_write+0x11f/0x1b0
2022-08-15T20:08:38+02:00 cn0539 kernel: new_sync_write+0x11f/0x1b0
2022-08-15T20:08:39+02:00 cn0539 kernel: vfs_write+0x1bd/0x270
2022-08-15T20:08:40+02:00 cn0539 kernel: ksys_write+0x59/0xd0
2022-08-15T20:08:40+02:00 cn0539 kernel: do_syscall_64+0x33/0x40
2022-08-15T20:08:41+02:00 cn0539 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
2022-08-15T20:08:41+02:00 cn0539 kernel: RIP: 0033:0x7fbf7bfc65a8
2022-08-15T20:08:42+02:00 cn0539 kernel: Code: 89 02 48 c7 c0 ff ff ff
ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 f5 3f 2a 00 8b 00 85
c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00
00 00 00 41 54 49 89 d4 55
2022-08-15T20:08:45+02:00 cn0539 kernel: RSP: 002b:00007ffcc4ad6dd8
EFLAGS: 00000246 ORIG_RAX: 0000000000000001
2022-08-15T20:08:46+02:00 cn0539 kernel: RAX: ffffffffffffffda RBX:
0000000000000417 RCX: 00007fbf7bfc65a8
2022-08-15T20:08:47+02:00 cn0539 kernel: RDX: 0000000000000417 RSI:
0000564324baa470 RDI: 0000000000000004
2022-08-15T20:08:48+02:00 cn0539 kernel: RBP: 0000564324baa470 R08:
0000000000000008 R09: 00224b5341545f52
2022-08-15T20:08:49+02:00 cn0539 kernel: R10: 0000000000000025 R11:
0000000000000246 R12: 0000564324b9cf50
2022-08-15T20:08:51+02:00 cn0539 kernel: R13: 0000000000000000 R14:
0000564324ba6200 R15: 0000564324b9cf50
2022-08-15T20:08:52+02:00 cn0539 kernel: ---[ end trace a655820d09b78154 ]---
2022-08-15T20:09:58+02:00 cn0539 kernel: mlx5_core 0000:61:00.0:
mlx5_cmd_out_err:800:(pid 4155261): MAD_IFC(0x50d) op_mod(0x0) failed,
status bad packet (discarded)(0x30), syndrome (0xea9eb5), err(-22)
2022-08-15T20:09:58+02:00 cn0539 kernel: mlx5_core 0000:61:00.0:
mlx5_cmd_out_err:800:(pid 4155261): MAD_IFC(0x50d) op_mod(0x0) failed,
status bad packet (discarded)(0x30), syndrome (0xea9eb5), err(-22)
2022-08-15T20:10:12+02:00 cn0539 kernel: ------------[ cut here ]------------
2022-08-15T20:10:12+02:00 cn0539 kernel: Attempt to access reserved inode number 0x101
2022-08-15T20:10:12+02:00 cn0539 kernel: WARNING: CPU: 78 PID: 14675 at fs/ceph/super.h:547 __lookup_inode+0x161/0x180 [ceph]
2022-08-15T20:10:26+02:00 cn0539 kernel: Modules linked in: squashfs
loop overlay fuse ceph libceph mgc(O) lustre(O) lmv(O) mdc(O) fid(O)
lov(O) fld(O) osc(O) ko2iblnd(O) ptlrpc(O) obdclass(O) lnet(O) libcfs(O)
rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace nfs_ssc
fscache rfkill ipmi_ssif nft_limit amd64_edac_mod edac_mce_amd
amd_energy nft_ct kvm_amd nf_conntrack
nf_defrag_ipv6 kvm nf_defrag_ipv4 irqbypass crct10dif_pclmul
crc32_pclmul ghash_clmulni_intel rapl pcspkr nf_tables libcrc32c
nfnetlink sp5100_tco ccp acpi_ipmi k10temp i2c_piix4 ipmi_si rdma_ucm(O)
rdma_cm(O) iw_cm(O) acpi_cpufreq ib_ipoib(O) ib_cm(O) ib_umad(O) sunrpc
vfat fat ext4 mbcache jbd2 mlx5_ib(O) ib_uverbs(O) ib_core(O)
mlx5_core(O) mlxfw(O) pci_hyperv_intf crc32c_intel
tls ahci nvme psample igb libahci mlxdevm(O) auxiliary(O) nvme_core
i2c_algo_bit libata t10_pi dca mlx_compat(O) pinctrl_amd xpmem(O)
ipmi_devintf ipmi_msghandler
2022-08-15T20:10:26+02:00 cn0539 kernel: CPU: 78 PID: 14675 Comm:
slurm_script Tainted: G W O 5.10.123-2.el8.x86_64 #1
2022-08-15T20:10:27+02:00 cn0539 kernel: Hardware name: To be filled by
O.E.M. To be filled by O.E.M./CER, BIOS BIOS_RME090.22.37.001 10/05/2021
2022-08-15T20:10:29+02:00 cn0539 kernel: RIP:
0010:__lookup_inode+0x161/0x180 [ceph]
2022-08-15T20:10:30+02:00 cn0539 kernel: Code: dd 48 85 db 0f 85 27 ff
ff ff 45 85 e4 0f 89 5d ff ff ff 49 63 ec e9 16 ff ff ff 48 89 de 48 c7
c7 58 bb 40 c1 e8 1e 21 d8 d0 <0f> 0b e9 3f ff ff ff e8 53 3d 01 00 eb
c6 be 03 00 00 00 e8 97 a2
2022-08-15T20:10:33+02:00 cn0539 kernel: RSP: 0018:ffffb6d8d2ab7c18
EFLAGS: 00010286
2022-08-15T20:10:33+02:00 cn0539 kernel: RAX: 0000000000000000 RBX:
0000000000000101 RCX: 0000000000000027
2022-08-15T20:10:35+02:00 cn0539 kernel: RDX: 0000000000000027 RSI:
ffff9632af9a07e0 RDI: ffff9632af9a07e8
2022-08-15T20:10:36+02:00 cn0539 kernel: RBP: ffff965345e568a0 R08:
0000000000000000 R09: c0000000fffeffff
2022-08-15T20:10:37+02:00 cn0539 kernel: R10: 0000000000000001 R11:
ffffb6d8d2ab7a20 R12: ffff959e55081aa8
2022-08-15T20:10:38+02:00 cn0539 kernel: R13: ffff965345e568a8 R14:
ffff9593f4994600 R15: ffff959e55081a80
2022-08-15T20:10:39+02:00 cn0539 kernel: FS: 00007f660e249740(0000)
GS:ffff9632af980000(0000) knlGS:0000000000000000
2022-08-15T20:10:40+02:00 cn0539 kernel: CS: 0010 DS: 0000 ES: 0000
CR0: 0000000080050033
2022-08-15T20:10:41+02:00 cn0539 kernel: CR2: 000055d6b3db5588 CR3:
0000008a75ce8000 CR4: 0000000000150ee0
2022-08-15T20:10:42+02:00 cn0539 kernel: Call Trace:
2022-08-15T20:10:43+02:00 cn0539 kernel: ? __do_request+0x3f0/0x450 [ceph]
2022-08-15T20:10:43+02:00 cn0539 kernel: ceph_lookup_inode+0xa/0x30 [ceph]
2022-08-15T20:10:44+02:00 cn0539 kernel: lookup_quotarealm_inode.isra.9+0x188/0x210 [ceph]
2022-08-15T20:10:45+02:00 cn0539 kernel: check_quota_exceeded+0x1bc/0x220 [ceph]
2022-08-15T20:10:46+02:00 cn0539 kernel: ceph_write_iter+0x1bf/0xc90 [ceph]
2022-08-15T20:10:47+02:00 cn0539 kernel: ? path_openat+0x666/0x1050
2022-08-15T20:10:47+02:00 cn0539 kernel: ? __do_request+0x3f0/0x450 [ceph]
2022-08-15T20:10:48+02:00 cn0539 kernel: ? __ceph_put_cap_refs+0x30/0x380 [ceph]
2022-08-15T20:10:49+02:00 cn0539 kernel: ? ptep_set_access_flags+0x23/0x30
2022-08-15T20:10:49+02:00 cn0539 kernel: ? wp_page_reuse+0x5f/0x70
2022-08-15T20:10:50+02:00 cn0539 kernel: ? new_sync_write+0x11f/0x1b0
2022-08-15T20:10:51+02:00 cn0539 kernel: new_sync_write+0x11f/0x1b0
2022-08-15T20:10:51+02:00 cn0539 kernel: vfs_write+0x1bd/0x270
2022-08-15T20:10:52+02:00 cn0539 kernel: ksys_write+0x59/0xd0
2022-08-15T20:10:52+02:00 cn0539 kernel: do_syscall_64+0x33/0x40
2022-08-15T20:10:53+02:00 cn0539 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
2022-08-15T20:10:54+02:00 cn0539 kernel: RIP: 0033:0x7f660d9555a8
2022-08-15T20:10:54+02:00 cn0539 kernel: Code: 89 02 48 c7 c0 ff ff ff
ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 f5 3f 2a 00 8b 00 85
c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00
00 00 00 41 54 49 89 d4 55
2022-08-15T20:10:57+02:00 cn0539 kernel: RSP: 002b:00007ffe2286c368
EFLAGS: 00000246 ORIG_RAX: 0000000000000001
2022-08-15T20:10:58+02:00 cn0539 kernel: RAX: ffffffffffffffda RBX:
0000000000000417 RCX: 00007f660d9555a8
2022-08-15T20:10:59+02:00 cn0539 kernel: RDX: 0000000000000417 RSI:
000055d6b3dd5470 RDI: 0000000000000004
2022-08-15T20:11:01+02:00 cn0539 kernel: RBP: 000055d6b3dd5470 R08:
0000000000000008 R09: 00224b5341545f52
2022-08-15T20:11:02+02:00 cn0539 kernel: R10: 0000000000000025 R11:
0000000000000246 R12: 000055d6b3dc7f50
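In case it helps anyone check whether other nodes are hitting the same warning, here is a minimal sketch of a log scan. It is only an illustration: the function name is made up, and it assumes the syslog layout shown above (timestamp, hostname, then "kernel:"). Whitespace is matched loosely so that the hard-wrapped lines of this mail are also recognised.

```python
import re
from collections import Counter

# Assumed layout: "<timestamp> <host> kernel: Attempt to access reserved
# inode number 0x...". \s+ also matches the line break that mail wrapping
# inserts in the middle of the message.
PATTERN = re.compile(
    r"(?P<ts>\S+)\s+(?P<host>\S+)\s+kernel:\s+"
    r"Attempt to access reserved\s+inode number\s+(?P<ino>0x[0-9a-fA-F]+)"
)


def count_reserved_inode_warnings(log_text):
    """Count the warnings per (host, inode number) in a block of log text."""
    counts = Counter()
    for match in PATTERN.finditer(log_text):
        counts[(match.group("host"), match.group("ino"))] += 1
    return counts
```

Feeding it the text of a node's kernel log (for example the contents of a syslog file read into a string) gives one counter entry per host and reserved inode number, which makes it easy to see how many nodes are affected before they freeze.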
--
_____________________________________________________________
prof. dr. Andrej Filipcic, E-mail:Andrej.Filipcic@xxxxxx
Department of Experimental High Energy Physics - F9
Jozef Stefan Institute, Jamova 39, P.o.Box 3000
SI-1001 Ljubljana, Slovenia
Tel.: +386-1-477-3674 Fax: +386-1-477-3166
-------------------------------------------------------------