Hi all,
Tonight a node of our 5-node cluster had an issue with an XFS filestore
OSD. The kernel started logging crash dumps and didn't stop until there
was no free disk space left, which affected the entire node.
The space was eaten by kernel logs in /var/log/kern.log,
/var/log/messages and /var/log/syslog, EACH of which was taking more
than 20GB this morning.
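(As an aside, and only as an untested sketch: on these nodes those files
are written by rsyslog and rotated by logrotate's daily cron job, so a
flood like this can outrun rotation. Adding a size trigger to the stock
Debian stanza, plus a cap on the journal, should at least limit the damage:

# /etc/logrotate.d/rsyslog  (simplified example, add "maxsize" to the existing stanza)
/var/log/syslog
/var/log/kern.log
/var/log/messages
{
        rotate 4
        maxsize 1G
        missingok
        notifempty
        compress
        delaycompress
        sharedscripts
        postrotate
                /usr/lib/rsyslog/rsyslog-rotate
        endscript
}

# /etc/systemd/journald.conf
[Journal]
SystemMaxUse=2G

Note that "maxsize" only helps if logrotate also runs more often than the
default daily cron, e.g. from an hourly timer.)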
syslog:
Jan 24 22:04:06 sanmarko kernel: [389213.340613] vmbr23: port
2(tap112i0) entered blocking state
Jan 24 22:04:06 sanmarko kernel: [389213.340614] vmbr23: port
2(tap112i0) entered forwarding state
Jan 24 22:04:25 sanmarko ceph-osd[1277578]: 2022-01-24T22:04:25.772+0100
7fdfe8961700 -1 log_channel(cluster) log [ERR] : 4.db full-object read
crc 0x6c359b89 != expected 0x84ca4d31 on
4:db6f8075:::rbd_data.319eae6b8b4567.0000000000000237:head
Jan 24 22:04:25 sanmarko ceph-osd[1277578]: 2022-01-24T22:04:25.772+0100
7fdfe8961700 -1 log_channel(cluster) log [ERR] : 4.db missing primary
copy of 4:db6f8075:::rbd_data.319eae6b8b4567.0000000000000237:head, will
try copies on 2,10
Jan 24 22:04:26 sanmarko kernel: [389233.797384] ------------[ cut here
]------------
Jan 24 22:04:26 sanmarko kernel: [389233.797387] WARNING: CPU: 8 PID:
1277607 at mm/truncate.c:395 truncate_inode_pages_range+0x4e5/0x5e0
Jan 24 22:04:26 sanmarko kernel: [389233.797392] Modules linked in: veth
nfsv3 nfs_acl nfs lockd grace fscache netfs xfs ebtable_filter ebtables
ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables
iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel ipmi_devintf
ipmi_msghandler nf_tables 8021q garp mrp bonding tls softdog
nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common
snd_hda_codec_realtek ppdev snd_hda_codec_generic ledtrig_audio
snd_hda_codec_hdmi edac_mce_amd nouveau mxm_wmi snd_hda_intel kvm_amd
video snd_intel_dspcfg snd_intel_sdw_acpi drm_ttm_helper snd_hda_codec
ttm kvm irqbypass snd_hda_core drm_kms_helper crct10dif_pclmul snd_hwdep
ghash_clmulni_intel cec snd_pcm aesni_intel rc_core i2c_algo_bit
snd_timer crypto_simd cryptd fb_sys_fops snd syscopyarea sysfillrect
rapl sysimgblt soundcore ccp pcspkr k10temp gigabyte_wmi wmi_bmof
parport_pc zfs(PO) parport mac_hid zunicode(PO) zzstd(O) zlua(O)
zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost
vhost_iotlb tap ib_iser rdma_cm iw_cm
Jan 24 22:04:26 sanmarko kernel: [389233.797435] ib_cm ib_core
iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc
ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress
raid6_pq crc32_pclmul i2c_piix4 bnx2x mdio r8169 libcrc32c realtek
xhci_pci xhci_pci_renesas ahci xhci_hcd libahci wmi gpio_amdpt gpio_generic
Jan 24 22:04:26 sanmarko kernel: [389233.797451] CPU: 8 PID: 1277607
Comm: tp_fstore_op Tainted: P O 5.13.19-3-pve #1
Jan 24 22:04:26 sanmarko kernel: [389233.797453] Hardware name: Gigabyte
Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
Jan 24 22:04:26 sanmarko kernel: [389233.797454] RIP:
0010:truncate_inode_pages_range+0x4e5/0x5e0
Jan 24 22:04:26 sanmarko kernel: [389233.797456] Code: 8b 45 08 48 8d 50
ff a8 01 4c 0f 45 ea 0f 1f 44 00 00 f0 41 ff 4d 34 0f 85 c1 fc ff ff 4c
89 ef e8 30 ce ff ff e9 b4 fc ff ff <0f> 0b e9 8b fd ff ff 49 8b 45 00
48 c1 e8 33 83 e0 07 83 f8 04 0f
Jan 24 22:04:26 sanmarko kernel: [389233.797457] RSP:
0018:ffff9fbb0285fbf0 EFLAGS: 00010293
Jan 24 22:04:26 sanmarko kernel: [389233.797458] RAX: 0000000000000249
RBX: ffff9fbb0285fc88 RCX: 0000000000000000
Jan 24 22:04:26 sanmarko kernel: [389233.797459] RDX: 0000000000000000
RSI: ffffffffffffffc0 RDI: 0000000000000000
Jan 24 22:04:26 sanmarko kernel: [389233.797460] RBP: ffff9fbb0285fd38
R08: 0000000000000000 R09: fffffffffffffffe
Jan 24 22:04:26 sanmarko kernel: [389233.797461] R10: 0000000000001000
R11: 0000000000000002 R12: ffff8eb7c315e418
Jan 24 22:04:26 sanmarko kernel: [389233.797462] R13: fffff965da0b5940
R14: ffff9fbb0285fc10 R15: 0000000000000000
Jan 24 22:04:26 sanmarko kernel: [389233.797463] FS:
00007fdffa427700(0000) GS:ffff8ece1e600000(0000) knlGS:0000000000000000
Jan 24 22:04:26 sanmarko kernel: [389233.797464] CS: 0010 DS: 0000 ES:
0000 CR0: 0000000080050033
Jan 24 22:04:26 sanmarko kernel: [389233.797465] CR2: 00007fdffd74fc4a
CR3: 000000016e3b6000 CR4: 0000000000350ee0
Jan 24 22:04:26 sanmarko kernel: [389233.797466] Call Trace:
Jan 24 22:04:26 sanmarko kernel: [389233.797467] <TASK>
Jan 24 22:04:26 sanmarko kernel: [389233.797470] ?
xfs_stats_format+0x1f1/0x220 [xfs]
Jan 24 22:04:26 sanmarko kernel: [389233.797524]
truncate_inode_pages_final+0x4a/0x50
Jan 24 22:04:26 sanmarko kernel: [389233.797526] evict+0x16b/0x180
Jan 24 22:04:26 sanmarko kernel: [389233.797528] iput.part.0+0x173/0x1d0
Jan 24 22:04:26 sanmarko kernel: [389233.797529] iput+0x1c/0x30
Jan 24 22:04:26 sanmarko kernel: [389233.797530] do_unlinkat+0x1cd/0x2d0
Jan 24 22:04:26 sanmarko kernel: [389233.797531] __x64_sys_unlink+0x42/0x60
Jan 24 22:04:26 sanmarko kernel: [389233.797532] ?
syscall_enter_from_user_mode+0x2b/0x40
Jan 24 22:04:26 sanmarko kernel: [389233.797534] do_syscall_64+0x61/0xb0
Jan 24 22:04:26 sanmarko kernel: [389233.797536] ?
syscall_exit_to_user_mode+0x27/0x50
Jan 24 22:04:26 sanmarko kernel: [389233.797537] ?
__x64_sys_newstat+0x16/0x20
Jan 24 22:04:26 sanmarko kernel: [389233.797539] ? do_syscall_64+0x6e/0xb0
Jan 24 22:04:26 sanmarko kernel: [389233.797541] ?
exit_to_user_mode_prepare+0x37/0x1b0
Jan 24 22:04:26 sanmarko kernel: [389233.797543] ?
syscall_exit_to_user_mode+0x27/0x50
Jan 24 22:04:26 sanmarko kernel: [389233.797544] ?
__x64_sys_newstat+0x16/0x20
Jan 24 22:04:26 sanmarko kernel: [389233.797545] ? do_syscall_64+0x6e/0xb0
Jan 24 22:04:26 sanmarko kernel: [389233.797547] ? do_syscall_64+0x6e/0xb0
Jan 24 22:04:26 sanmarko kernel: [389233.797548] ?
irqentry_exit_to_user_mode+0x9/0x20
Jan 24 22:04:26 sanmarko kernel: [389233.797549] ? irqentry_exit+0x19/0x30
Jan 24 22:04:26 sanmarko kernel: [389233.797550] ?
exc_page_fault+0x8f/0x170
Jan 24 22:04:26 sanmarko kernel: [389233.797551] ?
asm_exc_page_fault+0x8/0x30
Jan 24 22:04:26 sanmarko kernel: [389233.797552]
entry_SYSCALL_64_after_hwframe+0x44/0xae
Jan 24 22:04:26 sanmarko kernel: [389233.797554] RIP: 0033:0x7fe00600eb57
Jan 24 22:04:26 sanmarko kernel: [389233.797555] Code: f0 ff ff 73 01 c3
48 8b 0d 36 d3 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00
00 00 00 66 90 b8 57 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b
0d 09 d3 0c 00 f7 d8 64 89 01 48
Jan 24 22:04:26 sanmarko kernel: [389233.797556] RSP:
002b:00007fdffa423778 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
Jan 24 22:04:26 sanmarko kernel: [389233.797558] RAX: ffffffffffffffda
RBX: 00007fdffa423940 RCX: 00007fe00600eb57
Jan 24 22:04:26 sanmarko kernel: [389233.797559] RDX: 00007fdffa423730
RSI: 000056505e77c2e0 RDI: 00005650784789a0
Jan 24 22:04:26 sanmarko kernel: [389233.797559] RBP: 000056505ffacf00
R08: 0000000000000007 R09: 000056505e77c200
Jan 24 22:04:26 sanmarko kernel: [389233.797560] R10: 00000002b2830000
R11: 0000000000000246 R12: 00005650797cac60
Jan 24 22:04:26 sanmarko kernel: [389233.797561] R13: 00007fdffa423840
R14: 00005650797cac60 R15: 00007fdffa423ac0
Jan 24 22:04:26 sanmarko kernel: [389233.797562] </TASK>
Jan 24 22:04:26 sanmarko kernel: [389233.797562] ---[ end trace
fc75378776329d96 ]---
Jan 24 22:04:26 sanmarko kernel: [389233.797566] ------------[ cut here
]------------
Jan 24 22:04:26 sanmarko kernel: [389233.797566] WARNING: CPU: 8 PID:
1277607 at mm/truncate.c:395 truncate_inode_pages_range+0x4e5/0x5e0
Jan 24 22:04:26 sanmarko kernel: [389233.797568] Modules linked in: veth
nfsv3 nfs_acl nfs lockd grace fscache netfs xfs ebtable_filter ebtables
ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables
iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel ipmi_devintf
ipmi_msghandler nf_tables 8021q garp mrp bonding tls softdog
nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common
snd_hda_codec_realtek ppdev snd_hda_codec_generic ledtrig_audio
snd_hda_codec_hdmi edac_mce_amd nouveau mxm_wmi snd_hda_intel kvm_amd
video snd_intel_dspcfg snd_intel_sdw_acpi drm_ttm_helper snd_hda_codec
ttm kvm irqbypass snd_hda_core drm_kms_helper crct10dif_pclmul snd_hwdep
ghash_clmulni_intel cec snd_pcm aesni_intel rc_core i2c_algo_bit
snd_timer crypto_simd cryptd fb_sys_fops snd syscopyarea sysfillrect
rapl sysimgblt soundcore ccp pcspkr k10temp gigabyte_wmi wmi_bmof
parport_pc zfs(PO) parport mac_hid zunicode(PO) zzstd(O) zlua(O)
zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost
vhost_iotlb tap ib_iser rdma_cm iw_cm
Jan 24 22:04:26 sanmarko kernel: [389233.797594] ib_cm ib_core
iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc
ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress
raid6_pq crc32_pclmul i2c_piix4 bnx2x mdio r8169 libcrc32c realtek
xhci_pci xhci_pci_renesas ahci xhci_hcd libahci wmi gpio_amdpt gpio_generic
Jan 24 22:04:26 sanmarko kernel: [389233.797602] CPU: 8 PID: 1277607
Comm: tp_fstore_op Tainted: P W O 5.13.19-3-pve #1
Jan 24 22:04:26 sanmarko kernel: [389233.797603] Hardware name: Gigabyte
Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
Jan 24 22:04:26 sanmarko kernel: [389233.797604] RIP:
0010:truncate_inode_pages_range+0x4e5/0x5e0
Jan 24 22:04:26 sanmarko kernel: [389233.797605] Code: 8b 45 08 48 8d 50
ff a8 01 4c 0f 45 ea 0f 1f 44 00 00 f0 41 ff 4d 34 0f 85 c1 fc ff ff 4c
89 ef e8 30 ce ff ff e9 b4 fc ff ff <0f> 0b e9 8b fd ff ff 49 8b 45 00
48 c1 e8 33 83 e0 07 83 f8 04 0f
Jan 24 22:04:26 sanmarko kernel: [389233.797606] RSP:
0018:ffff9fbb0285fbf0 EFLAGS: 00010293
Jan 24 22:04:26 sanmarko kernel: [389233.797607] RAX: 0000000000000249
RBX: ffff9fbb0285fc88 RCX: 0000000000000000
Jan 24 22:04:26 sanmarko kernel: [389233.797608] RDX: 0000000000000000
RSI: ffffffffffffffc0 RDI: 0000000000000000
Jan 24 22:04:26 sanmarko kernel: [389233.797608] RBP: ffff9fbb0285fd38
R08: 0000000000000000 R09: fffffffffffffffe
Jan 24 22:04:26 sanmarko kernel: [389233.797609] R10: 0000000000001000
R11: 0000000000000002 R12: ffff8eb7c315e418
Jan 24 22:04:26 sanmarko kernel: [389233.797610] R13: fffff965da0b5940
R14: ffff9fbb0285fc10 R15: 0000000000000000
Jan 24 22:04:26 sanmarko kernel: [389233.797610] FS:
00007fdffa427700(0000) GS:ffff8ece1e600000(0000) knlGS:0000000000000000
Jan 24 22:04:26 sanmarko kernel: [389233.797611] CS: 0010 DS: 0000 ES:
0000 CR0: 0000000080050033
Jan 24 22:04:26 sanmarko kernel: [389233.797612] CR2: 00007fdffd74fc4a
CR3: 000000016e3b6000 CR4: 0000000000350ee0
Jan 24 22:04:26 sanmarko kernel: [389233.797613] Call Trace:
Jan 24 22:04:26 sanmarko kernel: [389233.797613] <TASK>
Jan 24 22:04:26 sanmarko kernel: [389233.797614] ?
xfs_stats_format+0x1f1/0x220 [xfs]
[... similar traces repeated endlessly...]
Jan 24 22:11:38 sanmarko kernel: [389666.155459] ---[ end trace
fc75378776669ecc ]---
Jan 24 22:11:38 sanmarko kernel: [389666.155462] ------------[ cut here
]------------
Jan 24 22:11:38 sanmarko kernel: [389666.155462] WARNING: CPU: 2 PID:
1277607 at mm/truncate.c:395 truncate_inode_pages_range+0x4e5/0x5e0
Jan 24 22:11:38 sanmarko kernel: [389666.155464] Modules linked in: veth
[... reset by Proxmox watchdog ...]
Jan 24 22:13:24 sanmarko systemd-modules-load[551]: Inserted module
'iscsi_tcp'
Jan 24 22:13:24 sanmarko kernel: [ 0.000000] Linux version
5.13.19-3-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110,
GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.13.19-6 (Tue, 11
Jan 2022 16:44:47 +0100) ()
Jan 24 22:13:24 sanmarko kernel: [ 0.000000] Command line:
BOOT_IMAGE=/vmlinuz-5.13.19-3-pve root=/dev/mapper/pve-root ro quiet
This seems related to XFS. The only XFS filesystems on this server belong to
the 3 filestore Ceph OSDs. The traces bounced from CPU 8 to 4, then 0, and finally 2.
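(For reference, this is roughly how we check which OSDs on a node are still
filestore on XFS; the exact metadata field names may vary between releases:

# df -T /var/lib/ceph/osd/ceph-6
# ceph osd metadata 6 | grep -i -e objectstore -e filestore

df -T shows the filesystem type of the OSD data dir mount, and the metadata
dump reports the objectstore backend.)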
The issue seems to be OSD.6 on that node (it was reported down for about 10
minutes); relevant log:
2022-01-24T21:50:06.883+0100 7fdfea965700 0 log_channel(cluster) log
[DBG] : 4.1c2 scrub ok
2022-01-24T21:57:17.344+0100 7fdffd42d700 1 leveldb: Level-0 table
#1273240: started
2022-01-24T21:57:17.412+0100 7fdffd42d700 1 leveldb: Level-0 table
#1273240: 1398346 bytes OK
2022-01-24T21:57:17.468+0100 7fdffd42d700 1 leveldb: Delete type=0 #1273237
2022-01-24T22:04:25.772+0100 7fdfe8961700 -1 log_channel(cluster) log
[ERR] : 4.db full-object read crc 0x6c359b89 != expected 0x84ca4d31 on
4:db6f8075:::rbd_data.319eae6b8b4567.0000000000000237:head
2022-01-24T22:04:25.772+0100 7fdfe8961700 -1 log_channel(cluster) log
[ERR] : 4.db missing primary copy of
4:db6f8075:::rbd_data.319eae6b8b4567.0000000000000237:head, will try
copies on 2,10
2022-01-24T22:04:40.992+0100 7fe0045e9700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7fdfe8961700' had timed out after 15.000000954s
2022-01-24T22:04:40.992+0100 7fe003de8700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7fdfe8961700' had timed out after 15.000000954s
[... repeated lines with incrementing time...]
2022-01-24T22:05:04.468+0100 7fe003de8700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7fdfea164700' had timed out after 15.000000954s
2022-01-24T22:05:04.468+0100 7fe003de8700 1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7fdfea965700' had timed out after 15.000000954s
2022-01-24T22:13:31.851+0100 7fa52b9b6f00 0 set uid:gid to 64045:64045
(ceph:ceph)
2022-01-24T22:13:31.851+0100 7fa52b9b6f00 0 ceph version 16.2.7
(f9aa029788115b5df5eeee328f584156565ee5b7) pacific (stable), process
ceph-osd, pid 2290
2022-01-24T22:13:31.851+0100 7fa52b9b6f00 0 pidfile_write: ignore empty
--pid-file
2022-01-24T22:13:31.851+0100 7fa52b9b6f00 -1 missing 'type' file,
inferring filestore from current/ dir
2022-01-24T22:13:31.883+0100 7fa52b9b6f00 0 starting osd.6 osd_data
/var/lib/ceph/osd/ceph-6 /var/lib/ceph/osd/ceph-6/journal
There was a CRC error at 22:04:25.
The node was rebooted by the watchdog (Proxmox cluster) at about 22:11:40.
The other 3 OSDs on the same node (2 with filestore/XFS and one with
bluestore) don't report anything special.
We removed the giant log files, then rebooted, and now everything
is OK, including that OSD.6.
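(To double-check the object that failed the CRC read we plan to force a
deep scrub of that PG and look for inconsistencies, roughly:

# ceph pg deep-scrub 4.db
# ceph health detail
# rados list-inconsistent-obj 4.db --format=json-pretty

and only run "ceph pg repair 4.db" if an inconsistency is actually reported.
This is just the standard scrub/repair procedure from memory, we haven't run
it on this PG yet.)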
Can anyone make any sense of this? Some time ago we had to rebuild bluestore
OSD.15 on this node because of a (small) corruption issue in its DB.
# dpkg -l|grep ceph
ii  ceph                   16.2.7  amd64  distributed storage and file system
ii  ceph-base              16.2.7  amd64  common ceph daemon libraries and management tools
ii  ceph-common            16.2.7  amd64  common utilities to mount and interact with a ceph storage cluster
ii  ceph-fuse              16.2.7  amd64  FUSE-based client for the Ceph distributed file system
ii  ceph-mds               16.2.7  amd64  metadata server for the ceph distributed file system
ii  ceph-mgr               16.2.7  amd64  manager for the ceph distributed storage system
ii  ceph-mgr-modules-core  16.2.7  all    ceph manager modules which are always enabled
ii  ceph-mon               16.2.7  amd64  monitor server for the ceph storage system
ii  ceph-osd               16.2.7  amd64  OSD server for the ceph storage system
ii  libcephfs2             16.2.7  amd64  Ceph distributed file system client library
ii  python3-ceph-argparse  16.2.7  all    Python 3 utility libraries for Ceph CLI
ii  python3-ceph-common    16.2.7  all    Python 3 utility libraries for Ceph
ii  python3-cephfs         16.2.7  amd64  Python 3 libraries for the Ceph libcephfs library
Thanks
Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project
Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/