Re: Pacific - XFS filestore OSD CRC error "infinite kernel crash dumps"

Hi,

And there was another OSD crash 2 days ago. This OSD (OSD.15) had been recreated not long ago because of a DB corruption:

   -12> 2022-01-22T23:25:38.912+0100 7f58632db700  4 rocksdb: [flush_job.cc:394] [L] [JOB 1008] Level-0 flush table #14693: 5136 bytes OK
   -11> 2022-01-22T23:25:38.912+0100 7f58632db700  4 rocksdb: (Original Log Time 2022/01/22-23:25:38.915354) [memtable_list.cc:445] [L] Level-0 commit table #14693 started
   -10> 2022-01-22T23:25:38.912+0100 7f58632db700  4 rocksdb: (Original Log Time 2022/01/22-23:25:38.915708) [memtable_list.cc:501] [L] Level-0 commit table #14693: memtable #1 done
    -9> 2022-01-22T23:25:38.912+0100 7f58632db700  4 rocksdb: (Original Log Time 2022/01/22-23:25:38.915715) EVENT_LOG_v1 {"time_micros": 1642890338915712, "job": 1008, "event": "flush_finished", "output_compression": "NoCompression", "lsm_state": [4, 26, 0, 0, 0, 0, 0], "immutable_memtables": 0}
    -8> 2022-01-22T23:25:38.912+0100 7f58632db700  4 rocksdb: (Original Log Time 2022/01/22-23:25:38.915731) [db_impl/db_impl_compaction_flush.cc:203] [L] Level summary: files[4 26 0 0 0 0 0] max score 1.00

    -7> 2022-01-22T23:25:38.912+0100 7f58632db700  4 rocksdb: [db_impl/db_impl_files.cc:348] [JOB 1008] Try to delete WAL files size 765830441, prev total WAL file size 1074304495, number of live WAL files 4.

    -6> 2022-01-22T23:25:38.972+0100 7f5863adc700  3 rocksdb: [db_impl/db_impl_compaction_flush.cc:2807] Compaction error: Corruption: block checksum mismatch: expected 2762213168, got 627642509  in db/014659.sst offset 196130 size 3703
    -5> 2022-01-22T23:25:38.972+0100 7f5863adc700  4 rocksdb: (Original Log Time 2022/01/22-23:25:38.974234) [compaction/compaction_job.cc:743] [O-1] compacted to: files[4 1 7 0 0 0 0] max score 0.15, MB/sec: 1041.4 rd, 0.0 wr, level 1, files in(4, 1) out(1) MB in(6.0, 54.6) out(0.0), read-write-amplify(10.1) write-amplify(0.0) Corruption: block checksum mismatch: expected 2762213168, got 627642509  in db/014659.sst offset 196130 size 3703, records in: 161439, records dropped: 148463 output_compression: NoCompression

    -4> 2022-01-22T23:25:38.972+0100 7f5863adc700  4 rocksdb: (Original Log Time 2022/01/22-23:25:38.974258) EVENT_LOG_v1 {"time_micros": 1642890338974247, "job": 1014, "event": "compaction_finished", "compaction_time_micros": 61089, "compaction_time_cpu_micros": 49334, "output_level": 1, "num_output_files": 1, "total_output_size": 4542771, "num_input_records": 161439, "num_output_records": 12976, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [4, 1, 7, 0, 0, 0, 0]}
    -3> 2022-01-22T23:25:38.972+0100 7f5863adc700  2 rocksdb: [db_impl/db_impl_compaction_flush.cc:2341] Waiting after background compaction error: Corruption: block checksum mismatch: expected 2762213168, got 627642509  in db/014659.sst offset 196130 size 3703, Accumulated background error counts: 1
    -2> 2022-01-22T23:25:39.008+0100 7f585ead2700 -1 rocksdb: submit_common error: Corruption: block checksum mismatch: expected 2762213168, got 627642509  in db/014659.sst offset 196130 size 3703 code = ^B Rocksdb transaction:
PutCF( prefix = P key = 0x00000000000008'<.0000044304.00000000000124643116' value size = 236)
PutCF( prefix = P key = 0x00000000000008'<._fastinfo' value size = 186)
PutCF( prefix = O key = 0x7F8000000000000007B3AD'GN!rbd_data.a002336b8b4567.0000000000000179!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F0030000078 value size = 505)
PutCF( prefix = O key = 0x7F8000000000000007B3AD'GN!rbd_data.a002336b8b4567.0000000000000179!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F value size = 445)
MergeCF( prefix = b key = 0x000000C050D00000 value size = 16)
MergeCF( prefix = b key = 0x000000C064300000 value size = 16)
MergeCF( prefix = b key = 0x000000C06E980000 value size = 16)
MergeCF( prefix = b key = 0x000000C06FF00000 value size = 16)
MergeCF( prefix = b key = 0x000000C070B00000 value size = 16)
MergeCF( prefix = b key = 0x000000C07C700000 value size = 16)
MergeCF( prefix = b key = 0x000000C07FB80000 value size = 16)
MergeCF( prefix = b key = 0x000000C080000000 value size = 16)
MergeCF( prefix = b key = 0x000000958CD80000 value size = 16)
    -1> 2022-01-22T23:25:39.008+0100 7f585ead2700 -1 ./src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)' thread 7f585ead2700 time 2022-01-22T23:25:39.010427+0100
./src/os/bluestore/BlueStore.cc: 11723: FAILED ceph_assert(r == 0)

 ceph version 16.2.7 (f9aa029788115b5df5eeee328f584156565ee5b7) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x55b6fd8edae6]
 2: /usr/bin/ceph-osd(+0xabcc71) [0x55b6fd8edc71]
 3: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x5ff) [0x55b6fdf03e8f]
 4: (BlueStore::_kv_sync_thread()+0x1a23) [0x55b6fdf41c03]
 5: (BlueStore::KVSyncThread::entry()+0xd) [0x55b6fdf6b0ed]
 6: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f586ec21ea7]
 7: clone()

     0> 2022-01-22T23:25:39.012+0100 7f585ead2700 -1 *** Caught signal (Aborted) **
 in thread 7f585ead2700 thread_name:bstore_kv_sync

 ceph version 16.2.7 (f9aa029788115b5df5eeee328f584156565ee5b7) pacific (stable)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f586ec2d140]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x16e) [0x55b6fd8edb30]
 5: /usr/bin/ceph-osd(+0xabcc71) [0x55b6fd8edc71]
 6: (BlueStore::_txc_apply_kv(BlueStore::TransContext*, bool)+0x5ff) [0x55b6fdf03e8f]
 7: (BlueStore::_kv_sync_thread()+0x1a23) [0x55b6fdf41c03]
 8: (BlueStore::KVSyncThread::entry()+0xd) [0x55b6fdf6b0ed]
 9: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f586ec21ea7]
 10: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

OSD.15 was created with v16.2.6 on the 3rd of December.
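Before rebuilding this OSD yet again, an offline consistency check should tell whether the corruption is confined to RocksDB; a minimal sketch, assuming the default data path for OSD.15 and the daemon stopped first:

systemctl stop ceph-osd@15
ceph-bluestore-tool fsck --deep yes --path /var/lib/ceph/osd/ceph-15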

I'm starting to think that there's something wrong with this node, either the mainboard or the RAM...
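To test that theory without swapping parts first, the kernel's machine-check/EDAC counters are worth a look; a sketch, assuming rasdaemon and edac-utils are installed (package names may differ):

dmesg | grep -iE 'mce|machine check|edac'
ras-mc-ctl --summary    # per-DIMM corrected/uncorrected error summary
edac-util -v            # raw EDAC counters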

Thanks

On 25/1/22 at 14:06, Eneko Lacunza wrote:
Hi,

OK, one more thing: that OSD.6 had 3 crashes during the day due to heartbeat issues. There had been no issues for weeks before that:

2022-01-24T17:53:19.259+0100 7f88fe21f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f88e1fea700' had timed out after 15.000000954s
[...]
2022-01-24T17:53:41.703+0100 7f88fe21f700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f88e1fea700' had timed out after 15.000000954s
2022-01-24T17:54:07.267+0100 7f88f1982700  0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.6 down, but it is still running
2022-01-24T17:54:07.267+0100 7f88f1982700  0 log_channel(cluster) log [DBG] : map e44343 wrongly marked me down at e44342
2022-01-24T17:54:07.267+0100 7f88f1982700  1 osd.6 44343 start_waiting_for_healthy
2022-01-24T17:54:07.267+0100 7f88e27eb700  1 osd.6 pg_epoch: 44342 pg[4.1c5( v 44341'17530227 (44341'17528013,44341'17530227] local-lis/les=44228/44229 n=816 ec=16/16 lis/c=44228/44228 les/c/f=44229/44229/0 sis=44342) [8,4] r=-1 lpr=44342 pi=[44228,44342)/1 luod=0'0 lua=44203'17504662 crt=44341'17530227 lcod 44341'17530226 mlcod 0'0 active mbc={}] start_peering_interval up [8,6,4] -> [8,4], acting [8,6,4] -> [8,4], acting_primary 8 -> 8, up_primary 8 -> 8, role 1 -> -1, features acting 4540138297136906239 upacting 4540138297136906239
[...]
2022-01-24T17:54:07.327+0100 7f88e2fec700  1 osd.6 pg_epoch: 44343 pg[4.1dd( v 44341'41225907 (44341'41223724,44341'41225907] local-lis/les=44228/44229 n=805 ec=16/16 lis/c=44228/44228 les/c/f=44229/44229/0 sis=44342) [8,4] r=-1 lpr=44342 pi=[44228,44342)/1 crt=44341'41225907 lcod 44341'41225906 mlcod 0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
2022-01-24T17:54:07.619+0100 7f88f979e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f88e1fea700' had timed out after 15.000000954s
2022-01-24T17:54:07.619+0100 7f88f979e700  1 osd.6 44343 is_healthy false -- internal heartbeat failed
2022-01-24T17:54:07.619+0100 7f88f979e700  1 osd.6 44343 not healthy; waiting to boot
2022-01-24T17:54:07.619+0100 7f88f979e700  1 osd.6 44343 tick checking mon for new map
2022-01-24T17:54:08.583+0100 7f88f979e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f88e1fea700' had timed out after 15.000000954s
2022-01-24T17:54:08.583+0100 7f88f979e700  1 osd.6 44343 is_healthy false -- internal heartbeat failed
2022-01-24T17:54:08.583+0100 7f88f979e700  1 osd.6 44343 not healthy; waiting to boot
[...]
2022-01-24T17:55:33.236+0100 7f88f979e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f88e1fea700' had timed out after 15.000000954s
2022-01-24T17:55:33.236+0100 7f88f979e700  1 osd.6 44343 is_healthy false -- internal heartbeat failed
2022-01-24T17:55:33.236+0100 7f88f979e700  1 osd.6 44343 not healthy; waiting to boot
2022-01-24T17:55:34.268+0100 7f88f979e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f88e1fea700' had timed out after 15.000000954s
2022-01-24T17:55:34.268+0100 7f88f979e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f88e1fea700' had suicide timed out after 150.000000000s
2022-01-24T17:55:35.280+0100 7f88f979e700 -1 ./src/common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, ceph::coarse_mono_time)' thread 7f88f979e700 time 2022-01-24T17:55:35.273517+0100
./src/common/HeartbeatMap.cc: 85: ceph_abort_msg("hit suicide timeout")

 ceph version 16.2.7 (f9aa029788115b5df5eeee328f584156565ee5b7) pacific (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd3) [0x5561916ec0df]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >)+0x504) [0x556191e4dc74]
 3: (ceph::HeartbeatMap::is_healthy()+0xe6) [0x556191e4ebe6]
 4: (OSD::_is_healthy()+0x2f) [0x5561917ccecf]
 5: (OSD::start_boot()+0x24) [0x5561917d55f4]
 6: (OSD::tick()+0x585) [0x5561917d6b25]
 7: (Context::complete(int)+0x9) [0x5561917edbf9]
 8: (CommonSafeTimer<std::mutex>::timer_thread()+0x11d) [0x556191e6c4bd]
 9: (CommonSafeTimerThread<std::mutex>::entry()+0xd) [0x556191e6d6ed]
 10: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f88ff8b7ea7]
 11: clone()

2022-01-24T17:55:35.284+0100 7f88f979e700 -1 *** Caught signal (Aborted) **
 in thread 7f88f979e700 thread_name:safe_timer

 ceph version 16.2.7 (f9aa029788115b5df5eeee328f584156565ee5b7) pacific (stable)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14140) [0x7f88ff8c3140]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x18a) [0x5561916ec196]
 5: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >)+0x504) [0x556191e4dc74]
 6: (ceph::HeartbeatMap::is_healthy()+0xe6) [0x556191e4ebe6]
 7: (OSD::_is_healthy()+0x2f) [0x5561917ccecf]
 8: (OSD::start_boot()+0x24) [0x5561917d55f4]
 9: (OSD::tick()+0x585) [0x5561917d6b25]
 10: (Context::complete(int)+0x9) [0x5561917edbf9]
 11: (CommonSafeTimer<std::mutex>::timer_thread()+0x11d) [0x556191e6c4bd]
 12: (CommonSafeTimerThread<std::mutex>::entry()+0xd) [0x556191e6d6ed]
 13: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8ea7) [0x7f88ff8b7ea7]
 14: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

And the same again, starting at 17:57:32 and at 18:01:52.
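These aborts should also have been captured by the crash module (assuming the ceph-crash service is running), which makes collecting them for a tracker ticket easier:

ceph crash ls
ceph crash info <crash-id>    # <crash-id> taken from the first column of 'ceph crash ls'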

On 25/1/22 at 12:49, Eneko Lacunza wrote:
Hi all,

Tonight a node of our 5-node cluster had an issue with an XFS filestore OSD. The kernel started logging crash dumps and didn't stop until there was no free disk space left, affecting the entire node.

Space was eaten by kernel logs in /var/log/kern.log, /var/log/messages and /var/log/syslog, each of which had grown to more than 20GB by this morning.
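To keep a repeat of this from filling the root filesystem, size-based rotation could be added for these files; a sketch of a logrotate stanza (thresholds are only an example):

/var/log/kern.log /var/log/syslog /var/log/messages {
    size 1G
    rotate 2
    compress
    missingok
    notifempty
}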

syslog:

Jan 24 22:04:06 sanmarko kernel: [389213.340613] vmbr23: port 2(tap112i0) entered blocking state
Jan 24 22:04:06 sanmarko kernel: [389213.340614] vmbr23: port 2(tap112i0) entered forwarding state
Jan 24 22:04:25 sanmarko ceph-osd[1277578]: 2022-01-24T22:04:25.772+0100 7fdfe8961700 -1 log_channel(cluster) log [ERR] : 4.db full-object read crc 0x6c359b89 != expected 0x84ca4d31 on 4:db6f8075:::rbd_data.319eae6b8b4567.0000000000000237:head
Jan 24 22:04:25 sanmarko ceph-osd[1277578]: 2022-01-24T22:04:25.772+0100 7fdfe8961700 -1 log_channel(cluster) log [ERR] : 4.db missing primary copy of 4:db6f8075:::rbd_data.319eae6b8b4567.0000000000000237:head, will try copies on 2,10
Jan 24 22:04:26 sanmarko kernel: [389233.797384] ------------[ cut here ]------------
Jan 24 22:04:26 sanmarko kernel: [389233.797387] WARNING: CPU: 8 PID: 1277607 at mm/truncate.c:395 truncate_inode_pages_range+0x4e5/0x5e0
Jan 24 22:04:26 sanmarko kernel: [389233.797392] Modules linked in: veth nfsv3 nfs_acl nfs lockd grace fscache netfs xfs ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter bpfilter sctp ip6_udp_tunnel udp_tunnel ipmi_devintf ipmi_msghandler nf_tables 8021q garp mrp bonding tls softdog nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common snd_hda_codec_realtek ppdev snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi edac_mce_amd nouveau mxm_wmi snd_hda_intel kvm_amd video snd_intel_dspcfg snd_intel_sdw_acpi drm_ttm_helper snd_hda_codec ttm kvm irqbypass snd_hda_core drm_kms_helper crct10dif_pclmul snd_hwdep ghash_clmulni_intel cec snd_pcm aesni_intel rc_core i2c_algo_bit snd_timer crypto_simd cryptd fb_sys_fops snd syscopyarea sysfillrect rapl sysimgblt soundcore ccp pcspkr k10temp gigabyte_wmi wmi_bmof parport_pc zfs(PO) parport mac_hid zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_net vhost vhost_iotlb tap ib_iser rdma_cm iw_cm
Jan 24 22:04:26 sanmarko kernel: [389233.797435]  ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm sunrpc ip_tables x_tables autofs4 btrfs blake2b_generic xor zstd_compress raid6_pq crc32_pclmul i2c_piix4 bnx2x mdio r8169 libcrc32c realtek xhci_pci xhci_pci_renesas ahci xhci_hcd libahci wmi gpio_amdpt gpio_generic
Jan 24 22:04:26 sanmarko kernel: [389233.797451] CPU: 8 PID: 1277607 Comm: tp_fstore_op Tainted: P           O 5.13.19-3-pve #1
Jan 24 22:04:26 sanmarko kernel: [389233.797453] Hardware name: Gigabyte Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
Jan 24 22:04:26 sanmarko kernel: [389233.797454] RIP: 0010:truncate_inode_pages_range+0x4e5/0x5e0
Jan 24 22:04:26 sanmarko kernel: [389233.797456] Code: 8b 45 08 48 8d 50 ff a8 01 4c 0f 45 ea 0f 1f 44 00 00 f0 41 ff 4d 34 0f 85 c1 fc ff ff 4c 89 ef e8 30 ce ff ff e9 b4 fc ff ff <0f> 0b e9 8b fd ff ff 49 8b 45 00 48 c1 e8 33 83 e0 07 83 f8 04 0f
Jan 24 22:04:26 sanmarko kernel: [389233.797457] RSP: 0018:ffff9fbb0285fbf0 EFLAGS: 00010293
Jan 24 22:04:26 sanmarko kernel: [389233.797458] RAX: 0000000000000249 RBX: ffff9fbb0285fc88 RCX: 0000000000000000
Jan 24 22:04:26 sanmarko kernel: [389233.797459] RDX: 0000000000000000 RSI: ffffffffffffffc0 RDI: 0000000000000000
Jan 24 22:04:26 sanmarko kernel: [389233.797460] RBP: ffff9fbb0285fd38 R08: 0000000000000000 R09: fffffffffffffffe
Jan 24 22:04:26 sanmarko kernel: [389233.797461] R10: 0000000000001000 R11: 0000000000000002 R12: ffff8eb7c315e418
Jan 24 22:04:26 sanmarko kernel: [389233.797462] R13: fffff965da0b5940 R14: ffff9fbb0285fc10 R15: 0000000000000000
Jan 24 22:04:26 sanmarko kernel: [389233.797463] FS: 00007fdffa427700(0000) GS:ffff8ece1e600000(0000) knlGS:0000000000000000
Jan 24 22:04:26 sanmarko kernel: [389233.797464] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 24 22:04:26 sanmarko kernel: [389233.797465] CR2: 00007fdffd74fc4a CR3: 000000016e3b6000 CR4: 0000000000350ee0
Jan 24 22:04:26 sanmarko kernel: [389233.797466] Call Trace:
Jan 24 22:04:26 sanmarko kernel: [389233.797467]  <TASK>
Jan 24 22:04:26 sanmarko kernel: [389233.797470]  ? xfs_stats_format+0x1f1/0x220 [xfs]
Jan 24 22:04:26 sanmarko kernel: [389233.797524]  truncate_inode_pages_final+0x4a/0x50
Jan 24 22:04:26 sanmarko kernel: [389233.797526] evict+0x16b/0x180
Jan 24 22:04:26 sanmarko kernel: [389233.797528] iput.part.0+0x173/0x1d0
Jan 24 22:04:26 sanmarko kernel: [389233.797529]  iput+0x1c/0x30
Jan 24 22:04:26 sanmarko kernel: [389233.797530] do_unlinkat+0x1cd/0x2d0
Jan 24 22:04:26 sanmarko kernel: [389233.797531]  __x64_sys_unlink+0x42/0x60
Jan 24 22:04:26 sanmarko kernel: [389233.797532]  ? syscall_enter_from_user_mode+0x2b/0x40
Jan 24 22:04:26 sanmarko kernel: [389233.797534] do_syscall_64+0x61/0xb0
Jan 24 22:04:26 sanmarko kernel: [389233.797536]  ? syscall_exit_to_user_mode+0x27/0x50
Jan 24 22:04:26 sanmarko kernel: [389233.797537]  ? __x64_sys_newstat+0x16/0x20
Jan 24 22:04:26 sanmarko kernel: [389233.797539]  ? do_syscall_64+0x6e/0xb0
Jan 24 22:04:26 sanmarko kernel: [389233.797541]  ? exit_to_user_mode_prepare+0x37/0x1b0
Jan 24 22:04:26 sanmarko kernel: [389233.797543]  ? syscall_exit_to_user_mode+0x27/0x50
Jan 24 22:04:26 sanmarko kernel: [389233.797544]  ? __x64_sys_newstat+0x16/0x20
Jan 24 22:04:26 sanmarko kernel: [389233.797545]  ? do_syscall_64+0x6e/0xb0
Jan 24 22:04:26 sanmarko kernel: [389233.797547]  ? do_syscall_64+0x6e/0xb0
Jan 24 22:04:26 sanmarko kernel: [389233.797548]  ? irqentry_exit_to_user_mode+0x9/0x20
Jan 24 22:04:26 sanmarko kernel: [389233.797549]  ? irqentry_exit+0x19/0x30
Jan 24 22:04:26 sanmarko kernel: [389233.797550]  ? exc_page_fault+0x8f/0x170
Jan 24 22:04:26 sanmarko kernel: [389233.797551]  ? asm_exc_page_fault+0x8/0x30
Jan 24 22:04:26 sanmarko kernel: [389233.797552]  entry_SYSCALL_64_after_hwframe+0x44/0xae
Jan 24 22:04:26 sanmarko kernel: [389233.797554] RIP: 0033:0x7fe00600eb57
Jan 24 22:04:26 sanmarko kernel: [389233.797555] Code: f0 ff ff 73 01 c3 48 8b 0d 36 d3 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 57 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 d3 0c 00 f7 d8 64 89 01 48
Jan 24 22:04:26 sanmarko kernel: [389233.797556] RSP: 002b:00007fdffa423778 EFLAGS: 00000246 ORIG_RAX: 0000000000000057
Jan 24 22:04:26 sanmarko kernel: [389233.797558] RAX: ffffffffffffffda RBX: 00007fdffa423940 RCX: 00007fe00600eb57
Jan 24 22:04:26 sanmarko kernel: [389233.797559] RDX: 00007fdffa423730 RSI: 000056505e77c2e0 RDI: 00005650784789a0
Jan 24 22:04:26 sanmarko kernel: [389233.797559] RBP: 000056505ffacf00 R08: 0000000000000007 R09: 000056505e77c200
Jan 24 22:04:26 sanmarko kernel: [389233.797560] R10: 00000002b2830000 R11: 0000000000000246 R12: 00005650797cac60
Jan 24 22:04:26 sanmarko kernel: [389233.797561] R13: 00007fdffa423840 R14: 00005650797cac60 R15: 00007fdffa423ac0
Jan 24 22:04:26 sanmarko kernel: [389233.797562]  </TASK>
Jan 24 22:04:26 sanmarko kernel: [389233.797562] ---[ end trace fc75378776329d96 ]---
Jan 24 22:04:26 sanmarko kernel: [389233.797566] ------------[ cut here ]------------
Jan 24 22:04:26 sanmarko kernel: [389233.797566] WARNING: CPU: 8 PID: 1277607 at mm/truncate.c:395 truncate_inode_pages_range+0x4e5/0x5e0
[... same module list as above ...]
Jan 24 22:04:26 sanmarko kernel: [389233.797602] CPU: 8 PID: 1277607 Comm: tp_fstore_op Tainted: P        W  O 5.13.19-3-pve #1
Jan 24 22:04:26 sanmarko kernel: [389233.797603] Hardware name: Gigabyte Technology Co., Ltd. B450 AORUS M/B450 AORUS M, BIOS F50 11/27/2019
[... same RIP, Code and register dump as above ...]
Jan 24 22:04:26 sanmarko kernel: [389233.797613] Call Trace:
Jan 24 22:04:26 sanmarko kernel: [389233.797613]  <TASK>
Jan 24 22:04:26 sanmarko kernel: [389233.797614]  ? xfs_stats_format+0x1f1/0x220 [xfs]
[... similar traces repeated endlessly...]
Jan 24 22:11:38 sanmarko kernel: [389666.155459] ---[ end trace fc75378776669ecc ]---
Jan 24 22:11:38 sanmarko kernel: [389666.155462] ------------[ cut here ]------------
Jan 24 22:11:38 sanmarko kernel: [389666.155462] WARNING: CPU: 2 PID: 1277607 at mm/truncate.c:395 truncate_inode_pages_range+0x4e5/0x5e0
Jan 24 22:11:38 sanmarko kernel: [389666.155464] Modules linked in: veth
[... reset by Proxmox watchdog ...]
Jan 24 22:13:24 sanmarko systemd-modules-load[551]: Inserted module 'iscsi_tcp'
Jan 24 22:13:24 sanmarko kernel: [    0.000000] Linux version 5.13.19-3-pve (build@proxmox) (gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP PVE 5.13.19-6 (Tue, 11 Jan 2022 16:44:47 +0100) ()
Jan 24 22:13:24 sanmarko kernel: [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.13.19-3-pve root=/dev/mapper/pve-root ro quiet


This seems related to XFS. The only XFS filesystems on this server belong to the 3 filestore Ceph OSDs. The traces bounced from CPU 8 to 4, then 0, and finally 2.
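If it happens again, a read-only filesystem check of the affected OSD's data partition might be worth running; a sketch, with the device name as a placeholder and the OSD stopped and unmounted first:

systemctl stop ceph-osd@6
umount /var/lib/ceph/osd/ceph-6
xfs_repair -n /dev/sdX1    # -n: check only, makes no changes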

The issue seems to be OSD.6 on that node (it was reported down for about 10 minutes); relevant log:

2022-01-24T21:50:06.883+0100 7fdfea965700  0 log_channel(cluster) log [DBG] : 4.1c2 scrub ok
2022-01-24T21:57:17.344+0100 7fdffd42d700  1 leveldb: Level-0 table #1273240: started
2022-01-24T21:57:17.412+0100 7fdffd42d700  1 leveldb: Level-0 table #1273240: 1398346 bytes OK
2022-01-24T21:57:17.468+0100 7fdffd42d700  1 leveldb: Delete type=0 #1273237

2022-01-24T22:04:25.772+0100 7fdfe8961700 -1 log_channel(cluster) log [ERR] : 4.db full-object read crc 0x6c359b89 != expected 0x84ca4d31 on 4:db6f8075:::rbd_data.319eae6b8b4567.0000000000000237:head
2022-01-24T22:04:25.772+0100 7fdfe8961700 -1 log_channel(cluster) log [ERR] : 4.db missing primary copy of 4:db6f8075:::rbd_data.319eae6b8b4567.0000000000000237:head, will try copies on 2,10
2022-01-24T22:04:40.992+0100 7fe0045e9700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdfe8961700' had timed out after 15.000000954s
2022-01-24T22:04:40.992+0100 7fe003de8700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdfe8961700' had timed out after 15.000000954s
[... repeated lines with incrementing time...]
2022-01-24T22:05:04.468+0100 7fe003de8700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdfea164700' had timed out after 15.000000954s
2022-01-24T22:05:04.468+0100 7fe003de8700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fdfea965700' had timed out after 15.000000954s
2022-01-24T22:13:31.851+0100 7fa52b9b6f00  0 set uid:gid to 64045:64045 (ceph:ceph)
2022-01-24T22:13:31.851+0100 7fa52b9b6f00  0 ceph version 16.2.7 (f9aa029788115b5df5eeee328f584156565ee5b7) pacific (stable), process ceph-osd, pid 2290
2022-01-24T22:13:31.851+0100 7fa52b9b6f00  0 pidfile_write: ignore empty --pid-file
2022-01-24T22:13:31.851+0100 7fa52b9b6f00 -1 missing 'type' file, inferring filestore from current/ dir
2022-01-24T22:13:31.883+0100 7fa52b9b6f00  0 starting osd.6 osd_data /var/lib/ceph/osd/ceph-6 /var/lib/ceph/osd/ceph-6/journal

There was a CRC error at 22:04:25.
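If PG 4.db still shows up as inconsistent after a deep scrub, something along these lines should locate and repair the bad copy (PG id taken from the log above):

ceph pg deep-scrub 4.db
rados list-inconsistent-obj 4.db --format=json-pretty
ceph pg repair 4.db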

The node was rebooted by the watchdog (Proxmox cluster) at about 22:11:40.

The other 3 OSDs on the same node (2 filestore/XFS, 1 bluestore) don't report anything unusual.

We removed the giant log files and rebooted; everything is OK now, including that OSD.6.
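Since the disks themselves are also suspects, a SMART check of the OSD devices can't hurt; a sketch with a placeholder device name:

smartctl -a /dev/sdX          # error log and reallocated/pending sector counters
smartctl -t long /dev/sdX     # start a long offline self-test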

Can anyone make sense of this? We have already rebuilt bluestore OSD.15 on this node once before because of a (small) corruption issue in its DB.

# dpkg -l|grep ceph
ii  ceph                   16.2.7  amd64  distributed storage and file system
ii  ceph-base              16.2.7  amd64  common ceph daemon libraries and management tools
ii  ceph-common            16.2.7  amd64  common utilities to mount and interact with a ceph storage cluster
ii  ceph-fuse              16.2.7  amd64  FUSE-based client for the Ceph distributed file system
ii  ceph-mds               16.2.7  amd64  metadata server for the ceph distributed file system
ii  ceph-mgr               16.2.7  amd64  manager for the ceph distributed storage system
ii  ceph-mgr-modules-core  16.2.7  all    ceph manager modules which are always enabled
ii  ceph-mon               16.2.7  amd64  monitor server for the ceph storage system
ii  ceph-osd               16.2.7  amd64  OSD server for the ceph storage system
ii  libcephfs2             16.2.7  amd64  Ceph distributed file system client library
ii  python3-ceph-argparse  16.2.7  all    Python 3 utility libraries for Ceph CLI
ii  python3-ceph-common    16.2.7  all    Python 3 utility libraries for Ceph
ii  python3-cephfs         16.2.7  amd64  Python 3 libraries for the Ceph libcephfs library

Thanks



Eneko Lacunza
Technical Director
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



