So, I just checked the logs on one of our smaller clusters and it looks like this error happened twice last week. The cluster contains 12x 8 TB OSDs without any SSDs as cache, and it was deployed with Octopus from the start (so no upgrade from Nautilus was performed):

root@3cecef08a104:~# zgrep -i marked /var/log/ceph/ceph*
/var/log/ceph/ceph.log.1.gz:2022-04-25T07:05:38.767452+0000 mon.3cecef5afb05 (mon.0) 390021 : cluster [INF] osd.6 marked itself dead as of e4011
/var/log/ceph/ceph.log.1.gz:2022-04-25T07:05:38.764813+0000 osd.6 (osd.6) 3312 : cluster [WRN] Monitor daemon marked osd.6 down, but it is still running
/var/log/ceph/ceph.log.1.gz:2022-04-25T07:05:38.764827+0000 osd.6 (osd.6) 3313 : cluster [DBG] map e4011 wrongly marked me down at e4010
/var/log/ceph/ceph.log.2.gz:2022-04-24T06:54:53.726084+0000 osd.6 (osd.6) 3227 : cluster [WRN] Monitor daemon marked osd.6 down, but it is still running
/var/log/ceph/ceph.log.2.gz:2022-04-24T06:54:53.726098+0000 osd.6 (osd.6) 3228 : cluster [DBG] map e3995 wrongly marked me down at e3994
/var/log/ceph/ceph.log.2.gz:2022-04-24T06:54:53.729151+0000 mon.3cecef5afb05 (mon.0) 382918 : cluster [INF] osd.6 marked itself dead as of e3995
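To see whether each of these events lines up with a RocksDB compaction, something like the following should pull the relevant lines out of the OSD's own log (assuming the default log location for osd.6; adjust the path if yours differs):

# Correlate compaction starts, heartbeat timeouts and the "wrongly marked
# me down" events in the OSD log (works on both rotated .gz and plain files)
zgrep -E 'compaction_started|had timed out|wrongly marked me down' /var/log/ceph/ceph-osd.6.log*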
Checking the log of said OSD shows that this happened right after a compaction began:

2022-04-24T06:54:24.341+0000 7f7c6e152700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1650783264345276, "job": 1609, "event": "table_file_deletion", "file_number": 10916}
2022-04-24T06:54:24.341+0000 7f7c6e152700 4 rocksdb: [db/compaction_job.cc:1642] [default] [JOB 1610] Compacting 1@1 + 4@2 files to L2, score 1.76
2022-04-24T06:54:24.341+0000 7f7c6e152700 4 rocksdb: [db/compaction_job.cc:1648] [default] Compaction start summary: Base version 1609 Base level 1, inputs: [10975(43MB)], [10930(67MB) 10931(672KB) 10862(51MB) 10843(67MB)]
2022-04-24T06:54:24.341+0000 7f7c6e152700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1650783264345617, "job": 1610, "event": "compaction_started", "compaction_reason": "LevelMaxLevelSize", "files_L1": [10975], "files_L2": [10930, 10931, 10862, 10843], "score": 1.76282, "input_data_size": 241028245}
2022-04-24T06:54:25.609+0000 7f7c6e152700 4 rocksdb: [db/compaction_job.cc:1327] [default] [JOB 1610] Generated table #10986: 226423 keys, 70464004 bytes
2022-04-24T06:54:25.609+0000 7f7c6e152700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1650783265611705, "cf_name": "default", "job": 1610, "event": "table_file_creation", "file_number": 10986, "file_size": 70464004, "table_properties": {"data_size": 67109810, "index_size": 2787246, "filter_size": 566085, "raw_key_size": 42830142, "raw_average_key_size": 189, "raw_value_size": 55015026, "raw_average_value_size": 242, "num_data_blocks": 16577, "num_entries": 226423, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}
2022-04-24T06:54:27.197+0000 7f7c6e152700 4 rocksdb: [db/compaction_job.cc:1327] [default] [JOB 1610] Generated table #10987: 233189 keys, 70525661 bytes
2022-04-24T06:54:27.197+0000 7f7c6e152700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1650783267203149, "cf_name": "default", "job": 1610, "event": "table_file_creation", "file_number": 10987, "file_size": 70525661, "table_properties": {"data_size": 67113431, "index_size": 2828386, "filter_size": 582981, "raw_key_size": 44106906, "raw_average_key_size": 189, "raw_value_size": 54869394, "raw_average_value_size": 235, "num_data_blocks": 16569, "num_entries": 233189, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}
2022-04-24T06:54:28.597+0000 7f7c6e152700 4 rocksdb: [db/compaction_job.cc:1327] [default] [JOB 1610] Generated table #10988: 228113 keys, 70497098 bytes
2022-04-24T06:54:28.597+0000 7f7c6e152700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1650783268603183, "cf_name": "default", "job": 1610, "event": "table_file_creation", "file_number": 10988, "file_size": 70497098, "table_properties": {"data_size": 67111373, "index_size": 2814553, "filter_size": 570309, "raw_key_size": 43137333, "raw_average_key_size": 189, "raw_value_size": 54984875, "raw_average_value_size": 241, "num_data_blocks": 16584, "num_entries": 228113, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}
2022-04-24T06:54:28.689+0000 7f7c76c00700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7c5d130700' had timed out after 15
2022-04-24T06:54:28.689+0000 7f7c75bfe700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f7c5d130700' had timed out after 15
....
2022-04-24T06:54:53.521+0000 7f7c5d130700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f7c5d130700' had timed out after 15
2022-04-24T06:54:53.721+0000 7f7c66943700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.6 down, but it is still running
2022-04-24T06:54:53.721+0000 7f7c66943700 0 log_channel(cluster) log [DBG] : map e3995 wrongly marked me down at e3994

The cluster itself still has plenty of free space and doesn't have a huge IO load.

So what next? How can I get more debug output to solve this issue?
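Unless there is a better approach, I assume raising the log levels on the affected OSD at runtime would be the way to catch the next occurrence (standard ceph config commands; debug_osd, debug_bluestore and debug_rocksdb seemed like the relevant subsystems to me, corrections welcome):

# Raise verbosity on the affected OSD only; the extra output is large,
# so revert once the next occurrence has been captured.
ceph config set osd.6 debug_osd 10/10
ceph config set osd.6 debug_bluestore 10/10
ceph config set osd.6 debug_rocksdb 5/5

# ... wait for the next "wrongly marked me down" event ...

# Drop the overrides again:
ceph config rm osd.6 debug_osd
ceph config rm osd.6 debug_bluestore
ceph config rm osd.6 debug_rocksdb

Would that be enough, or are there other knobs worth looking at for a compaction stalling the osd_op_tp threads like this?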