Please use the correct list: ceph-users@xxxxxxx

Probably the same problem I had. Try reducing mon_sync_max_payload_size=4096 and starting a new MON. It should then take just a few seconds to boot up.
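In case it is useful, here is a rough sketch of the usual ways to apply that setting on a Nautilus cluster (the mon id below is only a placeholder; adjust to your deployment):

# set it in the centralized config database (works on Nautilus):
ceph config set mon mon_sync_max_payload_size 4096

# or add it to ceph.conf on the mon hosts:
[mon]
    mon_sync_max_payload_size = 4096

# verify what a running mon actually uses (run on that mon's host):
ceph daemon mon.<id> config get mon_sync_max_payload_size

If I remember correctly, the value matters on the mons serving the sync, so it should be in effect on the existing mons before the new mon is started.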
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder
Sent: 29 March 2021 16:58:19
To: by morphin; ceph-users@xxxxxxxx
Subject: Re: Cluster suspends when Add Mon or stop and start after a while.

Probably the same problem I had. Try reducing mon_sync_max_payload_size=4096 and starting a new MON. It should then take just a few seconds to boot up.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: by morphin <morphinwithyou@xxxxxxxxx>
Sent: 28 March 2021 11:25:02
To: ceph-users@xxxxxxxx
Subject: Cluster suspends when Add Mon or stop and start after a while.

Hello!

I have a cluster with a datacenter crushmap (A+B) (9+9 = 18 servers). The cluster started with v12.2.0 Luminous 4 years ago. Over the years I upgraded the cluster Luminous > Mimic > v14.2.16 Nautilus.

Now I have a weird issue. When I add a mon, or shut one down for a while and start it up again, the whole cluster suspends: ceph -s does not respond, and the other two monitors start an election while the booting mon is syncing (logs below).

2021-03-28 00:18:23.482 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:23.782 7fe2eee07700 -1 mon.SRV-SB-1@1(electing) e9 failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:24.292 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:26.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 get_health_metrics reporting 3919 slow ops, oldest is log(1 entries from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:29.782 7fe2f160c700 1 mon.SRV-SB-1@1(electing).elector(7899) init, last seen epoch 7899, mid-election, bumping
2021-03-28 00:18:29.812 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:31.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 get_health_metrics reporting 3951 slow ops, oldest is log(1 entries from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:31.872 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:32.072 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:32.482 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:33.282 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:34.812 7fe2f160c700 1 mon.SRV-SB-1@1(electing).elector(7901) init, last seen epoch 7901, mid-election, bumping
2021-03-28 00:18:34.842 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:34.872 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:35.072 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:35.492 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:36.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 get_health_metrics reporting 3989 slow ops, oldest is log(1 entries from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:36.292 7fe2f2e0f700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:39.842 7fe2f160c700 1 mon.SRV-SB-1@1(electing).elector(7903) init, last seen epoch 7903, mid-election, bumping
2021-03-28 00:18:39.872 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:40.872 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:41.082 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:41.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 get_health_metrics reporting 4027 slow ops, oldest is log(1 entries from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:41.492 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:41.812 7fe2eee07700 -1 mon.SRV-SB-1@1(electing) e9 failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:42.312 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:43.882 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:44.082 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:44.492 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:45.302 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:46.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 get_health_metrics reporting 4062 slow ops, oldest is log(1 entries from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:47.812 7fe2f160c700 1 mon.SRV-SB-1@1(electing).elector(7905) init, last seen epoch 7905, mid-election, bumping
2021-03-28 00:18:47.842 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:51.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 get_health_metrics reporting 4091 slow ops, oldest is log(1 entries from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:52.842 7fe2f160c700 1 mon.SRV-SB-1@1(electing).elector(7907) init, last seen epoch 7907, mid-election, bumping
2021-03-28 00:18:52.872 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:52.882 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:53.082 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:53.492 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:54.302 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:56.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 get_health_metrics reporting 4131 slow ops, oldest is log(1 entries from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:57.872 7fe2f160c700 1 mon.SRV-SB-1@1(electing).elector(7909) init, last seen epoch 7909, mid-election, bumping
2021-03-28 00:18:57.902 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:19:21.851 7f9127602700 1 mon.SRV-SB5@2(synchronizing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:19:25.061 7f912ce0d700 1 mon.SRV-SB5@2(synchronizing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:19:31.481 7f912c60c700 1 mon.SRV-SB5@2(synchronizing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:19:44.291 7f912c60c700 1 mon.SRV-SB5@2(synchronizing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:19:59.311 7f912c60c700 1 mon.SRV-SB5@2(synchronizing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:20:14.321 7f912c60c700 1 mon.SRV-SB5@2(synchronizing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:20:54.481 7f9128604700 4 rocksdb: [db/db_impl_write.cc:1470] [default] New memtable created with log file: #1158. Immutable memtables: 0.
2021-03-28 00:20:54.481 7f912de0f700 4 rocksdb: (Original Log Time 2021/03/28-00:20:54.490450) [db/db_impl_compaction_flush.cc:2073] Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots available 1, flush slots scheduled 1, compaction slots scheduled 0
2021-03-28 00:20:54.481 7f912de0f700 4 rocksdb: [db/flush_job.cc:322] [default] [JOB 5] Flushing memtable with next log file: 1158
2021-03-28 00:20:54.481 7f912de0f700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880054490507, "job": 5, "event": "flush_started", "num_memtables": 1, "num_entries": 391495, "num_deletes": 0, "total_data_size": 26756127, "memory_usage": 32861536, "flush_reason": "Write Buffer Full"}
2021-03-28 00:20:54.481 7f912de0f700 4 rocksdb: [db/flush_job.cc:351] [default] [JOB 5] Level-0 flush table #1159: started
2021-03-28 00:20:54.611 7f912de0f700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880054615407, "cf_name": "default", "job": 5, "event": "table_file_creation", "file_number": 1159, "file_size": 14913671, "table_properties": {"data_size": 13733422, "index_size": 200663, "filter_size": 978757, "raw_key_size": 18143237, "raw_average_key_size": 46, "raw_value_size": 7829900, "raw_average_value_size": 20, "num_data_blocks": 3398, "num_entries": 391495, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}
2021-03-28 00:20:54.611 7f912de0f700 4 rocksdb: [db/flush_job.cc:392] [default] [JOB 5] Level-0 flush table #1159: 14913671 bytes OK
2021-03-28 00:20:54.611 7f912de0f700 4 rocksdb: (Original Log Time 2021/03/28-00:20:54.615451) [db/memtable_list.cc:380] [default] Level-0 commit table #1159 started
2021-03-28 00:20:54.611 7f912de0f700 4 rocksdb: (Original Log Time 2021/03/28-00:20:54.615892) [db/memtable_list.cc:428] [default] Level-0 commit table #1159: memtable #1 done
2021-03-28 00:20:54.611 7f912de0f700 4 rocksdb: (Original Log Time 2021/03/28-00:20:54.615900) EVENT_LOG_v1 {"time_micros": 1616880054615897, "job": 5, "event": "flush_finished", "output_compression": "NoCompression", "lsm_state": [4, 0, 0, 0, 0, 0, 2], "immutable_memtables": 0}
2021-03-28 00:20:54.611 7f912de0f700 4 rocksdb: (Original Log Time 2021/03/28-00:20:54.615925) [db/db_impl_compaction_flush.cc:201] [default] Level summary: base level 6 level multiplier 10.00 max bytes base 268435456 files[4 0 0 0 0 0 2] max score 1.00
2021-03-28 00:20:54.611 7f912de0f700 4 rocksdb: [db/db_impl_files.cc:356] [JOB 5] Try to delete WAL files size 24016022, prev total WAL file size 24798914, number of live WAL files 2.
2021-03-28 00:20:54.611 7f912e610700 4 rocksdb: [db/compaction_job.cc:1645] [default] [JOB 6] Compacting 4@0 + 2@6 files to L6, score 1.00
2021-03-28 00:20:54.611 7f912e610700 4 rocksdb: [db/compaction_job.cc:1649] [default] Compaction start summary: Base version 5 Base level 0, inputs: [1159(14MB) 1155(25MB) 1149(591KB) 1157(24MB)], [1146(68MB) 1147(53MB)]
2021-03-28 00:20:54.611 7f912e610700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880054616197, "job": 6, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum", "files_L0": [1159, 1155, 1149, 1157], "files_L6": [1146, 1147], "score": 1, "input_data_size": 195345305}
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: [db/compaction_job.cc:1332] [default] [JOB 6] Generated table #1160: 547031 keys, 40744575 bytes
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880055675107, "cf_name": "default", "job": 6, "event": "table_file_creation", "file_number": 1160, "file_size": 40744575, "table_properties": {"data_size": 39048586, "index_size": 327543, "filter_size": 1367621, "raw_key_size": 25391768, "raw_average_key_size": 46, "raw_value_size": 30761471, "raw_average_value_size": 56, "num_data_blocks": 6004, "num_entries": 547031, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: [db/compaction_job.cc:1395] [default] [JOB 6] Compacted 4@0 + 2@6 files to L6 => 40744575 bytes
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: (Original Log Time 2021/03/28-00:20:55.676525) [db/compaction_job.cc:768] [default] compacted to: base level 6 level multiplier 10.00 max bytes base 268435456 files[0 0 0 0 0 0 1] max score 0.00, MB/sec: 184.5 rd, 38.5 wr, level 6, files in(4, 2) out(1) MB in(64.7, 121.6) out(38.9), read-write-amplify(3.5) write-amplify(0.6) OK, records in: 3571913, records dropped: 3024882 output_compression: NoCompression
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: (Original Log Time 2021/03/28-00:20:55.676541) EVENT_LOG_v1 {"time_micros": 1616880055676534, "job": 6, "event": "compaction_finished", "compaction_time_micros": 1058962, "compaction_time_cpu_micros": 1047128, "output_level": 6, "num_output_files": 1, "total_output_size": 40744575, "num_input_records": 3571913, "num_output_records": 547031, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 0, 0, 0, 0, 0, 1]}
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880055677454, "job": 6, "event": "table_file_deletion", "file_number": 1159}
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880055678431, "job": 6, "event": "table_file_deletion", "file_number": 1157}
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880055679410, "job": 6, "event": "table_file_deletion", "file_number": 1155}
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880055679506, "job": 6, "event": "table_file_deletion", "file_number": 1149}
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880055681388, "job": 6, "event": "table_file_deletion", "file_number": 1147}
2021-03-28 00:20:55.681 7f912e610700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880055684258, "job": 6, "event": "table_file_deletion", "file_number": 1146}

As you can see, the leader mon switches to election mode and keeps producing these messages until the sync is finished:
mon.SRV-SB-1@1(electing) e9 failed to get devid for : fallback method has serial ''but no model
mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id

The booting mon only gives this:

mon.SRV-SB5@2(synchronizing) e9 handle_auth_request failed to assign global_id

The booting mon prepares its *.sst tables, and while watching the /var/lib/ceph/mon/*/*/*.log file I saw entries like these:

!osd_snapremoved_epoch_1_0013c460, !osd_snapremoved_epoch_1_0013c462

I guess after all these upgrades Ceph missed some variable or something else. The problem is huge because I cannot bootstrap a mon without stopping the entire cluster. It takes 30 minutes to boot up... I tried removing and adding new mons, but nothing changed.

I have a backup service that takes snapshots, backs them up to block storage, and deletes the snapshots. It has been working for 3 years now, and the snapremoved entries in the mon log under "/var/lib/ceph/mon/*/*/*.log" make me wonder: could all these snapshots be a huge bottleneck for the mon bootup?

What should I do?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx