Please use the correct list: ceph-users@xxxxxxx

Probably the same problem I had. Try reducing mon_sync_max_payload_size=4096 and starting a new MON. It should then take just a few seconds to boot up.
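In case it is useful, here is a rough sketch of the usual ways to apply that setting on a Nautilus cluster (the mon id below is only a placeholder; adjust to your deployment):

# set it in the centralized config database (works on Nautilus):
ceph config set mon mon_sync_max_payload_size 4096

# or add it to ceph.conf on the mon hosts:
[mon]
    mon_sync_max_payload_size = 4096

# verify what a running mon actually uses (run on that mon's host):
ceph daemon mon.<id> config get mon_sync_max_payload_size

If I remember correctly, the value matters on the mons serving the sync, so it should be in effect on the existing mons before the new mon is started.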
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder
Sent: 29 March 2021 16:58:19
To: by morphin; ceph-users@xxxxxxxx
Subject: Re: Cluster suspends when Add Mon or stop and start after a while.

Probably the same problem I had. Try reducing mon_sync_max_payload_size=4096 and starting a new MON. It should then take just a few seconds to boot up.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: by morphin <morphinwithyou@xxxxxxxxx>
Sent: 28 March 2021 11:25:02
To: ceph-users@xxxxxxxx
Subject: Cluster suspends when Add Mon or stop and start after a while.

Hello!

I have a cluster with a datacenter crushmap (A+B) (9+9 = 18 servers). The cluster started with v12.2.0 Luminous 4 years ago. Over the years I upgraded the cluster Luminous > Mimic > v14.2.16 Nautilus.

Now I have a weird issue. When I add a mon, or shut one down for a while and start it up again, the whole cluster suspends: ceph -s does not respond, and the other two monitors start an election while the booting mon is syncing (logs below).

2021-03-28 00:18:23.482 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:23.782 7fe2eee07700 -1 mon.SRV-SB-1@1(electing) e9 failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:24.292 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:26.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 get_health_metrics reporting 3919 slow ops, oldest is log(1 entries from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:29.782 7fe2f160c700 1 mon.SRV-SB-1@1(electing).elector(7899) init, last seen epoch 7899, mid-election, bumping
2021-03-28 00:18:29.812 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:31.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 get_health_metrics reporting 3951 slow ops, oldest is log(1 entries from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:31.872 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:32.072 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:32.482 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:33.282 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:34.812 7fe2f160c700 1 mon.SRV-SB-1@1(electing).elector(7901) init, last seen epoch 7901, mid-election, bumping
2021-03-28 00:18:34.842 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:34.872 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:35.072 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:35.492 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:36.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 get_health_metrics reporting 3989 slow ops, oldest is log(1 entries from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:36.292 7fe2f2e0f700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:39.842 7fe2f160c700 1 mon.SRV-SB-1@1(electing).elector(7903) init, last seen epoch 7903, mid-election, bumping
2021-03-28 00:18:39.872 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:40.872 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:41.082 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:41.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 get_health_metrics reporting 4027 slow ops, oldest is log(1 entries from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:41.492 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:41.812 7fe2eee07700 -1 mon.SRV-SB-1@1(electing) e9 failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:42.312 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:43.882 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:44.082 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:44.492 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:45.302 7fe2ede05700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:46.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 get_health_metrics reporting 4062 slow ops, oldest is log(1 entries from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:47.812 7fe2f160c700 1 mon.SRV-SB-1@1(electing).elector(7905) init, last seen epoch 7905, mid-election, bumping
2021-03-28 00:18:47.842 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:51.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 get_health_metrics reporting 4091 slow ops, oldest is log(1 entries from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:52.842 7fe2f160c700 1 mon.SRV-SB-1@1(electing).elector(7907) init, last seen epoch 7907, mid-election, bumping
2021-03-28 00:18:52.872 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:18:52.882 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:53.082 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:53.492 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:54.302 7fe2f3610700 1 mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:18:56.102 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 get_health_metrics reporting 4131 slow ops, oldest is log(1 entries from seq 2031 at 2021-03-28 00:08:41.094522)
2021-03-28 00:18:57.872 7fe2f160c700 1 mon.SRV-SB-1@1(electing).elector(7909) init, last seen epoch 7909, mid-election, bumping
2021-03-28 00:18:57.902 7fe2f160c700 -1 mon.SRV-SB-1@1(electing) e9 failed to get devid for : fallback method has serial ''but no model
2021-03-28 00:19:21.851 7f9127602700 1 mon.SRV-SB5@2(synchronizing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:19:25.061 7f912ce0d700 1 mon.SRV-SB5@2(synchronizing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:19:31.481 7f912c60c700 1 mon.SRV-SB5@2(synchronizing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:19:44.291 7f912c60c700 1 mon.SRV-SB5@2(synchronizing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:19:59.311 7f912c60c700 1 mon.SRV-SB5@2(synchronizing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:20:14.321 7f912c60c700 1 mon.SRV-SB5@2(synchronizing) e9 handle_auth_request failed to assign global_id
2021-03-28 00:20:54.481 7f9128604700 4 rocksdb: [db/db_impl_write.cc:1470] [default] New memtable created with log file: #1158. Immutable memtables: 0.
2021-03-28 00:20:54.481 7f912de0f700 4 rocksdb: (Original Log Time 2021/03/28-00:20:54.490450) [db/db_impl_compaction_flush.cc:2073] Calling FlushMemTableToOutputFile with column family [default], flush slots available 1, compaction slots available 1, flush slots scheduled 1, compaction slots scheduled 0
2021-03-28 00:20:54.481 7f912de0f700 4 rocksdb: [db/flush_job.cc:322] [default] [JOB 5] Flushing memtable with next log file: 1158
2021-03-28 00:20:54.481 7f912de0f700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880054490507, "job": 5, "event": "flush_started", "num_memtables": 1, "num_entries": 391495, "num_deletes": 0, "total_data_size": 26756127, "memory_usage": 32861536, "flush_reason": "Write Buffer Full"}
2021-03-28 00:20:54.481 7f912de0f700 4 rocksdb: [db/flush_job.cc:351] [default] [JOB 5] Level-0 flush table #1159: started
2021-03-28 00:20:54.611 7f912de0f700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880054615407, "cf_name": "default", "job": 5, "event": "table_file_creation", "file_number": 1159, "file_size": 14913671, "table_properties": {"data_size": 13733422, "index_size": 200663, "filter_size": 978757, "raw_key_size": 18143237, "raw_average_key_size": 46, "raw_value_size": 7829900, "raw_average_value_size": 20, "num_data_blocks": 3398, "num_entries": 391495, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}
2021-03-28 00:20:54.611 7f912de0f700 4 rocksdb: [db/flush_job.cc:392] [default] [JOB 5] Level-0 flush table #1159: 14913671 bytes OK
2021-03-28 00:20:54.611 7f912de0f700 4 rocksdb: (Original Log Time 2021/03/28-00:20:54.615451) [db/memtable_list.cc:380] [default] Level-0 commit table #1159 started
2021-03-28 00:20:54.611 7f912de0f700 4 rocksdb: (Original Log Time 2021/03/28-00:20:54.615892) [db/memtable_list.cc:428] [default] Level-0 commit table #1159: memtable #1 done
2021-03-28 00:20:54.611 7f912de0f700 4 rocksdb: (Original Log Time 2021/03/28-00:20:54.615900) EVENT_LOG_v1 {"time_micros": 1616880054615897, "job": 5, "event": "flush_finished", "output_compression": "NoCompression", "lsm_state": [4, 0, 0, 0, 0, 0, 2], "immutable_memtables": 0}
2021-03-28 00:20:54.611 7f912de0f700 4 rocksdb: (Original Log Time 2021/03/28-00:20:54.615925) [db/db_impl_compaction_flush.cc:201] [default] Level summary: base level 6 level multiplier 10.00 max bytes base 268435456 files[4 0 0 0 0 0 2] max score 1.00
2021-03-28 00:20:54.611 7f912de0f700 4 rocksdb: [db/db_impl_files.cc:356] [JOB 5] Try to delete WAL files size 24016022, prev total WAL file size 24798914, number of live WAL files 2.
2021-03-28 00:20:54.611 7f912e610700 4 rocksdb: [db/compaction_job.cc:1645] [default] [JOB 6] Compacting 4@0 + 2@6 files to L6, score 1.00
2021-03-28 00:20:54.611 7f912e610700 4 rocksdb: [db/compaction_job.cc:1649] [default] Compaction start summary: Base version 5 Base level 0, inputs: [1159(14MB) 1155(25MB) 1149(591KB) 1157(24MB)], [1146(68MB) 1147(53MB)]
2021-03-28 00:20:54.611 7f912e610700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880054616197, "job": 6, "event": "compaction_started", "compaction_reason": "LevelL0FilesNum", "files_L0": [1159, 1155, 1149, 1157], "files_L6": [1146, 1147], "score": 1, "input_data_size": 195345305}
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: [db/compaction_job.cc:1332] [default] [JOB 6] Generated table #1160: 547031 keys, 40744575 bytes
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880055675107, "cf_name": "default", "job": 6, "event": "table_file_creation", "file_number": 1160, "file_size": 40744575, "table_properties": {"data_size": 39048586, "index_size": 327543, "filter_size": 1367621, "raw_key_size": 25391768, "raw_average_key_size": 46, "raw_value_size": 30761471, "raw_average_value_size": 56, "num_data_blocks": 6004, "num_entries": 547031, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: [db/compaction_job.cc:1395] [default] [JOB 6] Compacted 4@0 + 2@6 files to L6 => 40744575 bytes
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: (Original Log Time 2021/03/28-00:20:55.676525) [db/compaction_job.cc:768] [default] compacted to: base level 6 level multiplier 10.00 max bytes base 268435456 files[0 0 0 0 0 0 1] max score 0.00, MB/sec: 184.5 rd, 38.5 wr, level 6, files in(4, 2) out(1) MB in(64.7, 121.6) out(38.9), read-write-amplify(3.5) write-amplify(0.6) OK, records in: 3571913, records dropped: 3024882 output_compression: NoCompression
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: (Original Log Time 2021/03/28-00:20:55.676541) EVENT_LOG_v1 {"time_micros": 1616880055676534, "job": 6, "event": "compaction_finished", "compaction_time_micros": 1058962, "compaction_time_cpu_micros": 1047128, "output_level": 6, "num_output_files": 1, "total_output_size": 40744575, "num_input_records": 3571913, "num_output_records": 547031, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 0, 0, 0, 0, 0, 1]}
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880055677454, "job": 6, "event": "table_file_deletion", "file_number": 1159}
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880055678431, "job": 6, "event": "table_file_deletion", "file_number": 1157}
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880055679410, "job": 6, "event": "table_file_deletion", "file_number": 1155}
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880055679506, "job": 6, "event": "table_file_deletion", "file_number": 1149}
2021-03-28 00:20:55.671 7f912e610700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880055681388, "job": 6, "event": "table_file_deletion", "file_number": 1147}
2021-03-28 00:20:55.681 7f912e610700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1616880055684258, "job": 6, "event": "table_file_deletion", "file_number": 1146}

As you can see, the leader mon switches to election mode and keeps producing these messages until the sync is finished:
mon.SRV-SB-1@1(electing) e9 failed to get devid for : fallback method has serial ''but no model
mon.SRV-SB-1@1(electing) e9 handle_auth_request failed to assign global_id

The booting mon only gives this:

mon.SRV-SB5@2(synchronizing) e9 handle_auth_request failed to assign global_id

The booting mon prepares its *.sst tables, and while watching the /var/lib/ceph/mon/*/*/*.log file I saw entries like these:

!osd_snapremoved_epoch_1_0013c460, !osd_snapremoved_epoch_1_0013c462

I guess after all these upgrades Ceph missed some variable or something else. The problem is huge because I cannot bootstrap a mon without stopping the entire cluster. It takes 30 minutes to boot up... I tried removing and adding new mons, but nothing changed.

I have a backup service that takes snapshots, backs them up to block storage, and deletes the snapshots. It has been working for 3 years now, and the snapremoved entries in the mon log under "/var/lib/ceph/mon/*/*/*.log" make me wonder: could all these snapshots be a huge bottleneck for the mon bootup?

What should I do?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx