Update, posting information from other posts before:

[08:09:40] server3.place6:~# ceph config-key dump | grep config/
    "config/global/auth_client_required": "cephx",
    "config/global/auth_cluster_required": "cephx",
    "config/global/auth_service_required": "cephx",
    "config/global/cluster_network": "2a0a:e5c0:2:1::/64",
    "config/global/ms_bind_ipv4": "false",
    "config/global/ms_bind_ipv6": "true",
    "config/global/osd_class_update_on_start": "false",
    "config/global/osd_pool_default_size": "3",
    "config/global/public_network": "2a0a:e5c0:2:1::/64",
    "config/mgr/mgr/balancer/active": "1",
    "config/mgr/mgr/balancer/max_misplaced": ".01",
    "config/mgr/mgr/balancer/mode": "upmap",
    "config/mgr/mgr/prometheus/rbd_stats_pools": "hdd,ssd,xruk-ssd-pool",
    "config/osd/osd_max_backfills": "1",
    "config/osd/osd_recovery_max_active": "1",
    "config/osd/osd_recovery_op_priority": "1",
[08:09:56] server3.place6:~#

To some degree we are finding osds with huge log sizes that usually contain lines like

2021-04-13 08:31:25.388 7fa4896a4700 0 --1- [2a0a:e5c0:2:1:21b:21ff:febb:68d8]:0/19492 >> v1:[2a0a:e5c0:2:1:21b:21ff:fe85:a3a2]:6881/29606 conn(0x55a04deadf80 0x55a0273ad000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
2021-04-13 08:31:25.388 7fa48a6b7700 0 --1- [2a0a:e5c0:2:1:21b:21ff:febb:68d8]:0/19492 >> v1:[2a0a:e5c0:2:1:21b:21ff:fe85:a3a2]:6879/29606 conn(0x55a04eb16d80 0x55a02742c000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER

(with varying osd IPs). Restarting them seems to improve the situation for a bit, but even newly restarted osds go back into the BADAUTHORIZER state.

Nico Schottelius <nico.schottelius@xxxxxxxxxxx> writes:

> Good morning,
>
> I've looked somewhat intensively through the list and it seems we are
> rather hard hit by this. It originally started yesterday on a mixed
> 14.2.9 and 14.2.16 cluster (osds mixed, mons all 14.2.16).
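(For reference, this is roughly how we spot the osds with oversized logs; a minimal sketch, assuming the default non-containerized /var/log/ceph log location:)

```shell
# Rank osd logs by size; the osds currently spamming BADAUTHORIZER
# float to the bottom of the listing. Path is an assumption -- adjust
# for containerized or non-default log locations.
du -sh /var/log/ceph/ceph-osd.*.log 2>/dev/null | sort -h
```

sort -h orders the human-readable sizes, so the last lines name the osds worth looking at (and restarting) first.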
>
> We started phasing in 7 new osds, 6 of them throttled by reweighting to
> 0.1. Symptoms are many unknown PGs, PGs stuck in peering for a very long
> time (hours), slow activating, and the infamous BADAUTHORIZER message.
> Client I/O is almost 0, not only on the pool with the new OSDs, but also
> on other pools.
>
> We tried restarting all OSDs one by one, which seemed to clear the
> unknown PGs; however, after around 1h they came back to the unknown state.
>
> Tuning --osd-max-backfills=... and --osd-recovery-max-active=... down to
> 1 does not improve the situation, so we are currently running at =7 for
> both of them.
>
> An excerpt from one of the osds which heavily logs this issue is below.
>
> Any pointer on how this problem was solved on Nautilus before is much
> appreciated, as the issue started late last night.
>
> Best regards,
>
> Nico
>
>
> 2021-04-13 07:38:37.275 7f159a469700 4 rocksdb: (Original Log Time 2021/04/13-07:38:37.275255) EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 5, "event": "compaction_finished", "compaction_time_micros": 5288403, "compaction_time_cpu_micros": 2940370, "output_level": 2, "num_output_files": 6, "total_output_size": 370673432, "num_input_records": 983055, "num_output_records": 641964, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 5, 35, 0, 0, 0, 0]}
> 2021-04-13 07:38:37.275 7f159a469700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 421257}
> 2021-04-13 07:38:37.275 7f159a469700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 401289}
> 2021-04-13 07:38:37.275 7f159a469700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 401288}
> 2021-04-13 07:38:37.275 7f159a469700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 401278}
> 2021-04-13 07:38:37.275 7f159a469700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 401277}
> 2021-04-13 07:38:37.275 7f159a469700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 401258}
> 2021-04-13 07:38:37.275 7f159a469700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 401257}
> 2021-04-13 07:38:37.275 7f159a469700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 401256}
> 2021-04-13 07:38:37.275 7f159a469700 4 rocksdb: [db/compaction_job.cc:1645] [default] [JOB 6] Compacting 1@1 + 5@2 files to L2, score 1.17
> 2021-04-13 07:38:37.275 7f159a469700 4 rocksdb: [db/compaction_job.cc:1649] [default] Compaction start summary: Base version 5 Base level 1, inputs: [421255(65MB)], [401318(65MB) 401319(65MB) 401320(65MB) 401321(65MB) 401322(65MB)]
>
> 2021-04-13 07:38:37.275 7f159a469700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 6, "event": "compaction_started", "compaction_reason": "LevelMaxLevelSize", "files_L1": [421255], "files_L2": [401318, 401319, 401320, 401321, 401322], "score": 1.17224, "input_data_size": 413630461}
> 2021-04-13 07:38:37.307 7f15a0704700 0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6862/8751 conn(0x55d520fbcd80 0x55d520ffa000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> 2021-04-13 07:38:37.315 7f15a372c700 0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:5060]:6835/13651 conn(0x55d521b1bf80 0x55d521b19800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> 2021-04-13 07:38:37.707 7f15a0704700 0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6862/8751 conn(0x55d520fbcd80 0x55d520ffa000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> 2021-04-13 07:38:37.715 7f15a372c700 0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:5060]:6835/13651 conn(0x55d521b1bf80 0x55d521b19800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> 2021-04-13 07:38:38.119 7f159a469700 4 rocksdb: [db/compaction_job.cc:1332] [default] [JOB 6] Generated table #421271: 130853 keys, 68923910 bytes
> 2021-04-13 07:38:38.119 7f159a469700 4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292318119320, "cf_name": "default", "job": 6, "event": "table_file_creation", "file_number": 421271, "file_size": 68923910, "table_properties": {"data_size": 67111841, "index_size": 1483972, "filter_size": 327237, "raw_key_size": 10909111, "raw_average_key_size": 83, "raw_value_size": 62457370, "raw_average_value_size": 477, "num_data_blocks": 16807, "num_entries": 130853, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}
> 2021-04-13 07:38:38.331 7f15a0704700 0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febb:68f0]:6884/29245 conn(0x55d521b1a000 0x55d521b16800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> 2021-04-13 07:38:38.331 7f15a0704700 0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febb:68f0]:6819/17174 conn(0x55d521b1c400 0x55d521b38000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> 2021-04-13 07:38:38.507 7f15a0704700 0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6862/8751 conn(0x55d520fbcd80 0x55d520ffa000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> 2021-04-13 07:38:38.515 7f15a372c700 0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:5060]:6835/13651 conn(0x55d521b1bf80 0x55d521b19800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> 2021-04-13 07:38:38.731 7f15a0704700 0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febb:68f0]:6884/29245 conn(0x55d521b1a000 0x55d521b16800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> 2021-04-13 07:38:38.731 7f15a0704700 0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febb:68f0]:6819/17174 conn(0x55d521b1c400 0x55d521b38000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER

--
Sustainable and modern Infrastructures by ungleich.ch
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx