Re: BADAUTHORIZER in Nautilus, unknown PGs, slow peering, very slow client I/O

Update, consolidating information from my earlier posts:

[08:09:40] server3.place6:~# ceph config-key dump | grep config/
    "config/global/auth_client_required": "cephx",
    "config/global/auth_cluster_required": "cephx",
    "config/global/auth_service_required": "cephx",
    "config/global/cluster_network": "2a0a:e5c0:2:1::/64",
    "config/global/ms_bind_ipv4": "false",
    "config/global/ms_bind_ipv6": "true",
    "config/global/osd_class_update_on_start": "false",
    "config/global/osd_pool_default_size": "3",
    "config/global/public_network": "2a0a:e5c0:2:1::/64",
    "config/mgr/mgr/balancer/active": "1",
    "config/mgr/mgr/balancer/max_misplaced": ".01",
    "config/mgr/mgr/balancer/mode": "upmap",
    "config/mgr/mgr/prometheus/rbd_stats_pools": "hdd,ssd,xruk-ssd-pool",
    "config/osd/osd_max_backfills": "1",
    "config/osd/osd_recovery_max_active": "1",
    "config/osd/osd_recovery_op_priority": "1",
[08:09:56] server3.place6:~#
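For completeness, a quick way to cross-check that the cephx settings above are what an individual daemon actually resolves, and that the key the mons hold matches the keyring the OSD uses on disk (osd.0 and the keyring path are placeholders for whichever daemon you are inspecting):

```
# What does the mon config database resolve for this daemon?
ceph config get osd.0 auth_cluster_required
# Compare the mon-side key with the keyring the OSD reads at startup:
ceph auth get osd.0
cat /var/lib/ceph/osd/ceph-0/keyring
```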

To some degree we can find OSDs with huge log sizes that usually contain messages like

2021-04-13 08:31:25.388 7fa4896a4700  0 --1- [2a0a:e5c0:2:1:21b:21ff:febb:68d8]:0/19492 >> v1:[2a0a:e5c0:2:1:21b:21ff:fe85:a3a2]:6881/29606 conn(0x55a04deadf80 0x55a0273ad000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
2021-04-13 08:31:25.388 7fa48a6b7700  0 --1- [2a0a:e5c0:2:1:21b:21ff:febb:68d8]:0/19492 >> v1:[2a0a:e5c0:2:1:21b:21ff:fe85:a3a2]:6879/29606 conn(0x55a04eb16d80 0x55a02742c000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER

(with varying OSD IPs). Restarting them seems to improve the
situation for a bit, but even freshly restarted OSDs fall back into the
BADAUTHORIZER state.
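To see which daemons are affected the most, a small hypothetical helper along these lines can rank the logs; it assumes the default Nautilus log location, so override LOGDIR if your paths differ:

```shell
# Rank OSD logs by the number of BADAUTHORIZER lines, noisiest first.
LOGDIR=${LOGDIR:-/var/log/ceph}
grep -c BADAUTHORIZER "$LOGDIR"/ceph-osd.*.log 2>/dev/null | sort -t: -k2 -nr | head
```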

Nico Schottelius <nico.schottelius@xxxxxxxxxxx> writes:

> Good morning,
>
> I've looked somewhat intensively through the list and it seems we are
> rather hard hit by this. The problem started yesterday on a cluster
> with mixed 14.2.9 and 14.2.16 OSDs (the mons were all 14.2.16).
>
> We started phasing in 7 new OSDs, 6 of them throttled by reweighting
> them to 0.1.
> The symptoms are many unknown PGs, PGs stuck in peering for hours, slow
> activating, and the infamous BADAUTHORIZER message. Client I/O is almost
> zero, not only for the pool with the new OSDs, but also for other pools.
>
> We tried restarting all OSDs one by one, which seemed to clear the
> unknown PGs; however, after around 1h they came back to the unknown state.
>
> Setting --osd-max-backfills and --osd-recovery-max-active to 1 does
> not improve the situation, so we are currently running with both set
> to 7.
>
> An excerpt from one of the OSDs that heavily logs this issue is below.
>
> Any pointer to how this problem was solved on Nautilus before is much
> appreciated, as the issue started late last night.
>
> Best regards,
>
> Nico
>
>
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: (Original Log Time 2021/04/13-07:38:37.275255) EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 5, "event": "compaction_finished", "compaction_time_micros": 5288403, "compaction_time_cpu_micros": 2940370, "output_level": 2, "num_output_files": 6, "total_output_size": 370673432, "num_input_records": 983055, "num_output_records": 641964, "num_subcompactions": 1, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 5, 35, 0, 0, 0, 0]}
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 421257}
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 401289}
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 401288}
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 401278}
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 401277}
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 401258}
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 401257}
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 401256}
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: [db/compaction_job.cc:1645] [default] [JOB 6] Compacting 1@1 + 5@2 files to L2, score 1.17
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: [db/compaction_job.cc:1649] [default] Compaction start summary: Base version 5 Base level 1, inputs: [421255(65MB)], [401318(65MB) 401319(65MB) 401320(65MB) 401321(65MB) 401322(65MB)]
>
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292317275255, "job": 6, "event": "compaction_started", "compaction_reason": "LevelMaxLevelSize", "files_L1": [421255], "files_L2": [401318, 401319, 401320, 401321, 401322], "score": 1.17224, "input_data_size": 413630461}
> 2021-04-13 07:38:37.307 7f15a0704700  0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6862/8751 conn(0x55d520fbcd80 0x55d520ffa000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> 2021-04-13 07:38:37.315 7f15a372c700  0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:5060]:6835/13651 conn(0x55d521b1bf80 0x55d521b19800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> 2021-04-13 07:38:37.707 7f15a0704700  0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6862/8751 conn(0x55d520fbcd80 0x55d520ffa000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> 2021-04-13 07:38:37.715 7f15a372c700  0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:5060]:6835/13651 conn(0x55d521b1bf80 0x55d521b19800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> 2021-04-13 07:38:38.119 7f159a469700  4 rocksdb: [db/compaction_job.cc:1332] [default] [JOB 6] Generated table #421271: 130853 keys, 68923910 bytes
> 2021-04-13 07:38:38.119 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1618292318119320, "cf_name": "default", "job": 6, "event": "table_file_creation", "file_number": 421271, "file_size": 68923910, "table_properties": {"data_size": 67111841, "index_size": 1483972, "filter_size": 327237, "raw_key_size": 10909111, "raw_average_key_size": 83, "raw_value_size": 62457370, "raw_average_value_size": 477, "num_data_blocks": 16807, "num_entries": 130853, "filter_policy_name": "rocksdb.BuiltinBloomFilter"}}
> 2021-04-13 07:38:38.331 7f15a0704700  0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febb:68f0]:6884/29245 conn(0x55d521b1a000 0x55d521b16800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> 2021-04-13 07:38:38.331 7f15a0704700  0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febb:68f0]:6819/17174 conn(0x55d521b1c400 0x55d521b38000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> 2021-04-13 07:38:38.507 7f15a0704700  0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6862/8751 conn(0x55d520fbcd80 0x55d520ffa000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> 2021-04-13 07:38:38.515 7f15a372c700  0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:5060]:6835/13651 conn(0x55d521b1bf80 0x55d521b19800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> 2021-04-13 07:38:38.731 7f15a0704700  0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febb:68f0]:6884/29245 conn(0x55d521b1a000 0x55d521b16800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
> 2021-04-13 07:38:38.731 7f15a0704700  0 --1- [v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534] >> v1:[2a0a:e5c0:2:1:21b:21ff:febb:68f0]:6819/17174 conn(0x55d521b1c400 0x55d521b38000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=0).handle_connect_reply_2 connect got BADAUTHORIZER
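A footnote on the backfill/recovery throttles mentioned in the quoted mail: on Nautilus both settings can be changed at runtime. A sketch (the values are illustrative, not a recommendation):

```
# Apply immediately to all running OSDs:
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'
# Or persist them in the mon config database:
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
```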


--
Sustainable and modern Infrastructures by ungleich.ch
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
