Dear all,

I previously sent an email to the list regarding something that I called a "leader election" loop. The problem has reappeared several times and I don't know how to proceed with debugging or fixing it.

I have 6 nodes osd-{1..6} and the monitors are on osd-{1..5}. I run Ceph 15.2.14 deployed with cephadm on CentOS 7.9 (kernel 3.10.0-1160.42.2.el7.x86_64).

The symptoms are (also see my previous email):

- `ceph -s` takes a long time or does not return
  - I sometimes see messages "monclient: get_monmap_and_config failed to get config"
  - I sometimes see messages "problem getting command descriptions from mon.osd-2" (the admin socket always works, of course)
- I sometimes see all daemons out of quorum in `ceph -s`
- different monitors go out of quorum and come back in
- leader election is reinitiated every few seconds
- the monitors appear to transition correctly between "electing" and "peon", but a new leader election is triggered every few seconds...

I have done all the checks in the "troubleshooting monitors" documentation up to the point where it says "reach out to the community". In particular, I checked the mon_status of each monitor: every monitor sees all the others on the correct public IP, and I can telnet to ports 3300 and 6789 from each monitor to all the others.

I have bumped the nf_conntrack settings, although I don't see any syslog entries about dropped packets yet. `netstat -s` shows a few dropped packets (e.g., 172 outgoing packets dropped, 18 dropped because of a missing route).

Also, I have added public servers and the cluster itself to chrony.conf (see below). The output of `chronyc sources -v` indicates to me that the cluster itself is in sync and that the clock offsets between the nodes are well below one millisecond.

I was able to inject the debug level increase to 10/10 into the monitors; I had to repeat this for one out-of-quorum monitor that first returned "Error ENXIO: problem getting command descriptions from mon.osd-5" but then accepted the `ceph tell` command.

I have pulled the logs for two minutes while the cluster was running its leader election loop and attached them. They are a couple of thousand lines each and should show the problem. I'd be happy to send fewer or more lines, though. (A sketch of the commands I used for the checks and the debug injection follows after the chrony output below.)

I'd be happy about any help or suggestions towards a resolution.

Best wishes,
Manuel

```
# 2>/dev/null sysctl -a | grep nf_ | egrep 'max|bucket'
net.netfilter.nf_conntrack_buckets = 2500096
net.netfilter.nf_conntrack_expect_max = 39060
net.netfilter.nf_conntrack_max = 10000000
net.netfilter.nf_conntrack_tcp_max_retrans = 3
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.nf_conntrack_max = 10000000
```

```
# from chrony.conf
server 172.16.35.140 iburst
server 172.16.35.141 iburst
server 172.16.35.142 iburst
server osd-1 iburst
server osd-2 iburst
server osd-3 iburst
server osd-4 iburst
server osd-5 iburst
server osd-6 iburst
server 0.de.pool.ntp.org iburst
server 1.de.pool.ntp.org iburst
server 2.de.pool.ntp.org iburst
server 3.de.pool.ntp.org iburst
```

```
# chronyc sources -v
210 Number of sources = 13

  .-- Source mode  '^' = server, '=' = peer, '#' = local clock.
 / .- Source state '*' = current synced, '+' = combined , '-' = not combined,
| /   '?' = unreachable, 'x' = time may be in error, '~' = time too variable.
||                                                 .- xxxx [ yyyy ] +/- zzzz
||      Reachability register (octal)     -.       |  xxxx = adjusted offset,
||      Log2(Polling interval) --.         |       |  yyyy = measured offset,
||                                \        |       |  zzzz = estimated error.
||                                 |       |        \
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^- 172.16.35.140                 3   6   377    55   +213us[ +213us] +/-   26ms
^+ 172.16.35.141                 2   6   377    63   +807us[ +807us] +/-   12ms
^+ 172.16.35.142                 3   9   377   253  +1488us[+1488us] +/- 7675us
^+ osd-1                         3   6   377    62   +145us[ +145us] +/- 7413us
^+ osd-2                         2   6   377    61  -6577ns[-6577ns] +/- 8108us
^+ osd-3                         4   6   377    50   +509us[ +509us] +/- 6810us
^+ osd-4                         4   6   377    54   +447us[ +447us] +/- 7231us
^+ osd-5                         3   6   377    52   +252us[ +252us] +/- 6738us
^+ osd-6                         2   6   377    56    -13us[  -13us] +/- 8563us
^+ funky.f5s.de                  2   8   377   207   +371us[ +371us] +/-   24ms
^- hetzner01.ziegenberg.at       2  10   377   445   +735us[ +685us] +/-   32ms
^* time1.uni-paderborn.de        1   9   377   253  -4246us[-4297us] +/- 9089us
^- 25000-021.cloud.services>     2  10   377   147   +832us[ +832us] +/-   48ms
```
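For reference, the block below is a rough sketch of how I did the connectivity check and the debug injection described above. It is reconstructed from memory rather than copied from my shell history, so treat the details (the /dev/tcp test instead of interactive telnet, the exact `injectargs` call, and the `<fsid>` placeholder in the systemd unit name) as approximations.

```
# Sketch only: run from one of the monitor hosts; <fsid> stands for the cluster fsid.
for m in osd-1 osd-2 osd-3 osd-4 osd-5; do
  # Check TCP reachability of both messenger ports (I did this interactively with telnet).
  timeout 2 bash -c "</dev/tcp/$m/3300" && echo "$m:3300 reachable"
  timeout 2 bash -c "</dev/tcp/$m/6789" && echo "$m:6789 reachable"

  # Ask each monitor for its own view of the monmap and quorum.
  ceph tell mon.$m mon_status

  # Raise the mon debug level to 10/10 (the 'ceph tell' injection mentioned above).
  ceph tell mon.$m injectargs '--debug_mon 10/10'
done

# Capture about two minutes of monitor log while the election loop is running.
timeout 120 journalctl -f -u "ceph-<fsid>@mon.osd-1.service" > mon.osd-1.log
```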
On Wed, Sep 29, 2021 at 3:43 PM Manuel Holtgrewe <zyklenfrei@xxxxxxxxx> wrote:

> Dear all,
>
> I was a bit too optimistic in my previous email. It looks like the leader election loop has reappeared. I could fix it by stopping the rogue mon daemon, but I don't know how to fix it for good.
>
> I'm running a 16.2.6 Ceph cluster on CentOS 7.9 servers (6 servers in total). I have about 35 HDDs and 4 SSDs in each server. The servers have about 250 GB of RAM; there is no memory pressure on any daemon. I have an identical mirror cluster that does not show the issue (but that one does not have its file system mounted elsewhere and runs no rgws). I have recently migrated both clusters to cephadm and then from Octopus to Pacific.
>
> The primary cluster has problems (pulled from the cluster before fixing/restarting the mon daemon):
>
> - `ceph -s` and other commands feel pretty sluggish
> - `ceph -s` shows inconsistent results in the "health" section and the "services" overview
> - cephfs clients hang, and after rebooting the clients, mounting is not possible any more
> - `ceph config dump` prints "monclient: get_monmap_and_config failed to get config"
> - I have a mon leader election loop, shown in the journalctl output at the bottom
> - the primary mds daemon says things like "skipping upkeep work because connection to Monitors appears laggy" and "ms_deliver_dispatch: unhandled message 0x55ecdec1d340 client_session(request_renewcaps seq 88463) from client.60591566 v1:172.16.59.39:0/3197981635" in its journalctl output
>
> I tried to reboot the client that is supposedly not reacting to cache pressure, but that did not help either. The servers are connected to the same VLT switch pair and use LACP 2x40GbE for the cluster network and 2x10GbE for the public network. I have disabled firewalld on the nodes, but that did not fix the problem either. I suspect that the "laggy monitors" are more likely caused on the software side than on the network side.
>
> I took down the rogue mon.osd-1 with `docker stop` and it looks like the problem disappears then.
>
> To summarize: I suspect the cause to be connected to the mon daemons. I have found that similar problems have been reported a couple of times.
>
> What is the best way forward? It seems that the general suggestion for such cases is to just "ceph orch redeploy mon", so I did this.
>
> Is there any way to find out the root cause to get rid of it?
>
> Best wishes,
> Manuel
>
> osd-1 # ceph -s
>   cluster:
>     id:     55633ec3-6c0c-4a02-990c-0f87e0f7a01f
>     health: HEALTH_WARN
>             1 clients failing to respond to cache pressure
>             1/5 mons down, quorum osd-1,osd-2,osd-5,osd-4
>             Low space hindering backfill (add storage if this doesn't resolve itself): 5 pgs backfill_toofull
>
>   services:
>     mon: 5 daemons, quorum (age 4h), out of quorum: osd-1, osd-2, osd-5, osd-4, osd-3
>     mgr: osd-4.oylrhe(active, since 2h), standbys: osd-1, osd-3, osd-5.jcfyqe, osd-2
>     mds: 1/1 daemons up, 1 standby
>     osd: 180 osds: 180 up (since 4h), 164 in (since 6h); 285 remapped pgs
>     rgw: 12 daemons active (6 hosts, 2 zones)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   14 pools, 5322 pgs
>     objects: 263.18M objects, 944 TiB
>     usage:   1.4 PiB used, 639 TiB / 2.0 PiB avail
>     pgs:     25576348/789544299 objects misplaced (3.239%)
>              5026 active+clean
>              291  active+remapped+backfilling
>              5    active+remapped+backfill_toofull
>
>   io:
>     client:   165 B/s wr, 0 op/s rd, 0 op/s wr
>     recovery: 2.3 GiB/s, 652 objects/s
>
>   progress:
>     Global Recovery Event (53m)
>       [==========================..] (remaining: 3m)
>
> osd-1 # ceph health detail
> HEALTH_WARN 1 clients failing to respond to cache pressure; 1/5 mons down, quorum osd-1,osd-2,osd-5,osd-4; Low space hindering backfill (add storage if this doesn't resolve itself): 5 pgs backfill_toofull
> [WRN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
>     mds.cephfs.osd-1.qkzuas(mds.0): Client med-file1:med-file1 failing to respond to cache pressure client_id: 56229355
> [WRN] MON_DOWN: 1/5 mons down, quorum osd-1,osd-2,osd-5,osd-4
>     mon.osd-3 (rank 4) addr [v2:172.16.62.12:3300/0,v1:172.16.62.12:6789/0] is down (out of quorum)
> [WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this doesn't resolve itself): 5 pgs backfill_toofull
>     pg 3.23d is active+remapped+backfill_toofull, acting [145,128,87]
>     pg 3.33f is active+remapped+backfill_toofull, acting [133,24,107]
>     pg 3.3cb is active+remapped+backfill_toofull, acting [100,90,82]
>     pg 3.3fc is active+remapped+backfill_toofull, acting [155,27,106]
>     pg 3.665 is active+remapped+backfill_toofull, acting [153,73,114]
>
>
> osd-1 # journalctl -f -u ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f@xxxxxxx-1.service
> -- Logs begin at Wed 2021-09-29 08:52:53 CEST.
-- > Sep 29 15:05:32 osd-1 bash[423735]: debug 2021-09-29T13:05:32.214+0000 > 7f6e854ba700 0 mon.osd-1@0(electing) e11 handle_command > mon_command({"prefix": "config rm", "who": "osd/host:osd-3", "name": > "osd_memory_target"} v 0) v1 > Sep 29 15:05:32 osd-1 bash[423735]: debug 2021-09-29T13:05:32.214+0000 > 7f6e854ba700 0 log_channel(audit) log [INF] : from='mgr.66351528 ' > entity='' cmd=[{"prefix": "config rm", "who": "osd/host:osd-3", "name": > "osd_memory_target"}]: dispatch > Sep 29 15:05:32 osd-1 bash[423735]: debug 2021-09-29T13:05:32.398+0000 > 7f6e844b8700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:05:32 osd-1 bash[423735]: debug 2021-09-29T13:05:32.398+0000 > 7f6e844b8700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:05:32 osd-1 bash[423735]: debug 2021-09-29T13:05:32.799+0000 > 7f6e844b8700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:05:32 osd-1 bash[423735]: debug 2021-09-29T13:05:32.799+0000 > 7f6e844b8700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:05:32 osd-1 bash[423735]: debug 2021-09-29T13:05:32.810+0000 > 7f6e854ba700 0 mon.osd-1@0(electing) e11 handle_command > mon_command({"prefix": "df", "detail": "detail"} v 0) v1 > Sep 29 15:05:32 osd-1 bash[423735]: debug 2021-09-29T13:05:32.810+0000 > 7f6e854ba700 0 log_channel(audit) log [DBG] : from='client.? > 172.16.62.12:0/2081332311' entity='client.admin' cmd=[{"prefix": "df", > "detail": "detail"}]: dispatch > Sep 29 15:05:33 osd-1 bash[423735]: debug 2021-09-29T13:05:33.600+0000 > 7f6e844b8700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:05:33 osd-1 bash[423735]: debug 2021-09-29T13:05:33.600+0000 > 7f6e844b8700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:05:35 osd-1 bash[423735]: debug 2021-09-29T13:05:35.195+0000 > 7f6e89cc3700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:05:37 osd-1 bash[423735]: debug 2021-09-29T13:05:37.045+0000 > 7f6e87cbf700 -1 mon.osd-1@0(electing) e11 get_health_metrics reporting 85 > slow ops, oldest is mon_command([{prefix=config-key set, > key=mgr/cephadm/host.osd-2}] v 0) > Sep 29 15:05:37 osd-1 bash[423735]: debug 2021-09-29T13:05:37.205+0000 > 7f6e87cbf700 0 log_channel(cluster) log [INF] : mon.osd-1 is new leader, > mons osd-1,osd-5,osd-4,osd-3 in quorum (ranks 0,2,3,4) > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:46.215+0000 > 7f6e854ba700 0 mon.osd-1@0(leader) e11 handle_command > mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-2}] v 0) v1 > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000 > 7f6e87cbf700 -1 mon.osd-1@0(leader) e11 get_health_metrics reporting 173 > slow ops, oldest is mon_command([{prefix=config-key set, > key=mgr/cephadm/host.osd-2}] v 0) > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000 > 7f6e87cbf700 1 mon.osd-1@0(leader).paxos(paxos recovering c > 29405655..29406327) collect timeout, calling fresh election > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : Health detail: HEALTH_WARN > 1 clients failing to respond to cache pressure; 1/5 mons down, quorum > osd-1,osd-2,osd-5,osd-4; Low space hindering backfill (add storage if this > doesn't resolve itself): 5 pgs backfill_toofull > Sep 29 15:05:51 osd-1 bash[423735]: 
debug 2021-09-29T13:05:51.508+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : [WRN] MDS_CLIENT_RECALL: 1 > clients failing to respond to cache pressure > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : > mds.cephfs.osd-1.qkzuas(mds.0): Client med-file1:med-file1 failing to > respond to cache pressure client_id: 56229355 > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : [WRN] MON_DOWN: 1/5 mons > down, quorum osd-1,osd-2,osd-5,osd-4 > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : mon.osd-3 (rank 4) > addr [v2:172.16.62.12:3300/0,v1:172.16.62.12:6789/0] is down (out of > quorum) > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : [WRN] PG_BACKFILL_FULL: > Low space hindering backfill (add storage if this doesn't resolve itself): > 5 pgs backfill_toofull > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : pg 3.23d is > active+remapped+backfill_toofull, acting [145,128,87] > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : pg 3.33f is > active+remapped+backfill_toofull, acting [133,24,107] > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : pg 3.3cb is > active+remapped+backfill_toofull, acting [100,90,82] > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : pg 3.3fc is > active+remapped+backfill_toofull, acting [155,27,106] > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : pg 3.665 is > active+remapped+backfill_toofull, acting [153,73,114] > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.509+0000 > 7f6e854ba700 0 log_channel(cluster) log [INF] : mon.osd-1 calling monitor > election > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.509+0000 > 7f6e854ba700 1 paxos.0).electionLogic(26610) init, last seen epoch 26610 > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.533+0000 > 7f6e854ba700 1 mon.osd-1@0(electing) e11 collect_metadata md126: no > unique device id for md126: fallback method has no model nor serial' > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.538+0000 > 7f6e854ba700 0 log_channel(cluster) log [INF] : mon.osd-1 calling monitor > election > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.538+0000 > 7f6e854ba700 1 paxos.0).electionLogic(26613) init, last seen epoch 26613, > mid-election, bumping > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.547+0000 > 7f6e854ba700 1 mon.osd-1@0(electing) e11 collect_metadata md126: no > unique device id for md126: fallback method has no model nor serial' > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.547+0000 > 7f6e854ba700 1 mon.osd-1@0(electing) e11 handle_timecheck drop > unexpected msg > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.551+0000 > 7f6e854ba700 0 mon.osd-1@0(electing) e11 handle_command > mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-6}] v 0) v1 > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.555+0000 > 7f6e894c2700 1 
mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.554+0000 > 7f6e854ba700 0 mon.osd-1@0(electing) e11 handle_command > mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-5}] v 0) v1 > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.555+0000 > 7f6e894c2700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.565+0000 > 7f6e854ba700 0 mon.osd-1@0(electing) e11 handle_command > mon_command({"prefix": "config rm", "who": "osd/host:osd-4", "name": > "osd_memory_target"} v 0) v1 > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.565+0000 > 7f6e854ba700 0 log_channel(audit) log [INF] : from='mgr.66351528 ' > entity='' cmd=[{"prefix": "config rm", "who": "osd/host:osd-4", "name": > "osd_memory_target"}]: dispatch > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.565+0000 > 7f6e854ba700 0 mon.osd-1@0(electing) e11 handle_command > mon_command({"prefix": "config rm", "who": "osd/host:osd-3", "name": > "osd_memory_target"} v 0) v1 > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.565+0000 > 7f6e854ba700 0 log_channel(audit) log [INF] : from='mgr.66351528 ' > entity='' cmd=[{"prefix": "config rm", "who": "osd/host:osd-3", "name": > "osd_memory_target"}]: dispatch > Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.572+0000 > 7f6e854ba700 0 log_channel(cluster) log [INF] : mon.osd-1 is new leader, > mons osd-1,osd-2,osd-5,osd-4,osd-3 in quorum (ranks 0,1,2,3,4) > Sep 29 15:05:52 osd-1 bash[423735]: debug 2021-09-29T13:05:52.830+0000 > 7f6e89cc3700 0 --1- [v2:172.16.62.10:3300/0,v1:172.16.62.10:6789/0] >> > conn(0x55629242f000 0x556289dde000 :6789 s=ACCEPTING pgs=0 cs=0 > l=0).handle_client_banner accept peer addr is really - (socket is v1: > 172.16.35.183:47888/0) > Sep 29 15:05:58 osd-1 bash[423735]: debug 2021-09-29T13:05:58.825+0000 > 7f6e894c2700 0 --1- [v2:172.16.62.10:3300/0,v1:172.16.62.10:6789/0] >> > conn(0x55629b6e8800 0x5562a32e3800 :6789 s=ACCEPTING pgs=0 cs=0 > l=0).handle_client_banner accept peer addr is really - (socket is v1: > 172.16.35.182:42746/0) > Sep 29 15:06:03 osd-1 bash[423735]: debug 2021-09-29T13:05:59.667+0000 > 7f6e854ba700 0 mon.osd-1@0(leader) e11 handle_command > mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-2}] v 0) v1 > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000 > 7f6e87cbf700 -1 mon.osd-1@0(leader) e11 get_health_metrics reporting 266 > slow ops, oldest is mon_command([{prefix=config-key set, > key=mgr/cephadm/host.osd-2}] v 0) > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000 > 7f6e87cbf700 1 mon.osd-1@0(leader).paxos(paxos recovering c > 29405655..29406327) collect timeout, calling fresh election > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : Health detail: HEALTH_WARN > 1 clients failing to respond to cache pressure; 1/5 mons down, quorum > osd-1,osd-2,osd-5,osd-4; Low space hindering backfill (add storage if this > doesn't resolve itself): 5 pgs backfill_toofull > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : [WRN] MDS_CLIENT_RECALL: 1 > clients failing to respond to cache pressure > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000 > 7f6e87cbf700 0 log_channel(cluster) log 
[WRN] : > mds.cephfs.osd-1.qkzuas(mds.0): Client med-file1:med-file1 failing to > respond to cache pressure client_id: 56229355 > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : [WRN] MON_DOWN: 1/5 mons > down, quorum osd-1,osd-2,osd-5,osd-4 > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : mon.osd-3 (rank 4) > addr [v2:172.16.62.12:3300/0,v1:172.16.62.12:6789/0] is down (out of > quorum) > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : [WRN] PG_BACKFILL_FULL: > Low space hindering backfill (add storage if this doesn't resolve itself): > 5 pgs backfill_toofull > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : pg 3.23d is > active+remapped+backfill_toofull, acting [145,128,87] > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : pg 3.33f is > active+remapped+backfill_toofull, acting [133,24,107] > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : pg 3.3cb is > active+remapped+backfill_toofull, acting [100,90,82] > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : pg 3.3fc is > active+remapped+backfill_toofull, acting [155,27,106] > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : pg 3.665 is > active+remapped+backfill_toofull, acting [153,73,114] > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.058+0000 > 7f6e854ba700 0 log_channel(cluster) log [INF] : mon.osd-1 calling monitor > election > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.058+0000 > 7f6e854ba700 1 paxos.0).electionLogic(26616) init, last seen epoch 26616 > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.064+0000 > 7f6e854ba700 1 mon.osd-1@0(electing) e11 collect_metadata md126: no > unique device id for md126: fallback method has no model nor serial' > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.065+0000 > 7f6e854ba700 1 mon.osd-1@0(electing) e11 handle_timecheck drop > unexpected msg > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.065+0000 > 7f6e854ba700 0 mon.osd-1@0(electing) e11 handle_command > mon_command({"prefix": "status", "format": "json-pretty"} v 0) v1 > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.065+0000 > 7f6e854ba700 0 log_channel(audit) log [DBG] : from='client.? 
> 172.16.62.11:0/4154945587' entity='client.admin' cmd=[{"prefix": > "status", "format": "json-pretty"}]: dispatch > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.068+0000 > 7f6e854ba700 0 mon.osd-1@0(electing) e11 handle_command > mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-6}] v 0) v1 > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.072+0000 > 7f6e854ba700 0 mon.osd-1@0(electing) e11 handle_command > mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-5}] v 0) v1 > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.082+0000 > 7f6e854ba700 0 mon.osd-1@0(electing) e11 handle_command > mon_command({"prefix": "config rm", "who": "osd/host:osd-4", "name": > "osd_memory_target"} v 0) v1 > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.083+0000 > 7f6e854ba700 0 log_channel(audit) log [INF] : from='mgr.66351528 ' > entity='' cmd=[{"prefix": "config rm", "who": "osd/host:osd-4", "name": > "osd_memory_target"}]: dispatch > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.083+0000 > 7f6e854ba700 0 mon.osd-1@0(electing) e11 handle_command > mon_command({"prefix": "config rm", "who": "osd/host:osd-3", "name": > "osd_memory_target"} v 0) v1 > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.083+0000 > 7f6e854ba700 0 log_channel(audit) log [INF] : from='mgr.66351528 ' > entity='' cmd=[{"prefix": "config rm", "who": "osd/host:osd-3", "name": > "osd_memory_target"}]: dispatch > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.287+0000 > 7f6e844b8700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.488+0000 > 7f6e89cc3700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.719+0000 > 7f6e854ba700 0 mon.osd-1@0(electing) e11 handle_command > mon_command({"prefix": "df", "detail": "detail"} v 0) v1 > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.719+0000 > 7f6e854ba700 0 log_channel(audit) log [DBG] : from='client.? 
> 172.16.62.11:0/1624876515' entity='client.admin' cmd=[{"prefix": "df", > "detail": "detail"}]: dispatch > Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.889+0000 > 7f6e894c2700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:06:05 osd-1 bash[423735]: debug 2021-09-29T13:06:05.691+0000 > 7f6e894c2700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:06:07 osd-1 bash[423735]: debug 2021-09-29T13:06:07.073+0000 > 7f6e894c2700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:06:07 osd-1 bash[423735]: debug 2021-09-29T13:06:07.288+0000 > 7f6e894c2700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:06:07 osd-1 bash[423735]: debug 2021-09-29T13:06:07.294+0000 > 7f6e894c2700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:06:07 osd-1 bash[423735]: debug 2021-09-29T13:06:07.393+0000 > 7f6e894c2700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:06:08 osd-1 bash[423735]: debug 2021-09-29T13:06:08.216+0000 > 7f6e894c2700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:06:09 osd-1 bash[423735]: debug 2021-09-29T13:06:09.034+0000 > 7f6e87cbf700 -1 mon.osd-1@0(electing) e11 get_health_metrics reporting > 289 slow ops, oldest is mon_command([{prefix=config-key set, > key=mgr/cephadm/host.osd-2}] v 0) > Sep 29 15:06:09 osd-1 bash[423735]: debug 2021-09-29T13:06:09.064+0000 > 7f6e87cbf700 1 paxos.0).electionLogic(26617) init, last seen epoch 26617, > mid-election, bumping > Sep 29 15:06:09 osd-1 bash[423735]: debug 2021-09-29T13:06:09.087+0000 > 7f6e87cbf700 1 mon.osd-1@0(electing) e11 collect_metadata md126: no > unique device id for md126: fallback method has no model nor serial' > Sep 29 15:06:09 osd-1 bash[423735]: debug 2021-09-29T13:06:09.101+0000 > 7f6e854ba700 0 log_channel(cluster) log [INF] : mon.osd-1 calling monitor > election > Sep 29 15:06:09 osd-1 bash[423735]: debug 2021-09-29T13:06:09.101+0000 > 7f6e854ba700 1 paxos.0).electionLogic(26621) init, last seen epoch 26621, > mid-election, bumping > Sep 29 15:06:09 osd-1 bash[423735]: debug 2021-09-29T13:06:09.110+0000 > 7f6e854ba700 1 mon.osd-1@0(electing) e11 collect_metadata md126: no > unique device id for md126: fallback method has no model nor serial' > Sep 29 15:06:14 osd-1 bash[423735]: debug 2021-09-29T13:06:14.038+0000 > 7f6e87cbf700 -1 mon.osd-1@0(electing) e11 get_health_metrics reporting > 289 slow ops, oldest is mon_command([{prefix=config-key set, > key=mgr/cephadm/host.osd-2}] v 0) > Sep 29 15:06:14 osd-1 bash[423735]: debug 2021-09-29T13:06:14.123+0000 > 7f6e87cbf700 0 log_channel(cluster) log [INF] : mon.osd-1 is new leader, > mons osd-1,osd-5,osd-4,osd-3 in quorum (ranks 0,2,3,4) > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:22.796+0000 > 7f6e854ba700 0 mon.osd-1@0(leader) e11 handle_command > mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-2}] v 0) v1 > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000 > 7f6e87cbf700 -1 mon.osd-1@0(leader) e11 get_health_metrics reporting 423 > slow ops, oldest is mon_command([{prefix=config-key set, > key=mgr/cephadm/host.osd-2}] v 0) > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000 > 7f6e87cbf700 1 mon.osd-1@0(leader).paxos(paxos recovering c > 29405655..29406327) collect timeout, calling fresh election > Sep 29 15:06:27 
osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : Health detail: HEALTH_WARN > 1 clients failing to respond to cache pressure; 1/5 mons down, quorum > osd-1,osd-2,osd-5,osd-4; Low space hindering backfill (add storage if this > doesn't resolve itself): 5 pgs backfill_toofull > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : [WRN] MDS_CLIENT_RECALL: 1 > clients failing to respond to cache pressure > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : > mds.cephfs.osd-1.qkzuas(mds.0): Client med-file1:med-file1 failing to > respond to cache pressure client_id: 56229355 > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : [WRN] MON_DOWN: 1/5 mons > down, quorum osd-1,osd-2,osd-5,osd-4 > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : mon.osd-3 (rank 4) > addr [v2:172.16.62.12:3300/0,v1:172.16.62.12:6789/0] is down (out of > quorum) > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : [WRN] PG_BACKFILL_FULL: > Low space hindering backfill (add storage if this doesn't resolve itself): > 5 pgs backfill_toofull > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : pg 3.23d is > active+remapped+backfill_toofull, acting [145,128,87] > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : pg 3.33f is > active+remapped+backfill_toofull, acting [133,24,107] > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : pg 3.3cb is > active+remapped+backfill_toofull, acting [100,90,82] > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : pg 3.3fc is > active+remapped+backfill_toofull, acting [155,27,106] > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000 > 7f6e87cbf700 0 log_channel(cluster) log [WRN] : pg 3.665 is > active+remapped+backfill_toofull, acting [153,73,114] > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.224+0000 > 7f6e854ba700 0 log_channel(cluster) log [INF] : mon.osd-1 calling monitor > election > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.224+0000 > 7f6e854ba700 1 paxos.0).electionLogic(26624) init, last seen epoch 26624 > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.253+0000 > 7f6e854ba700 1 mon.osd-1@0(electing) e11 collect_metadata md126: no > unique device id for md126: fallback method has no model nor serial' > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.254+0000 > 7f6e854ba700 1 mon.osd-1@0(electing) e11 handle_timecheck drop > unexpected msg > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.256+0000 > 7f6e854ba700 0 mon.osd-1@0(electing) e11 handle_command > mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-6}] v 0) v1 > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.258+0000 > 7f6e854ba700 0 mon.osd-1@0(electing) e11 handle_command > mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-5}] v 0) v1 > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.259+0000 > 
7f6e854ba700 0 mon.osd-1@0(electing) e11 handle_command > mon_command({"prefix": "config rm", "who": "osd/host:osd-4", "name": > "osd_memory_target"} v 0) v1 > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.259+0000 > 7f6e854ba700 0 log_channel(audit) log [INF] : from='mgr.66351528 ' > entity='' cmd=[{"prefix": "config rm", "who": "osd/host:osd-4", "name": > "osd_memory_target"}]: dispatch > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.259+0000 > 7f6e854ba700 0 mon.osd-1@0(electing) e11 handle_command > mon_command({"prefix": "config rm", "who": "osd/host:osd-3", "name": > "osd_memory_target"} v 0) v1 > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.259+0000 > 7f6e854ba700 0 log_channel(audit) log [INF] : from='mgr.66351528 ' > entity='' cmd=[{"prefix": "config rm", "who": "osd/host:osd-3", "name": > "osd_memory_target"}]: dispatch > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.273+0000 > 7f6e854ba700 0 log_channel(cluster) log [INF] : mon.osd-1 calling monitor > election > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.273+0000 > 7f6e854ba700 1 paxos.0).electionLogic(26627) init, last seen epoch 26627, > mid-election, bumping > Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.282+0000 > 7f6e854ba700 1 mon.osd-1@0(electing) e11 collect_metadata md126: no > unique device id for md126: fallback method has no model nor serial' > Sep 29 15:06:28 osd-1 bash[423735]: debug 2021-09-29T13:06:28.050+0000 > 7f6e844b8700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:06:28 osd-1 bash[423735]: debug 2021-09-29T13:06:28.250+0000 > 7f6e844b8700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > Sep 29 15:06:28 osd-1 bash[423735]: debug 2021-09-29T13:06:28.651+0000 > 7f6e844b8700 1 mon.osd-1@0(electing) e11 handle_auth_request failed to > assign global_id > > osd-1 # journalctl -f -u > ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f@xxxxxxxxxxxxxx-1.qkzuas.service > Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000 > 7f994ec61700 0 ms_deliver_dispatch: unhandled message 0x55ecdc463500 > client_session(request_renewcaps seq 88463) from client.60598827 v1: > 172.16.59.39:0/1389838619 > Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000 > 7f994ec61700 0 ms_deliver_dispatch: unhandled message 0x55ece3a0cfc0 > client_session(request_renewcaps seq 88463) from client.60598821 v1: > 172.16.59.39:0/858534994 > Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000 > 7f994ec61700 0 ms_deliver_dispatch: unhandled message 0x55ece1e24540 > client_session(request_renewcaps seq 88459) from client.60591845 v1: > 172.16.59.7:0/1705034209 > Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000 > 7f994ec61700 0 ms_deliver_dispatch: unhandled message 0x55ece055f340 > client_session(request_renewcaps seq 88462) from client.60598851 v1: > 172.16.59.26:0/763945533 > Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000 > 7f994ec61700 0 ms_deliver_dispatch: unhandled message 0x55ecdcb97c00 > client_session(request_renewcaps seq 88459) from client.60591994 v1: > 172.16.59.7:0/4158829178 > Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000 > 7f994ec61700 0 ms_deliver_dispatch: unhandled message 0x55ecdfa9bc00 > client_session(request_renewcaps seq 86286) from client.60712226 v1: > 172.16.59.64:0/1098377799 > Sep 29 15:19:52 osd-1 bash[254093]: debug 
2021-09-29T13:19:52.349+0000 > 7f994ec61700 0 ms_deliver_dispatch: unhandled message 0x55ec336dc000 > client_session(request_renewcaps seq 88463) from client.60591563 v1: > 172.16.59.39:0/1765846930 > Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000 > 7f994ec61700 0 ms_deliver_dispatch: unhandled message 0x55ecdae976c0 > client_session(request_renewcaps seq 86592) from client.60695401 v1: > 172.16.59.27:0/2213843285 > Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000 > 7f994ec61700 0 ms_deliver_dispatch: unhandled message 0x55ecdf211a40 > client_session(request_renewcaps seq 88461) from client.60599085 v1: > 172.16.59.19:0/1476359719 > Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000 > 7f994ec61700 0 ms_deliver_dispatch: unhandled message 0x55ecdec1d340 > client_session(request_renewcaps seq 88463) from client.60591566 v1: > 172.16.59.39:0/3197981635