Re: Leader election loop reappears

Dear all,

I previously sent an email to the list regarding something that I called a
"leader election loop". The problem has reappeared several times and I don't
know how to proceed with debugging or fixing it.

I have 6 nodes osd-{1..6} and monitors are on osd-{1..5}. I run ceph
15.2.14 using cephadm on CentOS 7.9 (kernel 3.10.0-1160.42.2.el7.x86_64).

The symptoms are as follows (also see my previous email):

- `ceph -s` takes a long time or does not return
--- I sometimes see the message "monclient: get_monmap_and_config failed to
get config"
--- I sometimes see the message "problem getting command descriptions from
mon.osd-2" (querying via the admin socket always works, of course; see the
sketch after this list)
- I sometimes see all daemons listed as out of quorum in `ceph -s`
- different monitors go out of quorum and come back in
- leader election is reinitiated every few seconds
- the monitors appear to transition correctly between "electing" and "peon",
but the issue is that a leader election is performed every few seconds...
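
For reference, this is how I query a monitor directly when the CLI hangs -- a
minimal sketch assuming a cephadm deployment, so the socket path, the fsid
placeholder, and the daemon names may need adjusting:

```
# Query mon.osd-2 via its admin socket from within the cephadm container:
cephadm shell --name mon.osd-2 -- ceph daemon mon.osd-2 mon_status

# Or directly against the socket on the host (the fsid is a placeholder):
ceph --admin-daemon /var/run/ceph/<fsid>/ceph-mon.osd-2.asok quorum_status
```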

I have done all the checks in the "Troubleshooting Monitors" documentation up
to the point where it says "reach out to the community". In particular, I
checked the mon_status of each monitor; every monitor sees all others on the
correct public IP, and I can telnet to ports 3300 and 6789 from each monitor
host to all others.
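
Concretely, those checks were roughly along these lines (a sketch; the jq
filter is only for readability and assumes jq is installed):

```
# Verify that every monitor reports the same monmap with the correct public IPs:
for m in osd-{1..5}; do
  ceph tell mon.$m mon_status | jq '{name: .name, mons: [.monmap.mons[].public_addrs]}'
done

# Raw TCP reachability of the msgr v2/v1 ports, run from each monitor host:
for m in osd-{1..5}; do
  for p in 3300 6789; do
    timeout 2 bash -c "</dev/tcp/$m/$p" && echo "$m:$p reachable"
  done
done
```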

I have bumped the nf_conntrack settings (see below), although I don't have any
entries in the syslog yet about dropped packets. `netstat -s` shows a few
dropped packets (e.g., 172 outgoing packets dropped, 18 dropped because of a
missing route).

Also, I have added public servers and the cluster nodes themselves to
chrony.conf (see below). The output of `chronyc sources -v` indicates to me
that the cluster nodes are in sync with each other and that the clock offsets
are in the sub-millisecond range, well below Ceph's clock-drift threshold.
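
For completeness, the skew can also be checked from Ceph's own perspective
with the standard commands below (the default threshold noted in the comment
is my understanding, not something I verified on this cluster):

```
# Clock skew between monitors as measured by the mon leader:
ceph time-sync-status

# The allowed drift before Ceph raises a clock-skew warning (default 0.05 s):
ceph config get mon mon_clock_drift_allowed
```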

I was able to inject the debug level increase to 10/10 into the monitors; I
had to repeat this for one out-of-quorum monitor that first responded with
"Error ENXIO: problem getting command descriptions from mon.osd-5" but then
accepted the `ceph tell` command.

I have pulled the logs for two minutes while the cluster was running its
leader election loop and attached them. They are a couple of thousand lines
each and should show the problem. I'd be happy to send fewer or more lines,
though.
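
For reference, the log windows were pulled roughly like this (a sketch; the
systemd unit name follows the cephadm scheme and the fsid is a placeholder):

```
# Grab the last two minutes of one monitor's log from its cephadm systemd unit:
journalctl -u "ceph-<fsid>@mon.osd-1.service" --since "2 min ago" > mon.osd-1.log
```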

I'd be happy about any help or suggestions towards a resolution.

Best wishes,
Manuel

```
# 2>/dev/null sysctl -a | grep nf_ | egrep 'max|bucket'
net.netfilter.nf_conntrack_buckets = 2500096
net.netfilter.nf_conntrack_expect_max = 39060
net.netfilter.nf_conntrack_max = 10000000
net.netfilter.nf_conntrack_tcp_max_retrans = 3
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.nf_conntrack_max = 10000000
```
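
In case it matters, this is roughly how the bumped conntrack limit is applied
persistently (a sketch; the file name under /etc/sysctl.d is arbitrary):

```
# Persist the raised conntrack limit across reboots (value as shown above):
cat > /etc/sysctl.d/90-ceph-conntrack.conf <<'EOF'
net.netfilter.nf_conntrack_max = 10000000
EOF
sysctl --system
```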

```
# from chrony.conf
server 172.16.35.140 iburst
server 172.16.35.141 iburst
server 172.16.35.142 iburst
server osd-1 iburst
server osd-2 iburst
server osd-3 iburst
server osd-4 iburst
server osd-5 iburst
server osd-6 iburst
server 0.de.pool.ntp.org iburst
server 1.de.pool.ntp.org iburst
server 2.de.pool.ntp.org iburst
server 3.de.pool.ntp.org iburst
```

```
# chronyc sources -v
210 Number of sources = 13

  .-- Source mode  '^' = server, '=' = peer, '#' = local clock.
 / .- Source state '*' = current synced, '+' = combined , '-' = not combined,
| /   '?' = unreachable, 'x' = time may be in error, '~' = time too variable.
||                                                 .- xxxx [ yyyy ] +/- zzzz
||      Reachability register (octal) -.           |  xxxx = adjusted offset,
||      Log2(Polling interval) --.      |          |  yyyy = measured offset,
||                                \     |          |  zzzz = estimated error.
||                                 |    |            \
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^- 172.16.35.140                 3   6   377    55   +213us[ +213us] +/-   26ms
^+ 172.16.35.141                 2   6   377    63   +807us[ +807us] +/-   12ms
^+ 172.16.35.142                 3   9   377   253  +1488us[+1488us] +/- 7675us
^+ osd-1                         3   6   377    62   +145us[ +145us] +/- 7413us
^+ osd-2                         2   6   377    61  -6577ns[-6577ns] +/- 8108us
^+ osd-3                         4   6   377    50   +509us[ +509us] +/- 6810us
^+ osd-4                         4   6   377    54   +447us[ +447us] +/- 7231us
^+ osd-5                         3   6   377    52   +252us[ +252us] +/- 6738us
^+ osd-6                         2   6   377    56    -13us[  -13us] +/- 8563us
^+ funky.f5s.de                  2   8   377   207   +371us[ +371us] +/-   24ms
^- hetzner01.ziegenberg.at       2  10   377   445   +735us[ +685us] +/-   32ms
^* time1.uni-paderborn.de        1   9   377   253  -4246us[-4297us] +/- 9089us
^- 25000-021.cloud.services>     2  10   377   147   +832us[ +832us] +/-   48ms
```

On Wed, Sep 29, 2021 at 3:43 PM Manuel Holtgrewe <zyklenfrei@xxxxxxxxx>
wrote:

> Dear all,
>
> I was a bit too optimistic in my previous email. It looks like the leader
> election loop reappeared. I could fix it by stopping the rogue mon daemon
> but I don't know how to fix it for good.
>
> I'm running a 16.2.6 Ceph cluster on CentOS 7.9 servers (6 servers in
> total). I have about 35 HDDs in each server and 4 SSDs. The servers have
> about 250 GB of RAM, there is no memory pressure on any daemon. I have an
> identical mirror cluster that does not have the issue (but that one does
> not have its file system mounted elsewhere and is running no rgws). I have
> migrated both clusters recently to cephadm and then from octopus to pacific.
>
> The primary cluster has problems (pulled from the cluster before
> fixing/restarting mon daemon):
>
> - `ceph -s` and other commands feel pretty sluggish
> - `ceph -s` shows inconsistent results in the "health" section and
> "services" overview
> - cephfs clients hang and after rebooting the clients, mounting is not
> possible any more
> - `ceph config dump` prints "monclient: get_monmap_and_config failed to
> get config"
> - I have a mon leader election loop, shown in the journalctl output at the
> bottom.
> - the primary mds daemon says things like "skipping upkeep work because
> connection to Monitors appears laggy" and "ms_deliver_dispatch: unhandled
> message 0x55ecdec1d340 client_session(request_renewcaps seq 88463) from
> client.60591566 v1:172.16.59.39:0/3197981635" in their journalctl output
>
> I tried to reboot the client that is supposedly not reacting to cache
> pressure but that did not help either. The servers are connected to the
> same VLT switch pair and use LACP 2x40GbE for cluster and 2x10GbE for
> public network. I have disabled firewalld on the nodes but that did not fix
> the problem either. I suspect that the "laggy monitors" are more likely
> caused on the software side than on the network side.
>
> I took down the rogue mon.osd-1 with `docker stop`, and the problem seems to
> disappear afterwards.
>
> To summarize: I suspect the cause is connected to the mon daemons. I have
> found that similar problems have been reported a couple of times.
>
> What is the best way forward? It seems that the general suggestion for
> such cases is to just "ceph orch redeploy mon", so I did this.
>
> Is there any way to find out the root cause to get rid of it?
>
> Best wishes,
> Manuel
>
> osd-1 # ceph -s
>   cluster:
>     id:     55633ec3-6c0c-4a02-990c-0f87e0f7a01f
>     health: HEALTH_WARN
>             1 clients failing to respond to cache pressure
>             1/5 mons down, quorum osd-1,osd-2,osd-5,osd-4
>             Low space hindering backfill (add storage if this doesn't
> resolve itself): 5 pgs backfill_toofull
>
>   services:
>     mon: 5 daemons, quorum  (age 4h), out of quorum: osd-1, osd-2, osd-5,
> osd-4, osd-3
>     mgr: osd-4.oylrhe(active, since 2h), standbys: osd-1, osd-3,
> osd-5.jcfyqe, osd-2
>     mds: 1/1 daemons up, 1 standby
>     osd: 180 osds: 180 up (since 4h), 164 in (since 6h); 285 remapped pgs
>     rgw: 12 daemons active (6 hosts, 2 zones)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   14 pools, 5322 pgs
>     objects: 263.18M objects, 944 TiB
>     usage:   1.4 PiB used, 639 TiB / 2.0 PiB avail
>     pgs:     25576348/789544299 objects misplaced (3.239%)
>              5026 active+clean
>              291  active+remapped+backfilling
>              5    active+remapped+backfill_toofull
>
>   io:
>     client:   165 B/s wr, 0 op/s rd, 0 op/s wr
>     recovery: 2.3 GiB/s, 652 objects/s
>
>   progress:
>     Global Recovery Event (53m)
>       [==========================..] (remaining: 3m)
>
> osd-1 # ceph health detail
> HEALTH_WARN 1 clients failing to respond to cache pressure; 1/5 mons down,
> quorum osd-1,osd-2,osd-5,osd-4; Low space hindering backfill (add storage
> if this doesn't resolve itself): 5 pgs backfill_toofull
> [WRN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
>     mds.cephfs.osd-1.qkzuas(mds.0): Client med-file1:med-file1 failing to
> respond to cache pressure client_id: 56229355
> [WRN] MON_DOWN: 1/5 mons down, quorum osd-1,osd-2,osd-5,osd-4
>     mon.osd-3 (rank 4) addr [v2:172.16.62.12:3300/0,v1:172.16.62.12:6789/0]
> is down (out of quorum)
> [WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this
> doesn't resolve itself): 5 pgs backfill_toofull
>     pg 3.23d is active+remapped+backfill_toofull, acting [145,128,87]
>     pg 3.33f is active+remapped+backfill_toofull, acting [133,24,107]
>     pg 3.3cb is active+remapped+backfill_toofull, acting [100,90,82]
>     pg 3.3fc is active+remapped+backfill_toofull, acting [155,27,106]
>     pg 3.665 is active+remapped+backfill_toofull, acting [153,73,114]
>
>
> osd-1 # journalctl -f -u
> ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f@xxxxxxx-1.service
> -- Logs begin at Wed 2021-09-29 08:52:53 CEST. --
> Sep 29 15:05:32 osd-1 bash[423735]: debug 2021-09-29T13:05:32.214+0000
> 7f6e854ba700  0 mon.osd-1@0(electing) e11 handle_command
> mon_command({"prefix": "config rm", "who": "osd/host:osd-3", "name":
> "osd_memory_target"} v 0) v1
> Sep 29 15:05:32 osd-1 bash[423735]: debug 2021-09-29T13:05:32.214+0000
> 7f6e854ba700  0 log_channel(audit) log [INF] : from='mgr.66351528 '
> entity='' cmd=[{"prefix": "config rm", "who": "osd/host:osd-3", "name":
> "osd_memory_target"}]: dispatch
> Sep 29 15:05:32 osd-1 bash[423735]: debug 2021-09-29T13:05:32.398+0000
> 7f6e844b8700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:05:32 osd-1 bash[423735]: debug 2021-09-29T13:05:32.398+0000
> 7f6e844b8700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:05:32 osd-1 bash[423735]: debug 2021-09-29T13:05:32.799+0000
> 7f6e844b8700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:05:32 osd-1 bash[423735]: debug 2021-09-29T13:05:32.799+0000
> 7f6e844b8700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:05:32 osd-1 bash[423735]: debug 2021-09-29T13:05:32.810+0000
> 7f6e854ba700  0 mon.osd-1@0(electing) e11 handle_command
> mon_command({"prefix": "df", "detail": "detail"} v 0) v1
> Sep 29 15:05:32 osd-1 bash[423735]: debug 2021-09-29T13:05:32.810+0000
> 7f6e854ba700  0 log_channel(audit) log [DBG] : from='client.?
> 172.16.62.12:0/2081332311' entity='client.admin' cmd=[{"prefix": "df",
> "detail": "detail"}]: dispatch
> Sep 29 15:05:33 osd-1 bash[423735]: debug 2021-09-29T13:05:33.600+0000
> 7f6e844b8700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:05:33 osd-1 bash[423735]: debug 2021-09-29T13:05:33.600+0000
> 7f6e844b8700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:05:35 osd-1 bash[423735]: debug 2021-09-29T13:05:35.195+0000
> 7f6e89cc3700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:05:37 osd-1 bash[423735]: debug 2021-09-29T13:05:37.045+0000
> 7f6e87cbf700 -1 mon.osd-1@0(electing) e11 get_health_metrics reporting 85
> slow ops, oldest is mon_command([{prefix=config-key set,
> key=mgr/cephadm/host.osd-2}] v 0)
> Sep 29 15:05:37 osd-1 bash[423735]: debug 2021-09-29T13:05:37.205+0000
> 7f6e87cbf700  0 log_channel(cluster) log [INF] : mon.osd-1 is new leader,
> mons osd-1,osd-5,osd-4,osd-3 in quorum (ranks 0,2,3,4)
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:46.215+0000
> 7f6e854ba700  0 mon.osd-1@0(leader) e11 handle_command
> mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-2}] v 0) v1
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000
> 7f6e87cbf700 -1 mon.osd-1@0(leader) e11 get_health_metrics reporting 173
> slow ops, oldest is mon_command([{prefix=config-key set,
> key=mgr/cephadm/host.osd-2}] v 0)
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000
> 7f6e87cbf700  1 mon.osd-1@0(leader).paxos(paxos recovering c
> 29405655..29406327) collect timeout, calling fresh election
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] : Health detail: HEALTH_WARN
> 1 clients failing to respond to cache pressure; 1/5 mons down, quorum
> osd-1,osd-2,osd-5,osd-4; Low space hindering backfill (add storage if this
> doesn't resolve itself): 5 pgs backfill_toofull
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] : [WRN] MDS_CLIENT_RECALL: 1
> clients failing to respond to cache pressure
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :
> mds.cephfs.osd-1.qkzuas(mds.0): Client med-file1:med-file1 failing to
> respond to cache pressure client_id: 56229355
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] : [WRN] MON_DOWN: 1/5 mons
> down, quorum osd-1,osd-2,osd-5,osd-4
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :     mon.osd-3 (rank 4)
> addr [v2:172.16.62.12:3300/0,v1:172.16.62.12:6789/0] is down (out of
> quorum)
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] : [WRN] PG_BACKFILL_FULL:
> Low space hindering backfill (add storage if this doesn't resolve itself):
> 5 pgs backfill_toofull
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :     pg 3.23d is
> active+remapped+backfill_toofull, acting [145,128,87]
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :     pg 3.33f is
> active+remapped+backfill_toofull, acting [133,24,107]
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :     pg 3.3cb is
> active+remapped+backfill_toofull, acting [100,90,82]
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :     pg 3.3fc is
> active+remapped+backfill_toofull, acting [155,27,106]
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.508+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :     pg 3.665 is
> active+remapped+backfill_toofull, acting [153,73,114]
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.509+0000
> 7f6e854ba700  0 log_channel(cluster) log [INF] : mon.osd-1 calling monitor
> election
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.509+0000
> 7f6e854ba700  1 paxos.0).electionLogic(26610) init, last seen epoch 26610
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.533+0000
> 7f6e854ba700  1 mon.osd-1@0(electing) e11 collect_metadata md126:  no
> unique device id for md126: fallback method has no model nor serial'
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.538+0000
> 7f6e854ba700  0 log_channel(cluster) log [INF] : mon.osd-1 calling monitor
> election
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.538+0000
> 7f6e854ba700  1 paxos.0).electionLogic(26613) init, last seen epoch 26613,
> mid-election, bumping
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.547+0000
> 7f6e854ba700  1 mon.osd-1@0(electing) e11 collect_metadata md126:  no
> unique device id for md126: fallback method has no model nor serial'
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.547+0000
> 7f6e854ba700  1 mon.osd-1@0(electing) e11 handle_timecheck drop
> unexpected msg
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.551+0000
> 7f6e854ba700  0 mon.osd-1@0(electing) e11 handle_command
> mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-6}] v 0) v1
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.555+0000
> 7f6e894c2700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.554+0000
> 7f6e854ba700  0 mon.osd-1@0(electing) e11 handle_command
> mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-5}] v 0) v1
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.555+0000
> 7f6e894c2700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.565+0000
> 7f6e854ba700  0 mon.osd-1@0(electing) e11 handle_command
> mon_command({"prefix": "config rm", "who": "osd/host:osd-4", "name":
> "osd_memory_target"} v 0) v1
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.565+0000
> 7f6e854ba700  0 log_channel(audit) log [INF] : from='mgr.66351528 '
> entity='' cmd=[{"prefix": "config rm", "who": "osd/host:osd-4", "name":
> "osd_memory_target"}]: dispatch
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.565+0000
> 7f6e854ba700  0 mon.osd-1@0(electing) e11 handle_command
> mon_command({"prefix": "config rm", "who": "osd/host:osd-3", "name":
> "osd_memory_target"} v 0) v1
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.565+0000
> 7f6e854ba700  0 log_channel(audit) log [INF] : from='mgr.66351528 '
> entity='' cmd=[{"prefix": "config rm", "who": "osd/host:osd-3", "name":
> "osd_memory_target"}]: dispatch
> Sep 29 15:05:51 osd-1 bash[423735]: debug 2021-09-29T13:05:51.572+0000
> 7f6e854ba700  0 log_channel(cluster) log [INF] : mon.osd-1 is new leader,
> mons osd-1,osd-2,osd-5,osd-4,osd-3 in quorum (ranks 0,1,2,3,4)
> Sep 29 15:05:52 osd-1 bash[423735]: debug 2021-09-29T13:05:52.830+0000
> 7f6e89cc3700  0 --1- [v2:172.16.62.10:3300/0,v1:172.16.62.10:6789/0] >>
>  conn(0x55629242f000 0x556289dde000 :6789 s=ACCEPTING pgs=0 cs=0
> l=0).handle_client_banner accept peer addr is really - (socket is v1:
> 172.16.35.183:47888/0)
> Sep 29 15:05:58 osd-1 bash[423735]: debug 2021-09-29T13:05:58.825+0000
> 7f6e894c2700  0 --1- [v2:172.16.62.10:3300/0,v1:172.16.62.10:6789/0] >>
>  conn(0x55629b6e8800 0x5562a32e3800 :6789 s=ACCEPTING pgs=0 cs=0
> l=0).handle_client_banner accept peer addr is really - (socket is v1:
> 172.16.35.182:42746/0)
> Sep 29 15:06:03 osd-1 bash[423735]: debug 2021-09-29T13:05:59.667+0000
> 7f6e854ba700  0 mon.osd-1@0(leader) e11 handle_command
> mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-2}] v 0) v1
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000
> 7f6e87cbf700 -1 mon.osd-1@0(leader) e11 get_health_metrics reporting 266
> slow ops, oldest is mon_command([{prefix=config-key set,
> key=mgr/cephadm/host.osd-2}] v 0)
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000
> 7f6e87cbf700  1 mon.osd-1@0(leader).paxos(paxos recovering c
> 29405655..29406327) collect timeout, calling fresh election
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] : Health detail: HEALTH_WARN
> 1 clients failing to respond to cache pressure; 1/5 mons down, quorum
> osd-1,osd-2,osd-5,osd-4; Low space hindering backfill (add storage if this
> doesn't resolve itself): 5 pgs backfill_toofull
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] : [WRN] MDS_CLIENT_RECALL: 1
> clients failing to respond to cache pressure
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :
> mds.cephfs.osd-1.qkzuas(mds.0): Client med-file1:med-file1 failing to
> respond to cache pressure client_id: 56229355
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] : [WRN] MON_DOWN: 1/5 mons
> down, quorum osd-1,osd-2,osd-5,osd-4
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :     mon.osd-3 (rank 4)
> addr [v2:172.16.62.12:3300/0,v1:172.16.62.12:6789/0] is down (out of
> quorum)
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] : [WRN] PG_BACKFILL_FULL:
> Low space hindering backfill (add storage if this doesn't resolve itself):
> 5 pgs backfill_toofull
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :     pg 3.23d is
> active+remapped+backfill_toofull, acting [145,128,87]
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :     pg 3.33f is
> active+remapped+backfill_toofull, acting [133,24,107]
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :     pg 3.3cb is
> active+remapped+backfill_toofull, acting [100,90,82]
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :     pg 3.3fc is
> active+remapped+backfill_toofull, acting [155,27,106]
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.034+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :     pg 3.665 is
> active+remapped+backfill_toofull, acting [153,73,114]
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.058+0000
> 7f6e854ba700  0 log_channel(cluster) log [INF] : mon.osd-1 calling monitor
> election
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.058+0000
> 7f6e854ba700  1 paxos.0).electionLogic(26616) init, last seen epoch 26616
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.064+0000
> 7f6e854ba700  1 mon.osd-1@0(electing) e11 collect_metadata md126:  no
> unique device id for md126: fallback method has no model nor serial'
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.065+0000
> 7f6e854ba700  1 mon.osd-1@0(electing) e11 handle_timecheck drop
> unexpected msg
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.065+0000
> 7f6e854ba700  0 mon.osd-1@0(electing) e11 handle_command
> mon_command({"prefix": "status", "format": "json-pretty"} v 0) v1
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.065+0000
> 7f6e854ba700  0 log_channel(audit) log [DBG] : from='client.?
> 172.16.62.11:0/4154945587' entity='client.admin' cmd=[{"prefix":
> "status", "format": "json-pretty"}]: dispatch
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.068+0000
> 7f6e854ba700  0 mon.osd-1@0(electing) e11 handle_command
> mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-6}] v 0) v1
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.072+0000
> 7f6e854ba700  0 mon.osd-1@0(electing) e11 handle_command
> mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-5}] v 0) v1
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.082+0000
> 7f6e854ba700  0 mon.osd-1@0(electing) e11 handle_command
> mon_command({"prefix": "config rm", "who": "osd/host:osd-4", "name":
> "osd_memory_target"} v 0) v1
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.083+0000
> 7f6e854ba700  0 log_channel(audit) log [INF] : from='mgr.66351528 '
> entity='' cmd=[{"prefix": "config rm", "who": "osd/host:osd-4", "name":
> "osd_memory_target"}]: dispatch
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.083+0000
> 7f6e854ba700  0 mon.osd-1@0(electing) e11 handle_command
> mon_command({"prefix": "config rm", "who": "osd/host:osd-3", "name":
> "osd_memory_target"} v 0) v1
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.083+0000
> 7f6e854ba700  0 log_channel(audit) log [INF] : from='mgr.66351528 '
> entity='' cmd=[{"prefix": "config rm", "who": "osd/host:osd-3", "name":
> "osd_memory_target"}]: dispatch
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.287+0000
> 7f6e844b8700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.488+0000
> 7f6e89cc3700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.719+0000
> 7f6e854ba700  0 mon.osd-1@0(electing) e11 handle_command
> mon_command({"prefix": "df", "detail": "detail"} v 0) v1
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.719+0000
> 7f6e854ba700  0 log_channel(audit) log [DBG] : from='client.?
> 172.16.62.11:0/1624876515' entity='client.admin' cmd=[{"prefix": "df",
> "detail": "detail"}]: dispatch
> Sep 29 15:06:04 osd-1 bash[423735]: debug 2021-09-29T13:06:04.889+0000
> 7f6e894c2700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:06:05 osd-1 bash[423735]: debug 2021-09-29T13:06:05.691+0000
> 7f6e894c2700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:06:07 osd-1 bash[423735]: debug 2021-09-29T13:06:07.073+0000
> 7f6e894c2700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:06:07 osd-1 bash[423735]: debug 2021-09-29T13:06:07.288+0000
> 7f6e894c2700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:06:07 osd-1 bash[423735]: debug 2021-09-29T13:06:07.294+0000
> 7f6e894c2700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:06:07 osd-1 bash[423735]: debug 2021-09-29T13:06:07.393+0000
> 7f6e894c2700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:06:08 osd-1 bash[423735]: debug 2021-09-29T13:06:08.216+0000
> 7f6e894c2700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:06:09 osd-1 bash[423735]: debug 2021-09-29T13:06:09.034+0000
> 7f6e87cbf700 -1 mon.osd-1@0(electing) e11 get_health_metrics reporting
> 289 slow ops, oldest is mon_command([{prefix=config-key set,
> key=mgr/cephadm/host.osd-2}] v 0)
> Sep 29 15:06:09 osd-1 bash[423735]: debug 2021-09-29T13:06:09.064+0000
> 7f6e87cbf700  1 paxos.0).electionLogic(26617) init, last seen epoch 26617,
> mid-election, bumping
> Sep 29 15:06:09 osd-1 bash[423735]: debug 2021-09-29T13:06:09.087+0000
> 7f6e87cbf700  1 mon.osd-1@0(electing) e11 collect_metadata md126:  no
> unique device id for md126: fallback method has no model nor serial'
> Sep 29 15:06:09 osd-1 bash[423735]: debug 2021-09-29T13:06:09.101+0000
> 7f6e854ba700  0 log_channel(cluster) log [INF] : mon.osd-1 calling monitor
> election
> Sep 29 15:06:09 osd-1 bash[423735]: debug 2021-09-29T13:06:09.101+0000
> 7f6e854ba700  1 paxos.0).electionLogic(26621) init, last seen epoch 26621,
> mid-election, bumping
> Sep 29 15:06:09 osd-1 bash[423735]: debug 2021-09-29T13:06:09.110+0000
> 7f6e854ba700  1 mon.osd-1@0(electing) e11 collect_metadata md126:  no
> unique device id for md126: fallback method has no model nor serial'
> Sep 29 15:06:14 osd-1 bash[423735]: debug 2021-09-29T13:06:14.038+0000
> 7f6e87cbf700 -1 mon.osd-1@0(electing) e11 get_health_metrics reporting
> 289 slow ops, oldest is mon_command([{prefix=config-key set,
> key=mgr/cephadm/host.osd-2}] v 0)
> Sep 29 15:06:14 osd-1 bash[423735]: debug 2021-09-29T13:06:14.123+0000
> 7f6e87cbf700  0 log_channel(cluster) log [INF] : mon.osd-1 is new leader,
> mons osd-1,osd-5,osd-4,osd-3 in quorum (ranks 0,2,3,4)
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:22.796+0000
> 7f6e854ba700  0 mon.osd-1@0(leader) e11 handle_command
> mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-2}] v 0) v1
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000
> 7f6e87cbf700 -1 mon.osd-1@0(leader) e11 get_health_metrics reporting 423
> slow ops, oldest is mon_command([{prefix=config-key set,
> key=mgr/cephadm/host.osd-2}] v 0)
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000
> 7f6e87cbf700  1 mon.osd-1@0(leader).paxos(paxos recovering c
> 29405655..29406327) collect timeout, calling fresh election
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] : Health detail: HEALTH_WARN
> 1 clients failing to respond to cache pressure; 1/5 mons down, quorum
> osd-1,osd-2,osd-5,osd-4; Low space hindering backfill (add storage if this
> doesn't resolve itself): 5 pgs backfill_toofull
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] : [WRN] MDS_CLIENT_RECALL: 1
> clients failing to respond to cache pressure
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :
> mds.cephfs.osd-1.qkzuas(mds.0): Client med-file1:med-file1 failing to
> respond to cache pressure client_id: 56229355
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] : [WRN] MON_DOWN: 1/5 mons
> down, quorum osd-1,osd-2,osd-5,osd-4
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :     mon.osd-3 (rank 4)
> addr [v2:172.16.62.12:3300/0,v1:172.16.62.12:6789/0] is down (out of
> quorum)
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] : [WRN] PG_BACKFILL_FULL:
> Low space hindering backfill (add storage if this doesn't resolve itself):
> 5 pgs backfill_toofull
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :     pg 3.23d is
> active+remapped+backfill_toofull, acting [145,128,87]
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :     pg 3.33f is
> active+remapped+backfill_toofull, acting [133,24,107]
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :     pg 3.3cb is
> active+remapped+backfill_toofull, acting [100,90,82]
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :     pg 3.3fc is
> active+remapped+backfill_toofull, acting [155,27,106]
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.223+0000
> 7f6e87cbf700  0 log_channel(cluster) log [WRN] :     pg 3.665 is
> active+remapped+backfill_toofull, acting [153,73,114]
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.224+0000
> 7f6e854ba700  0 log_channel(cluster) log [INF] : mon.osd-1 calling monitor
> election
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.224+0000
> 7f6e854ba700  1 paxos.0).electionLogic(26624) init, last seen epoch 26624
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.253+0000
> 7f6e854ba700  1 mon.osd-1@0(electing) e11 collect_metadata md126:  no
> unique device id for md126: fallback method has no model nor serial'
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.254+0000
> 7f6e854ba700  1 mon.osd-1@0(electing) e11 handle_timecheck drop
> unexpected msg
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.256+0000
> 7f6e854ba700  0 mon.osd-1@0(electing) e11 handle_command
> mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-6}] v 0) v1
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.258+0000
> 7f6e854ba700  0 mon.osd-1@0(electing) e11 handle_command
> mon_command([{prefix=config-key set, key=mgr/cephadm/host.osd-5}] v 0) v1
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.259+0000
> 7f6e854ba700  0 mon.osd-1@0(electing) e11 handle_command
> mon_command({"prefix": "config rm", "who": "osd/host:osd-4", "name":
> "osd_memory_target"} v 0) v1
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.259+0000
> 7f6e854ba700  0 log_channel(audit) log [INF] : from='mgr.66351528 '
> entity='' cmd=[{"prefix": "config rm", "who": "osd/host:osd-4", "name":
> "osd_memory_target"}]: dispatch
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.259+0000
> 7f6e854ba700  0 mon.osd-1@0(electing) e11 handle_command
> mon_command({"prefix": "config rm", "who": "osd/host:osd-3", "name":
> "osd_memory_target"} v 0) v1
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.259+0000
> 7f6e854ba700  0 log_channel(audit) log [INF] : from='mgr.66351528 '
> entity='' cmd=[{"prefix": "config rm", "who": "osd/host:osd-3", "name":
> "osd_memory_target"}]: dispatch
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.273+0000
> 7f6e854ba700  0 log_channel(cluster) log [INF] : mon.osd-1 calling monitor
> election
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.273+0000
> 7f6e854ba700  1 paxos.0).electionLogic(26627) init, last seen epoch 26627,
> mid-election, bumping
> Sep 29 15:06:27 osd-1 bash[423735]: debug 2021-09-29T13:06:27.282+0000
> 7f6e854ba700  1 mon.osd-1@0(electing) e11 collect_metadata md126:  no
> unique device id for md126: fallback method has no model nor serial'
> Sep 29 15:06:28 osd-1 bash[423735]: debug 2021-09-29T13:06:28.050+0000
> 7f6e844b8700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:06:28 osd-1 bash[423735]: debug 2021-09-29T13:06:28.250+0000
> 7f6e844b8700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
> Sep 29 15:06:28 osd-1 bash[423735]: debug 2021-09-29T13:06:28.651+0000
> 7f6e844b8700  1 mon.osd-1@0(electing) e11 handle_auth_request failed to
> assign global_id
>
> osd-1 # journalctl -f -u
> ceph-55633ec3-6c0c-4a02-990c-0f87e0f7a01f@xxxxxxxxxxxxxx-1.qkzuas.service
> Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000
> 7f994ec61700  0 ms_deliver_dispatch: unhandled message 0x55ecdc463500
> client_session(request_renewcaps seq 88463) from client.60598827 v1:
> 172.16.59.39:0/1389838619
> Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000
> 7f994ec61700  0 ms_deliver_dispatch: unhandled message 0x55ece3a0cfc0
> client_session(request_renewcaps seq 88463) from client.60598821 v1:
> 172.16.59.39:0/858534994
> Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000
> 7f994ec61700  0 ms_deliver_dispatch: unhandled message 0x55ece1e24540
> client_session(request_renewcaps seq 88459) from client.60591845 v1:
> 172.16.59.7:0/1705034209
> Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000
> 7f994ec61700  0 ms_deliver_dispatch: unhandled message 0x55ece055f340
> client_session(request_renewcaps seq 88462) from client.60598851 v1:
> 172.16.59.26:0/763945533
> Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000
> 7f994ec61700  0 ms_deliver_dispatch: unhandled message 0x55ecdcb97c00
> client_session(request_renewcaps seq 88459) from client.60591994 v1:
> 172.16.59.7:0/4158829178
> Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000
> 7f994ec61700  0 ms_deliver_dispatch: unhandled message 0x55ecdfa9bc00
> client_session(request_renewcaps seq 86286) from client.60712226 v1:
> 172.16.59.64:0/1098377799
> Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000
> 7f994ec61700  0 ms_deliver_dispatch: unhandled message 0x55ec336dc000
> client_session(request_renewcaps seq 88463) from client.60591563 v1:
> 172.16.59.39:0/1765846930
> Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000
> 7f994ec61700  0 ms_deliver_dispatch: unhandled message 0x55ecdae976c0
> client_session(request_renewcaps seq 86592) from client.60695401 v1:
> 172.16.59.27:0/2213843285
> Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000
> 7f994ec61700  0 ms_deliver_dispatch: unhandled message 0x55ecdf211a40
> client_session(request_renewcaps seq 88461) from client.60599085 v1:
> 172.16.59.19:0/1476359719
> Sep 29 15:19:52 osd-1 bash[254093]: debug 2021-09-29T13:19:52.349+0000
> 7f994ec61700  0 ms_deliver_dispatch: unhandled message 0x55ecdec1d340
> client_session(request_renewcaps seq 88463) from client.60591566 v1:
> 172.16.59.39:0/3197981635
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


