Re: OSDs cannot join, MON leader at 100%

Hi Paul,

we might have found the reason for MONs going silly on our cluster. There is a message size parameter whose default seems way too large. We reduced it today from 10M (the default) to 1M and have not observed any silly MONs since:

ceph config set global osd_map_message_max_bytes $((1*1024*1024))
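
To double-check that the change has taken effect, something along these lines should work; the daemon name mon.ceph-01 is only a placeholder for one of your MONs:

# value stored in the cluster configuration database
ceph config dump | grep osd_map_message_max_bytes
# value a running daemon actually uses (via its admin socket, on the MON host)
ceph daemon mon.ceph-01 config get osd_map_message_max_bytes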

I cannot guarantee that this is the fix. However, after setting the above I observed one window where a MON had a high packet-out load, and it remained responsive and did not go to 100% CPU. Maybe worth a try? I will keep observing.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 10 February 2021 17:32:07
To: Paul Mezzanini; ceph-users@xxxxxxx
Subject:  Re: OSDs cannot join, MON leader at 100%

It has become a lot more severe after adding a large number of disks. I opened a tracker issue:

https://tracker.ceph.com/issues/49231

In case you have additional information, feel free to add it there.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Paul Mezzanini <pfmeec@xxxxxxx>
Sent: 29 January 2021 20:04:12
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re: OSDs cannot join, MON leader at 100%

We are currently running 3 MONs. When one goes into silly town, the others get wedged and won't respond well. I don't think more MONs would solve that... but I'm not sure.

--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfmeec@xxxxxxx

CONFIDENTIALITY NOTE: The information transmitted, including attachments, is
intended only for the person(s) or entity to which it is addressed and may
contain confidential and/or privileged material. Any review, retransmission,
dissemination or other use of, or taking of any action in reliance upon this
information by persons or entities other than the intended recipient is
prohibited. If you received this in error, please contact the sender and
destroy any copies of this information.
------------------------

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Friday, January 29, 2021 12:58 PM
To: Paul Mezzanini; ceph-users@xxxxxxx
Subject: Re: OSDs cannot join, MON leader at 100%

Hi Paul,

thanks for sharing. I have the MONs on 2x10G bonded active-active. They don't manage to saturate 10G, but the CPU core is overloaded.

How many MONs do you have? I believe I have never seen more than 2 in this state for an extended period of time. My plan is to go from 3 to 5, which would still leave a working sub-cluster of 3 even with two MONs affected, and I would be less hesitant to restart an affected MON right away.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Paul Mezzanini <pfmeec@xxxxxxx>
Sent: 29 January 2021 17:44:42
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re: OSDs cannot join, MON leader at 100%

We've been watching our MONs go unresponsive with a saturated 10GbE NIC.  The problem seems to be aggravated by peering.  We were shrinking the PG count on one of our large pools and it was happening a bunch.  Once that finished it seemed to calm down.  Yesterday I had an OSD go down and as it was rebalancing we had another MON go into silly mode.  We recover from this situation by just restarting the MON process on the hung node.


We are running 14.2.15.

I wish I could tell you what the problem actually is and how to fix it.  At least we aren't alone in this failure mode.

--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfmeec@xxxxxxx

CONFIDENTIALITY NOTE: The information transmitted, including attachments, is
intended only for the person(s) or entity to which it is addressed and may
contain confidential and/or privileged material. Any review, retransmission,
dissemination or other use of, or taking of any action in reliance upon this
information by persons or entities other than the intended recipient is
prohibited. If you received this in error, please contact the sender and
destroy any copies of this information.
------------------------

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Friday, January 29, 2021 5:22 AM
To: ceph-users@xxxxxxx
Subject:  OSDs cannot join, MON leader at 100%

Dear cephers,

I was doing some maintenance yesterday involving shutdown/power-up cycles of ceph servers. With the last server I ran into a problem. The server runs an MDS and a couple of OSDs. After the reboot, the MDS joined the MDS cluster without problems, but the OSDs didn't come up. This was 1 out of 12 servers; I had no such problems with the other 11. I also observed that "ceph status" was responding very slowly.

Upon further inspection, I found that 2 of my 3 MONs (the leader and a peon) were running at 100% CPU. Client I/O continued, probably because the last cluster map remained valid. In our node performance monitoring I could see that the 2 busy MONs were showing extraordinary network activity.

This state lasted for over one hour. After the MONs settled down, the OSDs finally joined as well and everything went back to normal.

The other instance where I have seen similar behaviour was when I restarted a MON on an empty disk and the re-sync was extremely slow due to a too-large value of mon_sync_max_payload_size. This time, I'm pretty sure it was MON-client communication; see below.
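
For reference, the workaround in that earlier case was simply to reduce that value, roughly along the lines below; the 64K shown here is only an illustrative number, not necessarily the value we settled on:

ceph config set mon mon_sync_max_payload_size $((64*1024))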

Are there any settings similar to mon_sync_max_payload_size that could influence responsiveness of MONs in a similar way?

Why do I suspect it is MON-client communication? In our monitoring, I do not see the huge number of packets sent by the MONs arriving at any other ceph daemon. They seem to be distributed over the client nodes, but since we have a large number of client nodes (>550), this is hidden in the background network traffic. A second clue is that I have had such extended lock-ups before and, whenever I checked, I only observed them when the leader had a large share of the client sessions.

For example, yesterday the client session count per MON was:

ceph-01: 1339 (leader)
ceph-02:  189 (peon)
ceph-03:  839 (peon)

I usually restart the leader when such a critical distribution occurs. As long as the leader has the fewest client sessions, I never observe this problem.
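
In case it is useful to others, this is roughly how I check the distribution and restart the leader; the daemon name and the grep patterns are only examples from our setup and may need adjusting:

# find out which MON is currently the leader
ceph quorum_status | grep quorum_leader_name
# count client sessions on a MON (run on the respective MON host)
ceph daemon mon.ceph-01 sessions | grep -c client
# restart the leader if it holds by far the most client sessions
systemctl restart ceph-mon@ceph-01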

Ceph version is 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable).

Thanks for any clues!

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx