Re: Flapping OSDs on pacific 16.2.10

At the network level we're using bonds (802.3ad). There are two NICs, each with two 25 Gbps ports. One port per NIC is used for the public network and the other for the replication network, which gives each network a theoretical bandwidth of 50 Gbps. The network graphs show loads of around 100 MB/s on the public network interface, and less on the replication network. No dropped packets or network errors are reported. AFAIK, the network is not getting overloaded.
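For reference, this is roughly how I'm watching the error and drop counters on the bond members; a minimal sketch that reads the kernel's sysfs counters directly, and the interface names are only examples:

#!/usr/bin/env python3
# Minimal sketch: read per-NIC error/drop counters from sysfs.
# The interface names are examples; substitute your own bond members.
from pathlib import Path

IFACES = ["ens1f0", "ens2f0"]  # e.g. the public-network bond members
COUNTERS = ["rx_errors", "tx_errors", "rx_dropped", "tx_dropped"]

for iface in IFACES:
    stats = Path("/sys/class/net") / iface / "statistics"
    values = {c: int((stats / c).read_text()) for c in COUNTERS}
    print(iface, values)

Anything steadily climbing there would contradict what the graphs are telling me.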


On 1/18/23 08:28, Danny Webb wrote:
Do you have any network congestion or packet loss on the replication network? Are you sharing NICs between the public and replication networks? That is another metric that needs looking into.
------------------------------------------------------------------------
*From:* J-P Methot <jp.methot@xxxxxxxxxxxxxxxxx>
*Sent:* 18 January 2023 12:42
*To:* ceph-users <ceph-users@xxxxxxx>
*Subject:* Flapping OSDs on pacific 16.2.10

Hi,

We have an all-SSD production cluster, deployed with cephadm and running
Pacific 16.2.10, that is experiencing OSD flapping issues. Essentially,
random OSDs get kicked out of the cluster and then automatically brought
back in, a few times a day. As an example, let's take the case of osd.184:

- It flapped 9 times between January 15th and 17th, each time with the
following log message (see the reporter tally sketch after this list):

  2023-01-15T16:33:19.903+0000 prepare_failure osd.184 from osd.49 is reporting failure:1

- On January 17th, it complained about slow ops and spammed its logs with
the following line:

  heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.000000954s

The storage node itself still has over 30 GB of RAM available as cache,
the drives only seldom peak at 100% usage (and never for more than a few
seconds), and CPU usage sits constantly around 5%. Considering there are
no other error messages in any of the regular logs, including the systemd
logs, why would this OSD fail to reply to heartbeats?
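In case it helps, this is the kind of check I plan to run next against the OSD's admin socket: a minimal sketch that pulls the heartbeat ping times and recent slow ops (with cephadm the socket lives inside the container, so I would run it from cephadm shell on the OSD's host):

#!/usr/bin/env python3
# Minimal sketch: query osd.184's admin socket for heartbeat ping times
# (dump_osd_network) and recent slow ops (dump_historic_slow_ops).
# Assumes the admin socket is reachable, e.g. from inside `cephadm shell`.
import json
import subprocess

def daemon_cmd(daemon, *args):
    out = subprocess.check_output(["ceph", "daemon", daemon, *args])
    return json.loads(out)

# Per-peer front/back network ping times; slow entries here would explain
# missed heartbeats even when the disks and CPU look fine.
net = daemon_cmd("osd.184", "dump_osd_network")
for entry in net.get("entries", []):
    print(entry)

slow = daemon_cmd("osd.184", "dump_historic_slow_ops")
print(f"recent slow ops: {len(slow.get('ops', []))}")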

--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.

*Danny Webb*
Principal OpenStack Engineer
Danny.Webb@xxxxxxxxxxxxxxx


--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



