Do you have CPU soft lock-ups around these times? We had these timeouts due to using the cfq/bfq disk schedulers with SSDs. The osd_op_tp thread timeout is typical when CPU lockups happen. It could be a sporadic problem with the disk I/O path.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: J-P Methot <jp.methot@xxxxxxxxxxxxxxxxx>
Sent: 18 January 2023 14:49:54
To: Danny Webb; ceph-users
Subject: Re: Flapping OSDs on pacific 16.2.10

At the network level we're using bonds (802.3ad). There are two NICs, each with two 25 Gbps ports. One port per NIC is used for the public network, the other for the replication network. That gives a theoretical bandwidth of 50 Gbps for each network. The network graph shows loads of around 100 MB/s on the public network interface, and less on the replication network. No dropped packets or network errors are reported. AFAIK, the network is not getting overloaded.

On 1/18/23 08:28, Danny Webb wrote:
> Do you have any network congestion or packet loss on the replication
> network? Are you sharing NICs between public / replication? That is
> another metric that needs looking into.
> ------------------------------------------------------------------------
> *From:* J-P Methot <jp.methot@xxxxxxxxxxxxxxxxx>
> *Sent:* 18 January 2023 12:42
> *To:* ceph-users <ceph-users@xxxxxxx>
> *Subject:* Flapping OSDs on pacific 16.2.10
>
> Hi,
>
> We have a full-SSD production cluster running Pacific 16.2.10, deployed
> with cephadm, that is experiencing OSD flapping issues. Essentially,
> random OSDs get kicked out of the cluster and then automatically brought
> back in a few times a day. As an example, let's take the case of osd.184:
>
> - It flapped 9 times between January 15th and 17th, with the following log
> message each time: 2023-01-15T16:33:19.903+0000 prepare_failure
> osd.184 from osd.49 is reporting failure:1
>
> - On January 17th, it complains about slow ops and spams its logs with
> the following line: heartbeat_map is_healthy 'OSD::osd_op_tp
> thread 0x7f346aa64700' had timed out after 15.000000954s
>
> The storage node itself has over 30 GB of RAM still available in cache,
> and the drives themselves only seldom peak at 100% usage, never for more
> than a few seconds. CPU usage is also constantly around 5%. Considering
> there are no other error messages in any of the regular logs, including
> the systemd logs, why would this OSD not reply to heartbeats?
>
> --
> Jean-Philippe Méthot
> Senior Openstack system administrator
> Administrateur système Openstack sénior
> PlanetHoster inc.
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
> *Danny Webb*
> Principal OpenStack Engineer
> Danny.Webb@xxxxxxxxxxxxxxx
> www.thg.com

--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
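
For reference, a quick way to check whether the cfq/bfq scheduler and soft-lockup theory applies on an OSD host is sketched below, assuming root access; the device names are only examples and need adjusting to the actual drives:

    # Show the active I/O scheduler per block device; the [bracketed] entry is the one in use
    grep . /sys/block/sd*/queue/scheduler /sys/block/nvme*n1/queue/scheduler 2>/dev/null

    # Look for CPU soft lockups or blocked tasks in the kernel log around the flap times
    dmesg -T | grep -iE 'soft lockup|hung task|blocked for more than'
    journalctl -k --since "2023-01-15" | grep -iE 'soft lockup|watchdog'

    # Switch an SSD to a scheduler better suited to flash (none or mq-deadline), e.g. /dev/sdb (example device)
    echo none > /sys/block/sdb/queue/scheduler

If changing the scheduler helps, it is better made persistent with a udev rule, since the echo above does not survive a reboot.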
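On the network side, the per-slave state of the 802.3ad bond and the raw NIC error counters are also worth checking directly on the hosts, since averaged graphs can hide short bursts of drops; the bond and interface names below are examples, not the actual names on these hosts:

    # 802.3ad state and link-failure counts for each slave of the bond (example: bond0)
    cat /proc/net/bonding/bond0

    # Per-interface error/drop counters as seen by the kernel and the driver (example: ens1f0)
    ip -s link show ens1f0
    ethtool -S ens1f0 | grep -iE 'drop|err|discard|pause'

    # Basic latency/loss probe between two OSD hosts on the replication network
    ping -c 100 -i 0.2 <replication-network IP of another OSD host>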
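On the Ceph side, a few commands that may help narrow down which OSDs are reporting osd.184 down and limit the impact of the flapping while debugging; setting the flags is a temporary measure that masks genuine failures, so they should be removed once the investigation is done:

    # Recent cluster log entries mentioning the flapping OSD (failure reports, boot messages)
    ceph log last 1000 | grep osd.184

    # Heartbeat and op-thread timeout settings currently in effect
    ceph config get osd osd_heartbeat_grace
    ceph config get osd osd_op_thread_timeout

    # Temporarily prevent the cluster from marking OSDs down/out during the investigation
    ceph osd set nodown
    ceph osd set noout

    # Remove the flags afterwards
    ceph osd unset nodown
    ceph osd unset noout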