Do you have CPU soft lock-ups around these times? We had these timeouts due to using the cfq/bfq disk schedulers with SSDs. The osd_op_tp thread timeout is typical when CPU lockups happen. It could be a sporadic problem with the disk I/O path.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: J-P Methot <jp.methot@xxxxxxxxxxxxxxxxx>
Sent: 18 January 2023 14:49:54
To: Danny Webb; ceph-users
Subject: Re: Flapping OSDs on pacific 16.2.10

At the network level we're using bonds (802.3ad). There are two NICs, each with two 25 Gbps ports. One port per NIC is used for the public network, the other for the replication network. That gives a theoretical bandwidth of 50 Gbps for each network. The network graph shows loads of around 100 MB/s on the public network interface, and less on the replication network. No dropped packets or network errors are reported. AFAIK, the network is not getting overloaded.

On 1/18/23 08:28, Danny Webb wrote:
> Do you have any network congestion or packet loss on the replication
> network? Are you sharing NICs between public / replication? That is
> another metric that needs looking into.
> ------------------------------------------------------------------------
> *From:* J-P Methot <jp.methot@xxxxxxxxxxxxxxxxx>
> *Sent:* 18 January 2023 12:42
> *To:* ceph-users <ceph-users@xxxxxxx>
> *Subject:* Flapping OSDs on pacific 16.2.10
>
> Hi,
>
> We have a full-SSD production cluster running Pacific 16.2.10, deployed
> with cephadm, that is experiencing OSD flapping issues. Essentially,
> random OSDs get kicked out of the cluster and then automatically brought
> back in a few times a day. As an example, let's take the case of osd.184:
>
> - It flapped 9 times between January 15th and 17th, with the following log
> message each time: 2023-01-15T16:33:19.903+0000 prepare_failure
> osd.184 from osd.49 is reporting failure:1
>
> - On January 17th, it complains about slow ops and spams its logs with
> the following line: heartbeat_map is_healthy 'OSD::osd_op_tp
> thread 0x7f346aa64700' had timed out after 15.000000954s
>
> The storage node itself has over 30 GB of RAM still available in cache,
> and the drives themselves only seldom peak at 100% usage, never for more
> than a few seconds. CPU usage is also constantly around 5%. Considering
> there are no other error messages in any of the regular logs, including
> the systemd logs, why would this OSD not reply to heartbeats?
>
> --
> Jean-Philippe Méthot
> Senior Openstack system administrator
> Administrateur système Openstack sénior
> PlanetHoster inc.
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
> *Danny Webb*
> Principal OpenStack Engineer
> Danny.Webb@xxxxxxxxxxxxxxx
> www.thg.com

--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
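
For reference, a quick way to check whether the cfq/bfq scheduler and soft-lockup theory applies on an OSD host is sketched below, assuming root access; the device names are only examples and need adjusting to the actual drives:

    # Show the active I/O scheduler per block device; the [bracketed] entry is the one in use
    grep . /sys/block/sd*/queue/scheduler /sys/block/nvme*n1/queue/scheduler 2>/dev/null

    # Look for CPU soft lockups or blocked tasks in the kernel log around the flap times
    dmesg -T | grep -iE 'soft lockup|hung task|blocked for more than'
    journalctl -k --since "2023-01-15" | grep -iE 'soft lockup|watchdog'

    # Switch an SSD to a scheduler better suited to flash (none or mq-deadline), e.g. /dev/sdb (example device)
    echo none > /sys/block/sdb/queue/scheduler

If changing the scheduler helps, it is better made persistent with a udev rule, since the echo above does not survive a reboot.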
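On the network side, the per-slave state of the 802.3ad bond and the raw NIC error counters are also worth checking directly on the hosts, since averaged graphs can hide short bursts of drops; the bond and interface names below are examples, not the actual names on these hosts:

    # 802.3ad state and link-failure counts for each slave of the bond (example: bond0)
    cat /proc/net/bonding/bond0

    # Per-interface error/drop counters as seen by the kernel and the driver (example: ens1f0)
    ip -s link show ens1f0
    ethtool -S ens1f0 | grep -iE 'drop|err|discard|pause'

    # Basic latency/loss probe between two OSD hosts on the replication network
    ping -c 100 -i 0.2 <replication-network IP of another OSD host>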
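On the Ceph side, a few commands that may help narrow down which OSDs are reporting osd.184 down and limit the impact of the flapping while debugging; setting the flags is a temporary measure that masks genuine failures, so they should be removed once the investigation is done:

    # Recent cluster log entries mentioning the flapping OSD (failure reports, boot messages)
    ceph log last 1000 | grep osd.184

    # Heartbeat and op-thread timeout settings currently in effect
    ceph config get osd osd_heartbeat_grace
    ceph config get osd osd_op_thread_timeout

    # Temporarily prevent the cluster from marking OSDs down/out during the investigation
    ceph osd set nodown
    ceph osd set noout

    # Remove the flags afterwards
    ceph osd unset nodown
    ceph osd unset noout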