Hi Szabo,

it's a switch-local network shared with an HPC cluster, in a spine-leaf topology. The storage nodes sit on leaf switches, and all leafs connect to the same spine. Everything is built with duplicated hardware and LACP bonding.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
Sent: 03 November 2022 12:24:07
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re: Strange 50K slow ops incident

Are those connected to the same switches?

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

On 2022. Nov 3., at 17:34, Frank Schilder <frans@xxxxxx> wrote:

Hi all,

I just had a very weird incident on our production cluster. An OSD was reporting >50K slow ops. Upon further investigation I observed exceptionally high network traffic on 3 out of the 12 hosts in this OSD's pools; one of them was the host with the slow-ops OSD (ceph-09). See the image here (bytes received): https://imgur.com/a/gPQDiq5. The incoming data bandwidth on these hosts is about 700 MB/s (a factor of 4) higher than on all other hosts. The strange thing is that this OSD is not part of any 3x-replicated pool. The 2 pools on this OSD are 8+2 and 8+3 EC pools. Hence, this is neither user nor replication traffic. It looks like 3 OSDs in that pool decided to have a private meeting and ignore everything around them.

My first attempt at recovery was:

ceph osd set norecover
ceph osd set norebalance
ceph osd out 669

And wait. Indeed, PGs peered and user IO bandwidth went up by a factor of 2. In addition, the slow-ops count started falling. In the image, the execution of these commands is visible as the peak at 10:45. After about 3 minutes, the slow-ops count was 0 and I set the OSD back to in and unset all flags (a sketch of the full command sequence follows below this message). Nothing further happened; the cluster just continued operating normally.

Does anyone have an explanation for what I observed? It looks a lot like a large amount of fake traffic: 3 OSDs just sending packets in circles. During recovery, the OSD with 50K slow ops had nearly no disk IO, therefore I do not believe that this was actual IO. I rather suspect that it was internal communication going bonkers. Since the impact is quite high, it would be nice to have a pointer as to what might have happened.

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
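For reference, a minimal sketch of the command sequence described above, together with the admin-socket queries one could use to look at the slow ops themselves. This is a hedged reconstruction, not taken verbatim from the incident: the OSD id 669 comes from the post, the flags are standard Ceph CLI, and the daemon queries must be run locally on the host that carries the OSD.

# Inspect the ops currently stuck on the affected OSD
# (run on the host carrying osd.669; the id is taken from the post):
ceph daemon osd.669 dump_ops_in_flight
ceph daemon osd.669 dump_historic_ops

# Recovery attempt as described in the post:
ceph osd set norecover
ceph osd set norebalance
ceph osd out 669

# ... wait for the slow-ops count to reach 0, then revert:
ceph osd in 669
ceph osd unset norebalance
ceph osd unset norecover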