Hi all,

I just had a very weird incident on our production cluster. An OSD was reporting >50K slow ops. Upon further investigation I observed exceptionally high network traffic on 3 out of the 12 hosts serving this OSD's pools, one of them being the host with the slow-ops OSD (ceph-09); see the image here (bytes received): https://imgur.com/a/gPQDiq5. The incoming data bandwidth on these hosts is about 700 MB/s (or a factor of 4) higher than on all other hosts.

The strange thing is that this OSD is not part of any 3x replicated pool. The 2 pools of this OSD are 8+2 and 8+3 EC pools. Hence, this is neither user nor replication traffic. It looks like 3 OSDs in that pool decided to have a private meeting and ignore everything around them.

My first attempt at recovery was:

  ceph osd set norecover
  ceph osd set norebalance
  ceph osd out 669

and then wait. Indeed, PGs peered and user IO bandwidth went up by a factor of 2. In addition, the slow ops count started falling. In the image, the execution of these commands is visible as the peak at 10:45. After about 3 minutes, the slow ops count was 0, and I set the OSD back to in and unset all flags. Nothing happened; the cluster just continued operating normally.

Does anyone have an explanation for what I observed? It looks a lot like a large amount of fake traffic: 3 OSDs just sending packets in circles. During recovery, the OSD with 50K slow ops had nearly no disk IO, so I do not believe this was actual IO. I rather suspect that it was internal communication going bonkers. Since the impact is quite high, it would be nice to have a pointer as to what might have happened.

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
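
P.S. For reference, a minimal sketch of the full command sequence described above, including the revert once the slow ops had cleared, plus a few standard checks. osd.669 is just the id from my cluster, and the diagnostic commands at the end are general suggestions for this kind of incident, not a claim that they pinpoint the cause:

  # pause recovery/rebalancing so taking the OSD out does not
  # immediately trigger data movement
  ceph osd set norecover
  ceph osd set norebalance

  # take the suspect OSD out of the data distribution and wait
  # for the PGs to re-peer and the slow ops counter to drop
  ceph osd out 669

  # once "ceph -s" / "ceph health detail" show the slow ops gone,
  # put the OSD back in and remove the flags again
  ceph osd in 669
  ceph osd unset norebalance
  ceph osd unset norecover

  # things worth looking at while such an incident is ongoing
  # (the "ceph daemon" commands go through the admin socket on
  # the host that carries osd.669)
  ceph health detail
  ceph daemon osd.669 dump_ops_in_flight
  ceph daemon osd.669 dump_historic_ops
  ceph osd perf      # per-OSD commit/apply latency
  iostat -x 1        # is the disk behind the OSD actually busy?
  sar -n DEV 1       # per-interface network throughput on the host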