Hi all,

I just had a very weird incident on our production cluster. An OSD was reporting >50K slow ops. Upon further investigation I observed exceptionally high network traffic on 3 out of the 12 hosts serving this OSD's pools, one of them being the host with the slow-ops OSD (ceph-09); see the image here (bytes received): https://imgur.com/a/gPQDiq5. The incoming data bandwidth on these hosts is about 700 MB/s (or a factor of 4) higher than on all other hosts.

The strange thing is that this OSD is not part of any 3x replicated pool. The 2 pools of this OSD are 8+2 and 8+3 EC pools. Hence, this is neither user nor replication traffic. It looks like 3 OSDs in that pool decided to have a private meeting and ignore everything around them.

My first attempt at recovery was:

  ceph osd set norecover
  ceph osd set norebalance
  ceph osd out 669

and then wait. Indeed, PGs peered and user IO bandwidth went up by a factor of 2. In addition, the slow ops count started falling. In the image, the execution of these commands is visible as the peak at 10:45. After about 3 minutes, the slow ops count was 0, and I set the OSD back to in and unset all flags. Nothing happened; the cluster just continued operating normally.

Does anyone have an explanation for what I observed? It looks a lot like a large amount of fake traffic: 3 OSDs just sending packets in circles. During recovery, the OSD with 50K slow ops had nearly no disk IO, so I do not believe this was actual IO. I rather suspect that it was internal communication going bonkers. Since the impact is quite high, it would be nice to have a pointer as to what might have happened.

Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
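
P.S. For reference, a minimal sketch of the full command sequence described above, including the revert once the slow ops had cleared, plus a few standard checks. osd.669 is just the id from my cluster, and the diagnostic commands at the end are general suggestions for this kind of incident, not a claim that they pinpoint the cause:

  # pause recovery/rebalancing so taking the OSD out does not
  # immediately trigger data movement
  ceph osd set norecover
  ceph osd set norebalance

  # take the suspect OSD out of the data distribution and wait
  # for the PGs to re-peer and the slow ops counter to drop
  ceph osd out 669

  # once "ceph -s" / "ceph health detail" show the slow ops gone,
  # put the OSD back in and remove the flags again
  ceph osd in 669
  ceph osd unset norebalance
  ceph osd unset norecover

  # things worth looking at while such an incident is ongoing
  # (the "ceph daemon" commands go through the admin socket on
  # the host that carries osd.669)
  ceph health detail
  ceph daemon osd.669 dump_ops_in_flight
  ceph daemon osd.669 dump_historic_ops
  ceph osd perf      # per-OSD commit/apply latency
  iostat -x 1        # is the disk behind the OSD actually busy?
  sar -n DEV 1       # per-interface network throughput on the host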