Re: SLOW_OPS problems

Tim Sauerbein <sauerbein@xxxxxxxxxx> · Mon, 30 Sep 2024 15:14:15 +0100

Thanks for the replies everyone!

> On 30 Sep 2024, at 13:10, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
> 
> Remember that slow ops are a tip of the iceberg thing, you only see ones that crest above 30s

So far, metrics from the hosted VMs show no I/O slowdown other than when these hiccups occur.

> On 30 Sep 2024, at 13:35, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
> 
> there is no log attached to your post, you better share it via some other means.
> 
> BTW - what log did you mean - monitor or OSD one?
> 
> It would be nice to have logs for a couple of OSDs suffering from slow ops, preferably relevant to two different cases.

Sorry, the attachments have apparently been stripped. The gist below covers one incident (they all look the same, but I can share more if relevant) and contains the monitor log, the affected OSD logs and the iostat log:

https://gist.github.com/sauerbein/5a485a6d2546475912709743e3cfbf4b

Let me know if you need any other logs to analyse!

> On 30 Sep 2024, at 14:34, Alexander Schreiber <als@xxxxxxxxxxxxxxx> wrote:
> 
> One cause for "slow ops" I discovered are networking issues. I had slow
> ops across my entire cluster (interconnected with 10G). Turns out the
> switch was bad and achieved < 10 MBit/s on one of the 10G links.
> Replaced the switch, tested the links again - got full 10G connectivity
> and the slow ops disappeared.

Thanks for the idea. The hosts are connected to two switches with fail-over bonding, normally communicating via the same switch. I will move them all over to the second switch to rule out a switch issue.
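
A minimal sketch for confirming which bond member a host is actually using before and after such a move. It assumes Linux bonding in active-backup mode and the usual status text under /proc/net/bonding/; the bond name "bond0" and the helper names are placeholders, not taken from this setup.

```python
import re


def active_slave(bonding_status):
    """Return the interface named on the 'Currently Active Slave' line.

    Returns None if the line is missing or the kernel reports no active slave.
    """
    match = re.search(
        r"^Currently Active Slave:\s*(\S+)", bonding_status, re.MULTILINE
    )
    if match is None or match.group(1) == "None":
        return None
    return match.group(1)


def read_active_slave(bond="bond0"):
    """Read the status file for one bond (the name is an assumption)."""
    with open(f"/proc/net/bonding/{bond}") as fh:
        return active_slave(fh.read())
```

Running read_active_slave() on each host before and after the switch-over should show whether all hosts really ended up on links to the second switch.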

Best regards,
Tim
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx