Hi,

you might be suffering from the same bug we ran into:

https://tracker.ceph.com/issues/53729
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/KG35GRTN4ZIDWPLJZ5OQOKERUIQT5WQ6/#K45MJ63J37IN2HNAQXVOOT3J6NTXIHCA

In short, there is a bug that prevents PG log items from being removed. You
need to upgrade to Pacific to get the fix. There is also a very easy way to
check whether you MIGHT be affected:

https://tracker.ceph.com/issues/53729#note-65

On Thu, 30 Mar 2023 at 17:02, <petersun@xxxxxxxxxxxx> wrote:
> We experienced a Ceph failure that left the system unresponsive, with no
> IOPS or throughput, caused by a problematic OSD process on one node. This
> resulted in slow operations and no IOPS for all other OSDs in the cluster.
> The incident timeline was as follows:
>
> An alert was triggered for an OSD problem.
> 6 out of 12 OSDs on the node went down.
> A soft restart was attempted, but a smartmontools process got stuck while
> the server was shutting down.
> A hard restart was performed and service resumed as usual.
>
> Our Ceph cluster has 19 nodes and 218 OSDs and runs version 15.2.17
> Octopus (stable).
>
> Questions:
> 1. What is Ceph's detection mechanism? Why couldn't Ceph detect the faulty
> node and automatically stop using its resources?
> 2. Did we miss any patches or bug fixes?
> 3. Do you have any suggestions for improvements so that similar issues can
> be detected and avoided more quickly in the future?

--
The "UTF-8 problems" self-help group will meet in the large hall this time,
as an exception.
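
Regarding the "easy check" linked above: below is a minimal sketch of how one
might look at the osd_pglog mempool usage per OSD as a first hint. This is my
own rough illustration, not the exact procedure from tracker note-65 (see the
note for the authoritative check). It assumes it runs on an OSD host with
access to the admin sockets and that "ceph daemon osd.<id> dump_mempools"
returns the usual mempool -> by_pool -> osd_pglog layout; the socket paths
used for discovery are an assumption as well.

#!/usr/bin/env python3
"""Rough sketch: print osd_pglog mempool usage for the OSDs on this host."""
import glob
import json
import re
import subprocess


def pglog_usage(osd_id: str) -> tuple[int, int]:
    """Return (items, bytes) of the osd_pglog mempool for one local OSD."""
    out = subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}", "dump_mempools"], text=True
    )
    # Assumed JSON layout: {"mempool": {"by_pool": {"osd_pglog": {...}}}}
    pools = json.loads(out).get("mempool", {}).get("by_pool", {})
    pglog = pools.get("osd_pglog", {})
    return pglog.get("items", 0), pglog.get("bytes", 0)


def local_osd_ids() -> list[str]:
    """Guess local OSD ids from admin socket files under /var/run/ceph.

    The recursive pattern is meant to also catch containerized layouts
    like /var/run/ceph/<fsid>/ceph-osd.N.asok (assumption).
    """
    ids = []
    for sock in glob.glob("/var/run/ceph/**/*osd.*.asok", recursive=True):
        m = re.search(r"osd\.(\d+)\.asok$", sock)
        if m:
            ids.append(m.group(1))
    return sorted(set(ids), key=int)


if __name__ == "__main__":
    for osd in local_osd_ids():
        items, nbytes = pglog_usage(osd)
        print(f"osd.{osd}: osd_pglog items={items} bytes={nbytes / 2**20:.1f} MiB")

If osd_pglog is taking up many gigabytes on the affected OSDs, that would at
least point in the direction of this bug; the definitive check is the one
described in the tracker note.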