Hi Daniel,

It's perfectly normal for a PG to freeze when its primary OSD is not stable. It can sometimes happen that the disk fails but does not immediately return I/O errors (which would crash the OSD). When the OSD is stopped, there is a 5-minute delay before it goes down in the CRUSH map.

On Fri, 10 Nov 2023 at 11:43, Daniel Schreiber <daniel.schreiber@xxxxxxxxxxxxxxxxxx> wrote:

> Dear cephers,
>
> We sometimes observe stalling IO on our Ceph 17.2.6 cluster when the
> backing device for the primary OSD of a PG fails, which seems to block
> read IO to objects from that PG. If I stop the OSD with the broken
> device, the IO continues; just setting the OSD down is not sufficient.
>
> The cluster is running on Debian 11, and the pool is an erasure-coded
> CephFS data pool. The OSD has an HDD data device and an SSD DB device.
> The data device is the one that failed and was blocking IO.
>
> The OSD was reporting slow ops, and a short time after that smartd
> notified us about unreadable sectors.
>
> Has anyone seen such behaviour? Are there some tweaks that I missed?
>
> Kind regards,
>
> Daniel
> --
> Daniel Schreiber
> Facharbeitsgruppe Systemsoftware
> Universitaetsrechenzentrum
>
> Technische Universität Chemnitz
> Straße der Nationen 62 (Raum B303)
> 09111 Chemnitz
> Germany
>
> Tel: +49 371 531 35444
> Fax: +49 371 531 835444

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
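
A minimal sketch of the commands being discussed above, for readers of the
archive; the OSD id 12 and the systemd unit name are placeholders, a
package-based (non-containerized) deployment is assumed, and defaults may
differ by release:

  # Mark the OSD down in the cluster map. A daemon that is still running
  # will usually report itself up again shortly, so this alone is often
  # not enough when the disk has not yet returned I/O errors.
  ceph osd down 12

  # Stop the daemon so it cannot re-assert itself as up.
  systemctl stop ceph-osd@12

  # Optionally mark the OSD out right away instead of waiting for the
  # automatic down->out timer (mon_osd_down_out_interval, 600 seconds
  # by default), so recovery onto other OSDs can start immediately.
  ceph osd out 12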