IO stalls when primary OSD device blocks in 17.2.6

Daniel Schreiber <daniel.schreiber@xxxxxxxxxxxxxxxxxx> · Fri, 10 Nov 2023 11:41:55 +0100

Dear cephers,

we are sometimes observing stalled IO on our Ceph 17.2.6 cluster when
the backing device for the primary OSD of a PG fails and seems to block
read IO to objects from that PG. If I set the OSD with the broken device
to out, the IO continues. Setting the OSD to down is not sufficient.
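
For reference, an OSD is marked down or out with `ceph osd down <id>` and `ceph osd out <id>`. A minimal sketch of how such a call might be scripted follows; the helper names and the OSD id 12 are hypothetical placeholders, not values from our cluster.

```python
import subprocess


def osd_state_command(osd_id: int, state: str) -> list[str]:
    """Build the argv for marking an OSD 'down' or 'out' via the ceph CLI."""
    if state not in ("down", "out"):
        raise ValueError(f"unsupported state: {state!r}")
    return ["ceph", "osd", state, str(osd_id)]


def run(argv: list[str]) -> None:
    # Executes the command; assumes a host with the ceph CLI and an admin keyring.
    subprocess.run(argv, check=True)


if __name__ == "__main__":
    # OSD id 12 is a made-up example; this only prints the command, it does not run it.
    print(" ".join(osd_state_command(12, "out")))
```

The sketch only builds and prints the command; calling `run()` on the result would actually change the OSD state, so it is left out of the example.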

The cluster is running on Debian 11, and the pool is an erasure-coded
CephFS data pool. The OSD has an HDD data device and an SSD DB device.
The data device is the one that failed and was blocking IO.

The OSD was reporting slow ops, and shortly after that smartd reported
unreadable sectors.

Has anyone seen such behaviour? Are there any tweaks that I missed?

Kind regards,

Daniel
--
Daniel Schreiber
Facharbeitsgruppe Systemsoftware
Universitaetsrechenzentrum

Technische Universität Chemnitz
Straße der Nationen 62 (Raum B303)
09111 Chemnitz
Germany

Tel:     +49 371 531 35444
Fax:     +49 371 531 835444
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx