Hi Daniel,

It's perfectly normal for a PG to freeze when its primary OSD is not stable. It can sometimes happen that the disk fails but does not immediately return I/O errors (which would crash the OSD). When the OSD is stopped, there is a 5-minute delay before it goes down in the CRUSH map.

On Fri, 10 Nov 2023 at 11:43, Daniel Schreiber <daniel.schreiber@xxxxxxxxxxxxxxxxxx> wrote:

> Dear cephers,
>
> We sometimes observe stalling IO on our Ceph 17.2.6 cluster when the
> backing device for the primary OSD of a PG fails, which seems to block
> read IO to objects from that PG. If I stop the OSD with the broken
> device, the IO continues; just setting the OSD down is not sufficient.
>
> The cluster is running on Debian 11, and the pool is an erasure-coded
> CephFS data pool. The OSD has an HDD data device and an SSD DB device.
> The data device is the one that failed and was blocking IO.
>
> The OSD was reporting slow ops, and a short time after that smartd
> notified us about unreadable sectors.
>
> Has anyone seen such behaviour? Are there some tweaks that I missed?
>
> Kind regards,
>
> Daniel
> --
> Daniel Schreiber
> Facharbeitsgruppe Systemsoftware
> Universitaetsrechenzentrum
>
> Technische Universität Chemnitz
> Straße der Nationen 62 (Raum B303)
> 09111 Chemnitz
> Germany
>
> Tel: +49 371 531 35444
> Fax: +49 371 531 835444

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
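
A minimal sketch of the commands being discussed above, for readers of the
archive; the OSD id 12 and the systemd unit name are placeholders, a
package-based (non-containerized) deployment is assumed, and defaults may
differ by release:

  # Mark the OSD down in the cluster map. A daemon that is still running
  # will usually report itself up again shortly, so this alone is often
  # not enough when the disk has not yet returned I/O errors.
  ceph osd down 12

  # Stop the daemon so it cannot re-assert itself as up.
  systemctl stop ceph-osd@12

  # Optionally mark the OSD out right away instead of waiting for the
  # automatic down->out timer (mon_osd_down_out_interval, 600 seconds
  # by default), so recovery onto other OSDs can start immediately.
  ceph osd out 12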