OSD does not die when disk has failures

Hi,

in our cluster (Ceph 17.2.6), disks fail from time to time. The block devices are HDDs, the DB devices are NVMe. However, when a disk fails, the OSD process does not reliably die. That blocks client IO for all requests for which the OSD with the broken disk is the primary. All pools on these OSDs are EC pools (CephFS data or RBD data). Client IO recovers once I stop the OSD manually.

It seems the error was triggered during a deep scrub, because the cluster reported scrub errors afterwards.


OSD Log:

2024-03-11T20:12:43+01:00 urzceph1-osd05 bash[9695]: debug 2024-03-11T19:12:43.392+0000 7fe4cad3f700 4 rocksdb: (Original Log Time 2024/03/11-19:12:43.395747) [db/db_impl/db_impl_compaction_flush.cc:2818] Compaction nothing to do
2024-03-11T20:15:58+01:00 urzceph1-osd05 bash[9575]: debug 2024-03-11T19:15:58.285+0000 7f9182765700 -1 bdev(0x55f72b8af800 /var/lib/ceph/osd/ceph-17/block) _aio_thread got r=-5 ((5) Input/output error)
2024-03-11T20:15:58+01:00 urzceph1-osd05 bash[9575]: debug 2024-03-11T19:15:58.289+0000 7f9182765700 -1 bdev(0x55f72b8af800 /var/lib/ceph/osd/ceph-17/block) _aio_thread translating the error to EIO for upper layer
2024-03-11T20:15:58+01:00 urzceph1-osd05 bash[9575]: debug 2024-03-11T19:15:58.289+0000 7f9182765700 -1 bdev(0x55f72b8af800 /var/lib/ceph/osd/ceph-17/block) _aio_thread got r=-5 ((5) Input/output error)
2024-03-11T20:15:58+01:00 urzceph1-osd05 bash[9575]: debug 2024-03-11T19:15:58.289+0000 7f9182765700 -1 bdev(0x55f72b8af800 /var/lib/ceph/osd/ceph-17/block) _aio_thread translating the error to EIO for upper layer
2024-03-11T20:17:02+01:00 urzceph1-osd05 bash[10152]: debug 2024-03-11T19:17:02.357+0000 7fcffadf4700 4 rocksdb: [db/db_impl/db_impl_write.cc:1736] [L] New memtable created with log file: #73918. Immutable memtables: 0.
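As a stopgap, we have been thinking about a small watchdog that scans the OSD logs for these EIO lines and alerts us (so we can stop the OSD). A minimal sketch in Python, assuming the log format shown above (the regex and function name here are our own invention, not anything from Ceph):

```python
import re

# Match BlueStore aio-thread EIO lines like:
#   ... bdev(0x55f72b8af800 /var/lib/ceph/osd/ceph-17/block) _aio_thread got r=-5 ...
EIO_RE = re.compile(r"bdev\((?:0x[0-9a-f]+) (?P<dev>\S+)\) _aio_thread got r=-5")

def find_eio_devices(log_lines):
    """Return the set of BlueStore block device paths that reported EIO."""
    return {m.group("dev") for line in log_lines if (m := EIO_RE.search(line))}
```

The device path contains the OSD id (`ceph-17` above), so an alert from this could name the OSD to stop.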

Kernel Log:

[Mon Mar 11 20:15:43 2024] ata9.00: exception Emask 0x0 SAct 0xffffffff SErr 0xc0000 action 0x0
[Mon Mar 11 20:15:43 2024] ata9.00: irq_stat 0x40000008
[Mon Mar 11 20:15:43 2024] ata9: SError: { CommWake 10B8B }
[Mon Mar 11 20:15:43 2024] ata9.00: failed command: READ FPDMA QUEUED
[Mon Mar 11 20:15:43 2024] ata9.00: cmd 60/f8:38:60:b2:8e/00:00:37:00:00/40 tag 7 ncq dma 126976 in res 43/40:f0:68:b2:8e/00:00:37:00:00/40 Emask 0x409 (media error) <F>
[Mon Mar 11 20:15:43 2024] ata9.00: status: { DRDY SENSE ERR }
[Mon Mar 11 20:15:43 2024] ata9.00: error: { UNC }
[Mon Mar 11 20:15:43 2024] ata9: hard resetting link
[Mon Mar 11 20:15:43 2024] ata9: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[Mon Mar 11 20:15:43 2024] ata9.00: configured for UDMA/133
[Mon Mar 11 20:15:43 2024] sd 8:0:0:0: [sdj] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
[Mon Mar 11 20:15:43 2024] sd 8:0:0:0: [sdj] tag#7 Sense Key : Medium Error [current]
[Mon Mar 11 20:15:43 2024] sd 8:0:0:0: [sdj] tag#7 Add. Sense: Unrecovered read error - auto reallocate failed
[Mon Mar 11 20:15:43 2024] sd 8:0:0:0: [sdj] tag#7 CDB: Read(16) 88 00 00 00 00 00 37 8e b2 60 00 00 00 f8 00 00
[Mon Mar 11 20:15:43 2024] blk_update_request: I/O error, dev sdj, sector 932098664 op 0x0:(READ) flags 0x0 phys_seg 29 prio class 0
[Mon Mar 11 20:15:43 2024] ata9: EH complete
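Likewise, the failing disk could be identified automatically from the kernel log. A sketch, again assuming the `blk_update_request` format shown above (helper name made up here):

```python
import re

# Match kernel block-layer error lines like:
#   blk_update_request: I/O error, dev sdj, sector 932098664 op 0x0:(READ) ...
BLK_ERR_RE = re.compile(r"I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+)")

def find_failed_sectors(dmesg_lines):
    """Map each device with reported I/O errors to the list of failing sectors."""
    errors = {}
    for line in dmesg_lines:
        m = BLK_ERR_RE.search(line)
        if m:
            errors.setdefault(m.group("dev"), []).append(int(m.group("sector")))
    return errors
```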

Is this expected behavior or a bug? If it is expected, how can we keep client IO flowing?

Kind regards,

Daniel
--
Daniel Schreiber
Facharbeitsgruppe Systemsoftware
Universitaetsrechenzentrum

Technische Universität Chemnitz
Straße der Nationen 62 (Raum B303)
09111 Chemnitz
Germany

Tel:     +49 371 531 35444
Fax:     +49 371 531 835444


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
