Ceph with OpenNebula - Down OSD leads to kernel errors


Hi Ceph Users, 

We have deployed a cloud infrastructure using Ceph (version 0.80.1) as the
storage solution and OpenNebula (version 4.6.1) for the compute nodes. The
Ceph pool is configured with a replication factor of 3.

We observed that one OSD was down. We checked the VMs (running CentOS 5.10
Final and Ubuntu 14.04) and found the following kernel errors/messages.
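
For reference, this is how we confirmed the down OSD, using the standard
Ceph CLI (commands as documented for the 0.80 "Firefly" release; output
obviously varies per cluster):

```shell
# Show overall cluster health, including which OSDs are down/flapping
ceph health detail

# Show the OSD tree with per-OSD up/down and weight status
ceph osd tree
```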

KERNEL ERRORS SEEN ON CENTOS 5.10 FINAL
hda: task_out_intr: status=0x50 { DriveReady SeekComplete }
ide: failed opcode was: unknown
hdc: dma_timer_expiry: dma status == 0x21
hda: irq timeout: status=0xd0 { Busy }
ide: failed opcode was: unknown
ide0: reset: success

KERNEL ERRORS SEEN ON UBUNTU 14.04
[2955698.353338] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[2955698.356164] ata1.00: failed command: WRITE DMA
[2955698.358428] ata1.00: cmd ca/00:08:58:a6:79/00:00:00:00:00/e0 tag 0 dma 4096 out
[2955698.358428] res 40/00:02:00:08:00/00:00:00:00:00/b0 Emask 0x4 (timeout)
[2955698.363070] ata1.00: status: { DRDY }
[2955698.364598] ata1: soft resetting link
[2955698.522853] ata1.00: configured for MWDMA2
[2955698.523840] ata1.01: configured for MWDMA2
[2955698.524447] ata1.00: device reported invalid CHS sector 0
[2955698.524476] ata1: EH complete
[2956272.037421] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[2956272.037424] ata1.00: failed command: WRITE DMA
[2956272.037429] ata1.00: cmd ca/00:08:58:a6:79/00:00:00:00:00/e0 tag 0 dma 4096 out
[2956272.037429] res 40/00:02:00:08:00/00:00:00:00:00/b0 Emask 0x4 (timeout)
[2956272.037430] ata1.00: status: { DRDY }
[2956272.037543] ata1: soft resetting link
[2956272.193802] ata1.00: configured for MWDMA2
[2956272.194259] ata1.01: configured for MWDMA2
[2956272.194546] ata1.00: device reported invalid CHS sector 0
[2956272.194560] ata1: EH complete

We observed for a few hours and concluded that the OSD was flapping, so we
decided to remove it from the cluster. We checked the VMs again, but these
errors are still appearing.
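
For completeness, we removed the OSD following the usual Firefly-era
sequence (shown here with a placeholder ID osd.12; the init-script path
varies by distro):

```shell
# Mark the OSD out so its placement groups migrate to other OSDs
ceph osd out 12

# Stop the OSD daemon on its host
sudo /etc/init.d/ceph stop osd.12

# Remove it from the CRUSH map, delete its auth key, and remove the OSD entry
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12
```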

Any suggestions for our next steps?


Regards,
Pons
Apollo Global Corp. 

 