On Wed, Nov 11, 2015 at 9:54 AM, Marius Vaitiekunas
<mariusvaitiekunas@xxxxxxxxx> wrote:
> Hi,
>
> We use Firefly 0.80.9.
>
> We have some Ceph nodes in our cluster configured to use RAID0. The
> node configuration looks like this:
>
> 2xHDD - RAID1 - /dev/sda - OS
> 1xSSD - RAID0 - /dev/sdb - Ceph journaling disk, usually one for four
> data disks
> 1xHDD - RAID0 - /dev/sdc - Ceph data disk
> 1xHDD - RAID0 - /dev/sdd - Ceph data disk
> 1xHDD - RAID0 - /dev/sde - Ceph data disk
> 1xHDD - RAID0 - /dev/sdf - Ceph data disk
> ....
>
> We have write cache enabled on the RAID0 volumes. Everything is fine
> while it works, but we had one strange incident with the cluster. It
> looks like the SSD failed and Linux didn't remove it from the system.
> All data disks using this SSD for journaling started to flap
> (up/down). Cluster performance dropped terribly. We managed to
> replace the SSD and everything went back to normal.

What was the failing drive actually giving Ceph? EIO errors? Was it
still readable in terms of listing partitions etc.? And was the
ceph-osd process itself flapping (is something restarting it?), or was
it just the mons' idea of whether it was up or down that changed?

> Could it be related to RAID0 usage, or did we encounter some other
> bug? We haven't found anything similar on Google. Any thoughts would
> be very appreciated. Thanks in advance.

You might find it interesting to follow up with whoever provides the
RAID controller/software you're using, to find out why the drive
failure manifested itself in some way other than the drive becoming
fully inaccessible - which, IIRC, is pretty much what we expect in
order for the OSD to properly go away.

John
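
One way to gather the evidence John is asking about - a minimal sketch
rather than an official Ceph tool, assuming the ceph CLI (with a working
admin keyring) and procps ps are available on the affected OSD host - is
to compare the monitors' idea of each OSD's up/down state with the
elapsed runtime of the local ceph-osd processes:

#!/usr/bin/env python
# Rough diagnostic sketch (illustration only, not an official Ceph tool).
# It compares the cluster's view of OSD state with the local process view:
# - mon_view(): what the monitors think (ceph osd dump)
# - process_view(): how long each local ceph-osd process has been running
# Assumes the ceph CLI and procps ps are on PATH on the OSD host.

import json
import subprocess


def mon_view():
    # The monitors' idea of which OSDs are up/in.
    out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
    dump = json.loads(out.decode("utf-8"))
    for osd in dump["osds"]:
        state = "up" if osd["up"] else "DOWN"
        print("osd.%-3d %s (in=%d)" % (osd["osd"], state, osd["in"]))


def process_view():
    # Elapsed runtime of the local ceph-osd processes: if these keep
    # resetting to a few seconds, the daemons are dying and being
    # restarted locally rather than just being marked down by the mons.
    try:
        out = subprocess.check_output(
            ["ps", "-C", "ceph-osd", "-o", "pid=,etime=,args="])
        print(out.decode("utf-8"))
    except subprocess.CalledProcessError:
        print("no ceph-osd processes running on this host")


if __name__ == "__main__":
    mon_view()
    process_view()

If the mons keep marking an OSD down while its ceph-osd process shows
days of elapsed time, the flapping is in the cluster's view of the daemon
(e.g. heartbeats stalling behind the unresponsive journal device); a
process whose elapsed time keeps resetting is dying and being restarted
locally.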