On Wed, Nov 11, 2015 at 9:54 AM, Marius Vaitiekunas
<mariusvaitiekunas@xxxxxxxxx> wrote:
> Hi,
>
> We use Firefly 0.80.9.
>
> We have some Ceph nodes in our cluster configured to use RAID0. The
> node configuration looks like this:
>
> 2xHDD - RAID1 - /dev/sda - OS
> 1xSSD - RAID0 - /dev/sdb - Ceph journaling disk, usually one for four
> data disks
> 1xHDD - RAID0 - /dev/sdc - Ceph data disk
> 1xHDD - RAID0 - /dev/sdd - Ceph data disk
> 1xHDD - RAID0 - /dev/sde - Ceph data disk
> 1xHDD - RAID0 - /dev/sdf - Ceph data disk
> ....
>
> We have write cache enabled on the RAID0 volumes. Everything is fine
> while it works, but we had one strange incident with the cluster. It
> looks like the SSD failed and Linux didn't remove it from the system.
> All data disks using this SSD for journaling started to flap
> (up/down). Cluster performance dropped terribly. We managed to
> replace the SSD and everything went back to normal.

What was the failing drive actually giving Ceph? EIO errors? Was it
still readable in terms of listing partitions etc.? And was the
ceph-osd process itself flapping (is something restarting it?), or was
it just the mons' idea of whether it was up or down that changed?

> Could it be related to RAID0 usage, or did we encounter some other
> bug? We haven't found anything similar on Google. Any thoughts would
> be very appreciated. Thanks in advance.

You might find it interesting to follow up with whoever provides the
RAID controller/software you're using, to find out why the drive
failure manifested itself in some way other than the drive becoming
fully inaccessible - which, IIRC, is pretty much what we expect in
order for the OSD to properly go away.

John
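
One way to gather the evidence John is asking about - a minimal sketch
rather than an official Ceph tool, assuming the ceph CLI (with a working
admin keyring) and procps ps are available on the affected OSD host - is
to compare the monitors' idea of each OSD's up/down state with the
elapsed runtime of the local ceph-osd processes:

#!/usr/bin/env python
# Rough diagnostic sketch (illustration only, not an official Ceph tool).
# It compares the cluster's view of OSD state with the local process view:
# - mon_view(): what the monitors think (ceph osd dump)
# - process_view(): how long each local ceph-osd process has been running
# Assumes the ceph CLI and procps ps are on PATH on the OSD host.

import json
import subprocess


def mon_view():
    # The monitors' idea of which OSDs are up/in.
    out = subprocess.check_output(["ceph", "osd", "dump", "--format", "json"])
    dump = json.loads(out.decode("utf-8"))
    for osd in dump["osds"]:
        state = "up" if osd["up"] else "DOWN"
        print("osd.%-3d %s (in=%d)" % (osd["osd"], state, osd["in"]))


def process_view():
    # Elapsed runtime of the local ceph-osd processes: if these keep
    # resetting to a few seconds, the daemons are dying and being
    # restarted locally rather than just being marked down by the mons.
    try:
        out = subprocess.check_output(
            ["ps", "-C", "ceph-osd", "-o", "pid=,etime=,args="])
        print(out.decode("utf-8"))
    except subprocess.CalledProcessError:
        print("no ceph-osd processes running on this host")


if __name__ == "__main__":
    mon_view()
    process_view()

If the mons keep marking an OSD down while its ceph-osd process shows
days of elapsed time, the flapping is in the cluster's view of the daemon
(e.g. heartbeats stalling behind the unresponsive journal device); a
process whose elapsed time keeps resetting is dying and being restarted
locally.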