Re: OSDs go down with infernalis

Hello Yoann, 

I think I faced the same issue while setting up my own cluster. If it is the same one, it's one of the issues many people encounter(ed) during disk initialization.
Could you please give the output of the following (a note on what to look for is below):
 - ll /dev/disk/by-partuuid/
 - ll /var/lib/ceph/osd/ceph-*
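
What I would look for is whether each OSD's journal symlink still resolves to an
existing entry under /dev/disk/by-partuuid/. A rough sketch of the comparison,
assuming the default ceph-disk layout (the OSD id below is only an example):

  # partuuid recorded for the journal when the OSD was prepared
  cat /var/lib/ceph/osd/ceph-12/journal_uuid
  # where the journal symlink actually points right now
  readlink -f /var/lib/ceph/osd/ceph-12/journal
  # partition GUIDs as udev currently sees them
  ls -l /dev/disk/by-partuuid/

If the journal_uuid no longer matches any by-partuuid entry, or the symlink
resolves to a different partition than the one prepared for that OSD, that
would be consistent with the error below.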



On Thu, Mar 3, 2016 at 3:42 PM, Yoann Moulin <yoann.moulin@xxxxxxx> wrote:
Hello,

I'm (almost) a new user of Ceph (a couple of months). At my university, we started
doing some tests with Ceph a couple of months ago.

We have 2 clusters. Each cluster has 100 OSDs on 10 servers:

Each server has this setup:

CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Memory : 128GB
OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1)
Journal Storage : 2 x SSD 400GB Intel S3300 DC (no Raid)
OSD Disk : 10 x HGST ultrastar-7k6000 6TB
Network : 1 x 10Gb/s
OS : Ubuntu 14.04
Ceph version : infernalis 9.2.0

One cluster gives access to some users through an S3 gateway (the service is
still in beta). We call this cluster "ceph-beta".

The other cluster is for our internal needs, to learn more about Ceph. We call
this cluster "ceph-test". (Those servers will be integrated into the ceph-beta
cluster when we need more space.)

We deployed both clusters with the ceph-ansible playbook [1].

Journals are raw partitions on the SSDs (400GB Intel S3300 DC) with no RAID,
5 journal partitions on each SSD.

OSD disks are formatted with XFS.

1. https://github.com/ceph/ceph-ansible

We have an issue: some OSDs go down and don't start. It seems to be related to
the fsid of the journal partition:

> -1> 2016-03-03 14:09:05.422515 7f31118d0940 -1 journal FileJournal::open: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected eeadbce2-f096-4156-ba56-dfc634e59106, invalid (someone else's?) journal

Attached are the full logs of one of the dead OSDs.
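
For reference: as far as I understand, the "expected" value in that message is
the OSD's own fsid, stored in its data directory, while the all-zero "ondisk"
value means the partition the journal symlink points at has a blank journal
header, i.e. the OSD is probably not reading the journal that was prepared for
it. A minimal sketch of how to compare the two, with an example OSD id, using
the --get-journal-fsid option that ceph-disk relies on internally, if I
remember correctly:

  # fsid this OSD expects to find in its journal header
  cat /var/lib/ceph/osd/ceph-12/fsid
  # fsid actually present in the journal the symlink points to
  ceph-osd --get-journal-fsid --osd-journal /var/lib/ceph/osd/ceph-12/journal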

We had this issue with 2 OSDs on the ceph-beta cluster; we fixed it by removing,
zapping and re-adding them.
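
For reference, a typical remove/zap/re-add sequence looks roughly like this
(OSD id and device names are only examples, and the stop command depends on
whether the node runs upstart or systemd):

  ceph osd out 12
  stop ceph-osd id=12              # upstart; "systemctl stop ceph-osd@12" on systemd
  ceph osd crush remove osd.12
  ceph auth del osd.12
  ceph osd rm 12
  ceph-disk zap /dev/sdk           # wipes the partition table of the data disk
  # then recreate the OSD, e.g. by re-running the ceph-ansible play for that
  # host, or with: ceph-disk prepare /dev/sdk /dev/sdb   (data disk, journal SSD)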

Now we have the same issue on the ceph-test cluster, but on 18 OSDs.

Here is the current status of this cluster:

> root@icadmin004:~# ceph -s
>     cluster 4fb4773c-0873-44ad-a65f-269f01bfcff8
>      health HEALTH_WARN
>             1024 pgs incomplete
>             1024 pgs stuck inactive
>             1024 pgs stuck unclean
>      monmap e1: 3 mons at {iccluster003=10.90.37.4:6789/0,iccluster014=10.90.37.15:6789/0,iccluster022=10.90.37.23:6789/0}
>             election epoch 62, quorum 0,1,2 iccluster003,iccluster014,iccluster022
>      osdmap e242: 100 osds: 82 up, 82 in
>             flags sortbitwise
>       pgmap v469212: 2304 pgs, 10 pools, 2206 bytes data, 181 objects
>             4812 MB used, 447 TB / 447 TB avail
>                 1280 active+clean
>                 1024 creating+incomplete

We installed this cluster at the beginning of February. We have not used it at
all, except at the very beginning to troubleshoot an issue with ceph-ansible.
We did not push any data nor create any pools. What could explain this behaviour?

Thanks for your help

Best regards,

--
Yoann Moulin
EPFL IC-IT



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
