Re: OSDs go down with infernalis

Hello Yoann, 

I think I faced the same issue while setting up my own cluster. If it is the same one, it's one of the issues many people encounter(ed) during disk initialization.
Could you please give the output of the following (a note on what to look for is below):
 - ll /dev/disk/by-partuuid/
 - ll /var/lib/ceph/osd/ceph-*
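
What I would look for is whether each OSD's journal symlink still resolves to an
existing entry under /dev/disk/by-partuuid/. A rough sketch of the comparison,
assuming the default ceph-disk layout (the OSD id below is only an example):

  # partuuid recorded for the journal when the OSD was prepared
  cat /var/lib/ceph/osd/ceph-12/journal_uuid
  # where the journal symlink actually points right now
  readlink -f /var/lib/ceph/osd/ceph-12/journal
  # partition GUIDs as udev currently sees them
  ls -l /dev/disk/by-partuuid/

If the journal_uuid no longer matches any by-partuuid entry, or the symlink
resolves to a different partition than the one prepared for that OSD, that
would be consistent with the error below.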



On Thu, Mar 3, 2016 at 3:42 PM, Yoann Moulin <yoann.moulin@xxxxxxx> wrote:
Hello,

I'm (almost) a new user of Ceph (a couple of months). At my university, we started
doing some tests with Ceph a couple of months ago.

We have 2 clusters. Each cluster has 100 OSDs on 10 servers:

Each server has this setup:

CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Memory : 128GB
OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1)
Journal Storage : 2 x SSD 400GB Intel S3300 DC (no Raid)
OSD Disk : 10 x HGST ultrastar-7k6000 6TB
Network : 1 x 10Gb/s
OS : Ubuntu 14.04
Ceph version : infernalis 9.2.0

One cluster gives access to some users through an S3 gateway (the service is
still in beta). We call this cluster "ceph-beta".

The other cluster is for our internal needs, to learn more about Ceph. We call
this cluster "ceph-test". (Those servers will be integrated into the ceph-beta
cluster when we need more space.)

We deployed both clusters with the ceph-ansible playbook [1].

Journals are raw partitions on the SSDs (400GB Intel S3300 DC) with no RAID,
5 journal partitions on each SSD.

OSD disks are formatted with XFS.

1. https://github.com/ceph/ceph-ansible

We have an issue: some OSDs go down and don't start. It seems to be related to
the fsid of the journal partition:

> -1> 2016-03-03 14:09:05.422515 7f31118d0940 -1 journal FileJournal::open: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected eeadbce2-f096-4156-ba56-dfc634e59106, invalid (someone else's?) journal

Attached are the full logs of one of the dead OSDs.
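
For reference: as far as I understand, the "expected" value in that message is
the OSD's own fsid, stored in its data directory, while the all-zero "ondisk"
value means the partition the journal symlink points at has a blank journal
header, i.e. the OSD is probably not reading the journal that was prepared for
it. A minimal sketch of how to compare the two, with an example OSD id, using
the --get-journal-fsid option that ceph-disk relies on internally, if I
remember correctly:

  # fsid this OSD expects to find in its journal header
  cat /var/lib/ceph/osd/ceph-12/fsid
  # fsid actually present in the journal the symlink points to
  ceph-osd --get-journal-fsid --osd-journal /var/lib/ceph/osd/ceph-12/journal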

We had this issue with 2 OSDs on the ceph-beta cluster; we fixed it by removing,
zapping and re-adding them.
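
For reference, a typical remove/zap/re-add sequence looks roughly like this
(OSD id and device names are only examples, and the stop command depends on
whether the node runs upstart or systemd):

  ceph osd out 12
  stop ceph-osd id=12              # upstart; "systemctl stop ceph-osd@12" on systemd
  ceph osd crush remove osd.12
  ceph auth del osd.12
  ceph osd rm 12
  ceph-disk zap /dev/sdk           # wipes the partition table of the data disk
  # then recreate the OSD, e.g. by re-running the ceph-ansible play for that
  # host, or with: ceph-disk prepare /dev/sdk /dev/sdb   (data disk, journal SSD)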

Now we have the same issue on the ceph-test cluster, but on 18 OSDs.

Here is the current status of this cluster:

> root@icadmin004:~# ceph -s
>     cluster 4fb4773c-0873-44ad-a65f-269f01bfcff8
>      health HEALTH_WARN
>             1024 pgs incomplete
>             1024 pgs stuck inactive
>             1024 pgs stuck unclean
>      monmap e1: 3 mons at {iccluster003=10.90.37.4:6789/0,iccluster014=10.90.37.15:6789/0,iccluster022=10.90.37.23:6789/0}
>             election epoch 62, quorum 0,1,2 iccluster003,iccluster014,iccluster022
>      osdmap e242: 100 osds: 82 up, 82 in
>             flags sortbitwise
>       pgmap v469212: 2304 pgs, 10 pools, 2206 bytes data, 181 objects
>             4812 MB used, 447 TB / 447 TB avail
>                 1280 active+clean
>                 1024 creating+incomplete

We installed this cluster at the beginning of February. We have not used it at
all, except at the very beginning to troubleshoot an issue with ceph-ansible.
We did not push any data nor create any pools. What could explain this behaviour?

Thanks for your help

Best regards,

--
Yoann Moulin
EPFL IC-IT



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
