Re: OSDs go down with infernalis

Hello Adrien,

> I think I faced the same issue setting up my own cluster. If it is the same,
> it's one of the many issues people encounter(ed) during disk initialization.
> Could you please give the output of:
>  - ll /dev/disk/by-partuuid/
>  - ll /var/lib/ceph/osd/ceph-*

Unfortunately, I have already reinstalled my test cluster, but I did gather some
information that might explain this issue.
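
For the record, the checks Adrien asked for boil down to roughly the following
(a minimal sketch; OSD ids and device names of course differ per node):

    ls -l /dev/disk/by-partuuid/              # the journal partition symlinks, they must survive a reboot
    ls -l /var/lib/ceph/osd/ceph-*/journal    # each OSD's journal link should resolve to one of them
    ls -ln /var/lib/ceph/osd/ceph-*           # infernalis runs the OSD as ceph:ceph, so ownership matters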

I was creating the journal partitions myself before running the ansible playbook.
Firstly, the owner and permissions on those partitions were not persistent across
reboots (I had to add udev rules). And I strongly suspect a side effect of not
letting ceph-disk create the journal partitions.
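
The udev rule I ended up with was along these lines (a rough sketch only; the
rule file name and the sdk/sdl device names are placeholders for my setup, not
anything shipped by Ceph):

    # /etc/udev/rules.d/90-ceph-journal.rules
    # hand-made journal partitions on the two journal SSDs, owned by the ceph user
    KERNEL=="sdk?", SUBSYSTEM=="block", OWNER="ceph", GROUP="ceph", MODE="0660"
    KERNEL=="sdl?", SUBSYSTEM=="block", OWNER="ceph", GROUP="ceph", MODE="0660"

The cleaner way is probably to let ceph-disk carve out the journal partition
itself, since it then tags the partition with the Ceph journal partition type
GUID that the udev rules shipped with the packages match on, for example:

    ceph-disk prepare /dev/sdb /dev/sdk    # data disk first, then the journal SSD (whole device)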

Yoann

> On Thu, Mar 3, 2016 at 3:42 PM, Yoann Moulin <yoann.moulin@xxxxxxx> wrote:
> 
>     Hello,
> 
>     I'm (almost) a new user of Ceph (a couple of months). At my university, we
>     started doing some tests with Ceph a couple of months ago.
> 
>     We have 2 clusters. Each cluster has 100 OSDs on 10 servers:
> 
>     Each server has this setup:
> 
>     CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>     Memory : 128GB
>     OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1)
>     Journal Storage : 2 x SSD 400GB Intel S3300 DC (no Raid)
>     OSD Disk : 10 x HGST ultrastar-7k6000 6TB
>     Network : 1 x 10Gb/s
>     OS : Ubuntu 14.04
>     Ceph version : infernalis 9.2.0
> 
>     One cluster gives access to some users through an S3 gateway (the service is
>     still in beta). We call this cluster "ceph-beta".
> 
>     The other cluster is for our internal needs, to learn more about Ceph. We call
>     this cluster "ceph-test". (Those servers will be integrated into the ceph-beta
>     cluster when we need more space.)
> 
>     We have deployed both clusters with the ceph-ansible playbook [1].
> 
>     Journals are raw partitions on the SSDs (400GB Intel S3300 DC) with no RAID,
>     5 journal partitions on each SSD.
> 
>     OSD disks are formatted with XFS.
> 
>     1. https://github.com/ceph/ceph-ansible
> 
>     We have an issue: some OSDs go down and don't start. It seems to be related
>     to the fsid of the journal partition:
> 
>     > -1> 2016-03-03 14:09:05.422515 7f31118d0940 -1 journal FileJournal::open:
>     ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected
>     eeadbce2-f096-4156-ba56-dfc634e59106, invalid (someone else's?) journal
> 
>     The full log of one of the dead OSDs is attached.
> 
>     We had this issue with 2 OSDs on the ceph-beta cluster; it was fixed by
>     removing, zapping and re-adding them.
> 
>     Now we have the same issue on the ceph-test cluster, but on 18 OSDs.
> 
>     Here are the current stats of this cluster:
> 
>     > root@icadmin004:~# ceph -s
>     >     cluster 4fb4773c-0873-44ad-a65f-269f01bfcff8
>     >      health HEALTH_WARN
>     >             1024 pgs incomplete
>     >             1024 pgs stuck inactive
>     >             1024 pgs stuck unclean
>     >      monmap e1: 3 mons at {iccluster003=10.90.37.4:6789/0,iccluster014=10.90.37.15:6789/0,iccluster022=10.90.37.23:6789/0}
>     >             election epoch 62, quorum 0,1,2 iccluster003,iccluster014,iccluster022
>     >      osdmap e242: 100 osds: 82 up, 82 in
>     >             flags sortbitwise
>     >       pgmap v469212: 2304 pgs, 10 pools, 2206 bytes data, 181 objects
>     >             4812 MB used, 447 TB / 447 TB avail
>     >                 1280 active+clean
>     >                 1024 creating+incomplete
> 
>     We installed this cluster at the beginning of February and have barely used it,
>     apart from some troubleshooting of a ceph-ansible issue at the beginning. We did
>     not push any data nor create any pool. What could explain this behaviour?
> 
>     Thanks for your help
> 
>     Best regards,
> 
>     --
>     Yoann Moulin
>     EPFL IC-IT
> 
> 
> 
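
(For completeness: once the symlinks and ownership are correct, an OSD hitting the
all-zero fsid error quoted above can probably be given a fresh journal instead of
being fully zapped. A rough sketch, assuming Ubuntu 14.04 upstart and OSD id 12
purely as an example:

    stop ceph-osd id=12
    ceph-osd -i 12 --mkjournal --setuser ceph --setgroup ceph    # writes a new journal header carrying the OSD's fsid
    start ceph-osd id=12

But zapping and re-adding, as we did on ceph-beta, works too.)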


-- 
Yoann Moulin
EPFL IC-IT
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


