Firefly OSDs stuck in creating state forever

What happens if you remove nodown?  I'd be interested to see what OSDs 
it thinks are down. My next thought would be tcpdump on the private 
interface.  See if the OSDs are actually managing to connect to each other.
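
Something like this should show it (rough sketch -- I'm assuming eth1 is your cluster-network interface, adjust for your setup):

ceph osd unset nodown                              # let the mons mark OSDs down again
tcpdump -i eth1 -nn 'tcp portrange 6800-7300'      # OSD heartbeat/replication traffic

If the private interface stays silent on that port range, the OSDs aren't actually talking to each other.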

For comparison, when I bring up a cluster of 3 OSDs it goes to HEALTH_OK 
nearly instantly (definitely under a minute!), so it's probably not just 
taking a while.

Does 'ceph osd dump' show the proper public and private IPs?
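
Something along these lines will show the addresses each OSD registered with:

ceph osd dump | grep '^osd'       # first address is public, second is the cluster address

If either address is wrong (or 0.0.0.0), peering traffic won't go where you expect.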

On 8/1/2014 6:13 PM, Bruce McFarland wrote:
>
> MDS: I assumed that I'd need to bring up a ceph-mds for my cluster at 
> initial bringup. We also intended to modify the CRUSH map so that 
> its pool is resident on SSD(s). It's one of the areas of the online 
> docs where there doesn't seem to be a lot of info, and I haven't spent 
> a lot of time researching it. I'll stop it.
>
> OSD connectivity:  The connectivity is good for both 1GE and 10GE. I 
> thought moving to 10GE with nothing else on that net might help with 
> placement group creation etc. and bring the PGs up quicker. I've checked 
> 'tcpdump' output on all boxes.
>
> Firewall: Thanks for that one - it's the "basic" I overlooked in my 
> ceph learning curve. One of the OSDs had selinux=enforcing -- all 
> others were disabled. After changing that box, the 10 PGs in my 
> demo-pool (I kept the PG count very small for sanity) are now 
> 'active+clean'. The PGs for the default pools -- data, metadata, rbd 
> -- are still stuck in creating+peering or creating+incomplete. I did 
> have to manually set 'osd pool default min size = 1' from its 
> default of 2 for these 3 pools to eliminate a bunch of warnings in 
> the 'ceph health detail' output.
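>
> Roughly along these lines, per pool, since the config-file default may 
> only apply to newly created pools:
>
> ceph osd pool set data min_size 1
> ceph osd pool set metadata min_size 1
> ceph osd pool set rbd min_size 1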
>
> I'm adding the [mon] setting you suggested below, stopping ceph-mds, 
> and bringing everything back up now.
>
> [root@essperf3 Ceph]# ceph -s
>
>     cluster 4b3ffe60-73f4-4512-b7da-b04e4775dd73
>
>      health HEALTH_WARN 96 pgs incomplete; 96 pgs peering; 192 pgs 
> stuck inactive; 192 pgs stuck unclean; 28 requests are blocked > 32 
> sec; nodown,noscrub flag(s) set
>
>      monmap e1: 1 mons at {essperf3=209.243.160.35:6789/0}, election 
> epoch 1, quorum 0 essperf3
>
>      mdsmap e43: 1/1/1 up {0=essperf3=up:creating}
>
>      osdmap e752: 3 osds: 3 up, 3 in
>
>                 flags nodown,noscrub
>
>       pgmap v1483: 202 pgs, 4 pools, 0 bytes data, 0 objects
>
>             134 MB used, 1158 GB / 1158 GB avail
>
>                      96 creating+peering
>
>                      10 active+clean <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<!!!!!!!!
>
>                      96 creating+incomplete
>
> [root@essperf3 Ceph]#
>
> *From:* Brian Rak [mailto:brak@gameservers.com]
> *Sent:* Friday, August 01, 2014 2:54 PM
> *To:* Bruce McFarland; ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Firefly OSDs stuck in creating state forever
>
> Why do you have an MDS active?  I'd suggest getting rid of that at 
> least until you have everything else working.
>
> I see you've set nodown on the OSDs, did you have problems with the 
> OSDs flapping?  Do the OSDs have broken connectivity between 
> themselves?  Do you have some kind of firewall interfering here?
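>
> A quick way to rule that out on each OSD host (assuming iptables and 
> selinux here, adjust for your distro) is something like:
>
> getenforce        # should be Permissive or Disabled
> iptables -S       # look for rules dropping 6789 (mon) or 6800-7300 (OSDs)
>
> The monitor listens on 6789 and the OSDs bind in the 6800-7300 range, 
> so anything blocking those ports will stall peering.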
>
> I've seen odd issues when the OSDs have broken private networking, 
> you'll get one OSD marking all the other ones down.  Adding this to my 
> config helped:
>
> [mon]
> mon osd min down reporters = 2
>
> On 8/1/2014 5:41 PM, Bruce McFarland wrote:
>
>     Hello,
>
>     I've run out of ideas and assume I've overlooked something very
>     basic. I've created 2 ceph clusters in the last 2 weeks with
>     different OSD HW and private network fabrics -- 1GE and 10GE. I
>     have never been able to get the PGs to reach the
>     'active+clean' state. I have followed your online documentation
>     and at this point the only thing I don't think I've done is
>     modifying the CRUSH map (although I have been looking into that).
>     These are new clusters with no data and only 1 HDD and 1 SSD per
>     OSD (24 2.5GHz cores with 64GB RAM).
>
>     Since the disks are being recycled, is there something I need to
>     flag to let ceph just create its mappings, but not scrub for data
>     compatibility? I've tried setting the noscrub flag to no effect.
>
>     I also have constant OSD flapping. I've set nodown, but assume
>     that is just masking a problem that is still occurring.
>
>     Besides never reaching the 'active+clean' state, ceph-mon
>     always crashes after leaving it running overnight. The OSDs all
>     eventually fill /root with ceph logs, so I regularly have to
>     bring everything down, delete the logs, and restart.
>
>     I have all sorts of output: the ceph.conf; osd boot output with
>     'debug osd = 20' and 'debug ms = 1'; ceph -w output; and pretty
>     much all of the debug/monitoring suggestions from the online docs
>     and 2 weeks of google searches through blogs, mailing lists, etc.
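>
>     For reference, the debug settings were roughly this in ceph.conf:
>
>     [osd]
>     debug osd = 20        ; verbose OSD logging
>     debug ms = 1          ; messenger-level logging
>
>     or injected at runtime with something like:
>
>     ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1'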
>
>     [root@essperf3 Ceph]# ceph -v
>
>     ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
>
>     [root@essperf3 Ceph]# ceph -s
>
>         cluster 4b3ffe60-73f4-4512-b7da-b04e4775dd73
>
>          health HEALTH_WARN 96 pgs incomplete; 106 pgs peering; 202
>     pgs stuck inactive; 202 pgs stuck unclean; nodown,noscrub flag(s) set
>
>          monmap e1: 1 mons at {essperf3=209.243.160.35:6789/0},
>     election epoch 1, quorum 0 essperf3
>
>          mdsmap e43: 1/1/1 up {0=essperf3=up:creating}
>
>          osdmap e752: 3 osds: 3 up, 3 in
>
>                 flags nodown,noscrub
>
>           pgmap v1476: 202 pgs, 4 pools, 0 bytes data, 0 objects
>
>                 134 MB used, 1158 GB / 1158 GB avail
>
>                      106 creating+peering
>
>                       96 creating+incomplete
>
>     [root@essperf3 Ceph]#
>
>     Suggestions?
>
>     Thanks,
>
>     Bruce
>
