Hi Bruce,

On Sun, 3 Aug 2014, Bruce McFarland wrote:
> Yes, I looked at tcpdump on each of the OSDs and saw communications
> between all 3 OSDs before I sent my first question to this list. When I
> disabled selinux on the one offending server based on your feedback
> (typically we have this disabled on lab systems that are only on the lab
> net), the 10 PGs in my test pool all went to "active+clean" almost
> immediately. Unfortunately the 3 default pools still remain in the
> creating states and are not health_ok. The OSDs all stayed UP/IN after
> the selinux change for the rest of the day until I made the mistake of
> creating an RBD image on demo-pool and its 10 "active+clean" PGs. I
> created the rbd, but when I attempted to look at it with "rbd info" the
> cluster went into an endless loop trying to read a placement group, a
> loop that I left running overnight. This morning

What do you mean by "went into an endless loop"?

> ceph-mon had crashed again. I'll probably start all over from scratch
> once again on Monday.

Was there a stack dump in the mon log?

It is possible that there is a bug with pool creation that surfaced by
having selinux in place for so long, but otherwise this scenario doesn't
make much sense to me. :/  Very interested in hearing more, and/or
whether you can reproduce it.

Thanks!
sage

> I deleted ceph-mds and got rid of the "laggy" comments from "ceph
> health". The "official" online Ceph docs on that are "coming soon", and
> most references I could find were pre-Firefly, so it was a little trial
> and error to figure out to use the pool number and not its name to get
> the removal to work. Same with "ceph mds newfs" to get rid of the
> laggy-ness in the "ceph health" output.
>
> [root@essperf3 Ceph]# ceph mds rm 0 mds.essperf3
> mds gid 0 dne
> [root@essperf3 Ceph]# ceph health
> HEALTH_WARN 96 pgs incomplete; 96 pgs peering; 192 pgs stuck inactive;
> 192 pgs stuck unclean; mds essperf3 is laggy
> [root@essperf3 Ceph]# ceph mds newfs 1 0 --yes-i-really-mean-it
> new fs with metadata pool 1 and data pool 0
> [root@essperf3 Ceph]# ceph health
> HEALTH_WARN 96 pgs incomplete; 96 pgs peering; 192 pgs stuck inactive;
> 192 pgs stuck unclean
> [root@essperf3 Ceph]#
>
> From: Brian Rak [mailto:brak at gameservers.com]
> Sent: Friday, August 01, 2014 6:14 PM
> To: Bruce McFarland; ceph-users at lists.ceph.com
> Subject: Re: [ceph-users] Firefly OSDs stuck in creating state forever
>
> What happens if you remove nodown? I'd be interested to see which OSDs
> it thinks are down. My next thought would be tcpdump on the private
> interface. See if the OSDs are actually managing to connect to each
> other.
>
> For comparison, when I bring up a cluster of 3 OSDs it goes to HEALTH_OK
> nearly instantly (definitely under a minute!), so it's probably not just
> taking a while.
>
> Does 'ceph osd dump' show the proper public and private IPs?
>
> On 8/1/2014 6:13 PM, Bruce McFarland wrote:
> MDS: I assumed that I'd need to bring up a ceph-mds for my cluster at
> initial bringup. We also intended to modify the CRUSH map such that its
> pool is resident to SSD(s). It is one of the areas of the online docs
> where there doesn't seem to be a lot of info, and I haven't spent a lot
> of time researching it. I'll stop it.
>
> OSD connectivity: The connectivity is good for both 1GE and 10GE. I
> thought moving to 10GE with nothing else on that net might help with
> group placement etc. and bring up the PGs quicker. I've checked
> "tcpdump" output on all boxes.
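For what it's worth, a rough sketch of how I'd sanity-check the OSD
addresses and inter-OSD reachability from each OSD host; the interface
name eth1 and the default 6800-7300 OSD port range are assumptions here,
so adjust for your setup:

    # confirm each OSD registered the expected public and cluster IPs
    ceph osd dump | grep '^osd\.'

    # on each OSD host, rule out selinux/iptables getting in the way
    getenforce
    iptables -L -n

    # watch OSD-to-OSD traffic on the cluster (private) interface
    tcpdump -ni eth1 portrange 6800-7300

If the cluster addresses shown by 'ceph osd dump' look wrong, that
usually points back at the public/cluster network settings in ceph.conf.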
> Firewall: Thanks for that one - it's the "basic" I overlooked in my ceph
> learning curve. One of the OSDs had selinux=enforcing; all others were
> disabled. After changing that box, the 10 PGs in my demo-pool (I kept
> the PG count very small for sanity) are now "active+clean". The PGs for
> the default pools (data, metadata, rbd) are still stuck in
> creating+peering or creating+incomplete. I did have to manually set
> "osd pool default min size = 1" from its default of 2 for these 3 pools
> to eliminate a bunch of warnings in the "ceph health detail" output.
>
> I'm adding the [mon] setting you suggested below, stopping ceph-mds,
> and bringing everything up now.
>
> [root@essperf3 Ceph]# ceph -s
>     cluster 4b3ffe60-73f4-4512-b7da-b04e4775dd73
>      health HEALTH_WARN 96 pgs incomplete; 96 pgs peering; 192 pgs
>             stuck inactive; 192 pgs stuck unclean; 28 requests are
>             blocked > 32 sec; nodown,noscrub flag(s) set
>      monmap e1: 1 mons at {essperf3=209.243.160.35:6789/0}, election
>             epoch 1, quorum 0 essperf3
>      mdsmap e43: 1/1/1 up {0=essperf3=up:creating}
>      osdmap e752: 3 osds: 3 up, 3 in
>             flags nodown,noscrub
>       pgmap v1483: 202 pgs, 4 pools, 0 bytes data, 0 objects
>             134 MB used, 1158 GB / 1158 GB avail
>                   96 creating+peering
>                   10 active+clean   <<<<<<<<<<<<<<<<<<!!!!!!!!
>                   96 creating+incomplete
> [root@essperf3 Ceph]#
>
> From: Brian Rak [mailto:brak at gameservers.com]
> Sent: Friday, August 01, 2014 2:54 PM
> To: Bruce McFarland; ceph-users at lists.ceph.com
> Subject: Re: [ceph-users] Firefly OSDs stuck in creating state forever
>
> Why do you have an MDS active? I'd suggest getting rid of that at least
> until you have everything else working.
>
> I see you've set nodown on the OSDs; did you have problems with the OSDs
> flapping? Do the OSDs have broken connectivity between themselves? Do
> you have some kind of firewall interfering here?
>
> I've seen odd issues when the OSDs have broken private networking:
> you'll get one OSD marking all the other ones down. Adding this to my
> config helped:
>
> [mon]
> mon osd min down reporters = 2
>
> On 8/1/2014 5:41 PM, Bruce McFarland wrote:
> Hello,
> I've run out of ideas and assume I've overlooked something very basic.
> I've created 2 ceph clusters in the last 2 weeks with different OSD HW
> and private network fabrics (1GE and 10GE). I have never been able to
> get the placement groups to reach the "active+clean" state. I have
> followed your online documentation, and at this point the only thing I
> don't think I've done is modify the CRUSH map (although I have been
> looking into that). These are new clusters with no data and only 1 HDD
> and 1 SSD per OSD (24 2.5GHz cores with 64GB RAM).
>
> Since the disks are being recycled, is there something I need to flag to
> let ceph just create its mappings, but not scrub the old data for
> consistency? I've tried setting the noscrub flag to no effect.
>
> I also have constant OSD flapping. I've set nodown, but assume that is
> just masking a problem that is still occurring.
>
> Besides never reaching the "active+clean" state, ceph-mon always crashes
> after leaving it running overnight. The OSDs all eventually fill /root
> with ceph logs, so I regularly have to bring everything down, delete the
> logs, and restart.
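Pulling together the settings mentioned in this thread, a minimal
ceph.conf sketch might look like the following; the log path and the
size/min_size values are illustrative, not required:

    [global]
        # keep daemon logs out of /root
        log file = /var/log/ceph/$cluster-$name.log
        # mirror what was set by hand earlier in the thread
        osd pool default size = 2
        osd pool default min size = 1

    [mon]
        # require down reports from at least 2 OSDs before marking a peer down
        mon osd min down reporters = 2

The size/min_size lines just repeat what was already applied manually to
the test pools; adjust them to whatever you actually want for real data.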
> I have all sorts of output from the ceph.conf; OSD boot output with
> "debug osd = 20" and "debug ms = 1"; "ceph -w" output; and pretty much
> all of the debug/monitoring suggestions from the online docs and 2
> weeks of google searches of online references in blogs, mailing lists,
> etc.
>
> [root@essperf3 Ceph]# ceph -v
> ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
> [root@essperf3 Ceph]# ceph -s
>     cluster 4b3ffe60-73f4-4512-b7da-b04e4775dd73
>      health HEALTH_WARN 96 pgs incomplete; 106 pgs peering; 202 pgs
>             stuck inactive; 202 pgs stuck unclean; nodown,noscrub
>             flag(s) set
>      monmap e1: 1 mons at {essperf3=209.243.160.35:6789/0}, election
>             epoch 1, quorum 0 essperf3
>      mdsmap e43: 1/1/1 up {0=essperf3=up:creating}
>      osdmap e752: 3 osds: 3 up, 3 in
>             flags nodown,noscrub
>       pgmap v1476: 202 pgs, 4 pools, 0 bytes data, 0 objects
>             134 MB used, 1158 GB / 1158 GB avail
>                  106 creating+peering
>                   96 creating+incomplete
> [root@essperf3 Ceph]#
>
> Suggestions?
>
> Thanks,
> Bruce
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
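For the pools still stuck in creating+peering / creating+incomplete, a
rough sketch of the usual next diagnostic steps; the pg id 0.5 below is
just a placeholder, substitute one of the ids reported as stuck:

    # which PGs are stuck, and which OSDs do they map to?
    ceph health detail
    ceph pg dump_stuck inactive

    # pool settings: replicated size, min_size, crush ruleset, pg_num
    ceph osd dump | grep '^pool'

    # ask one of the stuck PGs what it is waiting for
    ceph pg 0.5 query

One common culprit on small test clusters is a CRUSH rule that wants more
hosts than are available for the pool's replica count.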