Firefly OSDs stuck in creating state forever

I couldn't find the ceph-mon stack dump in the log; none of the greps for 'ceph version' were followed by a stack trace.

Executed ceph-deploy purge/purgedata on the monitor and OSDs.
NOTE: I had to log in to each OSD host and remove /var/lib/ceph manually after unmounting the ceph/xfs device. Running purgedata from the monitor always failed for the OSDs; the "still running" message initially confused me, whereas "still mounted" wouldn't have. Executing 'ceph-deploy purge' from the monitor succeeded on all of the OSDs.
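
For the record, roughly the teardown sequence that ended up working for me (hostnames match the cluster shown below; the mount point and device paths are examples from my setup, so adjust to your layout):

    # from the admin/monitor node: remove the ceph packages first
    ceph-deploy purge essperf3 ess51 ess52 ess59

    # on each OSD host: stop the daemons, unmount the xfs data device,
    # then remove the leftover state that purgedata trips over
    service ceph stop
    umount /var/lib/ceph/osd/ceph-0
    rm -rf /var/lib/ceph

    # back on the admin node: wipe remaining data and the local config/keyrings
    ceph-deploy purgedata essperf3 ess51 ess52 ess59
    rm -f ceph.conf ceph*.keyring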

Ran ceph-deploy new/install/mon create/gatherkeys/osd create on the cluster (I haven't tried using create-initial yet for the monitor, but will use it on my next install).
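
The bringup itself was the standard ceph-deploy sequence, something like the following (the data-disk and journal-partition arguments are placeholders, not my exact devices):

    ceph-deploy new essperf3
    ceph-deploy install essperf3 ess51 ess52 ess59
    ceph-deploy mon create essperf3
    ceph-deploy gatherkeys essperf3
    # one data disk and one SSD journal partition per host, e.g.:
    ceph-deploy osd create ess51:sdb:/dev/sdc1 ess52:sdb:/dev/sdc1 ess59:sdb:/dev/sdc1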

Modified ceph.conf (a sketch of the resulting settings is below):
- private cluster network for each osd
- osd pool default pg num / pgp num
- osd pool default size / min size
- mon osd min down reporters
AND because it's not costing me anything (that I know of yet) and seems to be the first thing people ask for when reporting problems:
- debug osd = 20
- debug ms = 1
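
For concreteness, the relevant part of my ceph.conf looks roughly like this; the fsid and mon address are from the 'ceph -s' output below, but the public/cluster subnets, pg counts, and pool sizes here are illustrative rather than a recommendation:

    [global]
    fsid = 32c48975-bb57-47f6-8138-e152452e3bbe
    mon host = 209.243.160.35
    public network = 209.243.160.0/24
    cluster network = 192.168.10.0/24        # private replication network (example subnet)
    osd pool default pg num = 64
    osd pool default pgp num = 64
    osd pool default size = 2
    osd pool default min size = 1
    debug osd = 20
    debug ms = 1

    [mon]
    mon osd min down reporters = 2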

Started ceph-osd on all 3 OSD servers and restarted ceph-mon (service ceph restart) on the monitor.

As Brian experienced and reported, my cluster came up in the HEALTH_OK state immediately, with all 192 PGs in the default pools 'active+clean'. It took a week or two longer than I would have liked, but I am now quite comfortable with install/reinstall and with inspecting all components of the system state. XFS is mounted on each OSD data device; using 'ceph-disk list' I can get the partition number of the journal on the SSD, then check/dump that partition with sgdisk and see the 'ceph journal' partition name.
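
For anyone else doing the same sanity check, it is roughly this (partition number and device are examples, substitute your own):

    # shows each data disk and which SSD partition holds its journal
    ceph-disk list

    # dump the GPT entry for that journal partition; the partition
    # name field reads 'ceph journal'
    sgdisk -i 1 /dev/sdb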

[root at essperf3 Ceph]# ceph -s
    cluster 32c48975-bb57-47f6-8138-e152452e3bbe
     health HEALTH_OK
     monmap e1: 1 mons at {essperf3=209.243.160.35:6789/0}, election epoch 1, quorum 0 essperf3
     osdmap e8: 3 osds: 3 up, 3 in
      pgmap v13: 192 pgs, 3 pools, 0 bytes data, 0 objects
            10106 MB used, 1148 GB / 1158 GB avail
                 192 active+clean
[root at essperf3 Ceph]# ceph osd tree
# id	weight	type name	up/down	reweight
-1	1.13	root default
-2	0.45		host ess51
0	0.45			osd.0	up	1	
-3	0.23		host ess52
1	0.23			osd.1	up	1	
-4	0.45		host ess59
2	0.45			osd.2	up	1	
[root at essperf3 Ceph]#

I'm now moving on to creating RBD image(s) and looking at 'rbd bench-write'.
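
The rough plan, with placeholder pool/image names and sizes:

    # dedicated test pool and a 10 GB image
    ceph osd pool create rbdtest 64 64
    rbd create rbdtest/bench1 --size 10240

    # 4 KB writes, 16 threads, 1 GB total
    rbd bench-write rbdtest/bench1 --io-size 4096 --io-threads 16 --io-total 1073741824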

I have some quick questions:
- Are there any other benchmarks in wide use for Ceph clusters?

- Our next lab deployment is going to be more "real world" and involve many HDDs (~24) per OSD chassis (2 or 2 chassis). What is the general recommendation on the number of HDDs per OSD? 1 drive per OSD, where the drive can be an LVM or MD virtual drive spanning multiple HDDs (SW RAID 0)?

- Partitioning of the journal SSDs for multiple OSDs:
We can use 1 SSD per OSD for the journal and have 4-HDD RAID 0 devices (~13 TB per OSD), or smaller OSDs and multiple journals on each SSD. What is the recommended configuration? (This will most likely be investigated further as we move forward with benchmarking, but I'd like the RH/Ceph recommended best practices.)

- As long as I maintain 1 GB RAM per 1 TB of rotational storage, can we have many OSDs per physical chassis? Are there limits?

Thank you very much for all of your help.
Bruce

-----Original Message-----
From: Sage Weil [mailto:sweil@xxxxxxxxxx] 
Sent: Monday, August 04, 2014 12:25 PM
To: Bruce McFarland
Cc: ceph-users at lists.ceph.com
Subject: Re: Firefly OSDs stuck in creating state forever

On Mon, 4 Aug 2014, Bruce McFarland wrote:
> Is there a header or first line that appears in all ceph-mon stack 
> dumps I can search for?  The couple of ceph-mon stack dumps I've seen 
> in web searches appear to all begin with "ceph version 0.xx", but 
> those are from over a year ago. Is that still the case with 0.81 firefly code?

Yep!  Here's a recentish dump:

	http://tracker.ceph.com/issues/8880

sage


> 
> -----Original Message-----
> From: Sage Weil [mailto:sweil at redhat.com]
> Sent: Monday, August 04, 2014 10:09 AM
> To: Bruce McFarland
> Cc: Brian Rak; ceph-users at lists.ceph.com
> Subject: RE: [ceph-users] Firefly OSDs stuck in creating state forever
> 
> Okay, looks like the mon went down then.
> 
> Was there a stack trace in the log after the daemon crashed?  (Or did 
> the daemon stay up but go unresponsive or something?)
> 
> Thanks!
> sage
> 
> 
> On Mon, 4 Aug 2014, Bruce McFarland wrote:
> 
> > 2014-08-04 09:57:37.144649 7f42171c8700  0 -- 
> > 209.243.160.35:0/1032499
> > >> 209.243.160.35:6789/0 pipe(0x7f4204007dd0 sd=3 :0 s=1 pgs=0 cs=0
> > l=1 c=0x7f4204001a90).fault
> > 2014-08-04 09:58:07.145097 7f4215ac3700  0 -- 
> > 209.243.160.35:0/1032499
> > >> 209.243.160.35:6789/0 pipe(0x7f4204001530 sd=3 :0 s=1 pgs=0 cs=0
> > l=1 c=0x7f4204001320).fault
> > 2014-08-04 09:58:37.145491 7f42171c8700  0 -- 
> > 209.243.160.35:0/1032499
> > >> 209.243.160.35:6789/0 pipe(0x7f4204007dd0 sd=3 :0 s=1 pgs=0 cs=0
> > l=1 c=0x7f4204003eb0).fault
> > 2014-08-04 09:59:07.145776 7f4215ac3700  0 -- 
> > 209.243.160.35:0/1032499
> > >> 209.243.160.35:6789/0 pipe(0x7f4204001530 sd=5 :0 s=1 pgs=0 cs=0
> > l=1 c=0x7f4204001320).fault
> > 2014-08-04 09:59:37.146043 7f42171c8700  0 -- 
> > 209.243.160.35:0/1032499
> > >> 209.243.160.35:6789/0 pipe(0x7f4204007dd0 sd=5 :0 s=1 pgs=0 cs=0
> > l=1 c=0x7f4204003eb0).fault
> > 2014-08-04 10:00:07.146288 7f4215ac3700  0 -- 
> > 209.243.160.35:0/1032499
> > >> 209.243.160.35:6789/0 pipe(0x7f4204001530 sd=5 :0 s=1 pgs=0 cs=0
> > l=1 c=0x7f4204001320).fault
> > 2014-08-04 10:00:37.146543 7f42171c8700  0 -- 
> > 209.243.160.35:0/1032499
> > >> 209.243.160.35:6789/0 pipe(0x7f4204007dd0 sd=5 :0 s=1 pgs=0 cs=0
> > l=1 c=0x7f4204003eb0).fault
> > 
> > 209.243.160.35 - monitor
> > 209.243.160.51 - osd.0
> > 209.243.160.52 - osd.3
> > 209.243.160.59 - osd.2
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil at redhat.com]
> > Sent: Sunday, August 03, 2014 11:15 AM
> > To: Bruce McFarland
> > Cc: Brian Rak; ceph-users at lists.ceph.com
> > Subject: Re: [ceph-users] Firefly OSDs stuck in creating state 
> > forever
> > 
> > On Sun, 3 Aug 2014, Bruce McFarland wrote:
> > > Is there a recommended way to take every thing down and restart 
> > > the process? I was considering starting completely from scratch ie 
> > > OS reinstall and then using Ceph-deploy as before.
> > 
> > If you're using ceph-deploy, then
> > 
> >  ceph-deploy purge HOST
> >  ceph-deploy purgedata HOST
> > 
> > will do it.  Then remove the ceph.* (config and keyring) files from the current directory.
> > 
> > > I've learned a lot and want to figure out a foolproof way I can 
> > > document for others in our lab to bring up a cluster on new HW.  I 
> > > learn a lot more when I break things and have to figure out what 
> > > went wrong, so it's a little frustrating, but I've found out a lot 
> > > about verifying the configuration and debug options so far. My 
> > > intent is to investigate rbd usage, perf, and configuration options.
> > > 
> > > The "endless loop" I'm referring to is a constant stream of fault 
> > > messages that I'm not yet familiar on how to interpret. I have let 
> > > them run to see if the cluster recovers, but Ceph-mon always crashed.
> > > I'll look for the crash dump and save it since kdump should be 
> > > enabled on the monitor box.
> > 
> > Do you have one of the messages handy?  I'm curious whether it is an OSD or a mon.
> > 
> > Thanks!
> > sage
> > 
> > 
> > 
> > > Thanks for the feedback. 
> > > 
> > > 
> > > > On Aug 3, 2014, at 8:30 AM, "Sage Weil" <sweil at redhat.com> wrote:
> > > > 
> > > > Hi Bruce,
> > > > 
> > > >> On Sun, 3 Aug 2014, Bruce McFarland wrote:
> > > >> Yes I looked at tcpdump on each of the OSDs and saw 
> > > >> communications between all 3 OSDs before I sent my first question to this list.
> > > >> When I disabled selinux on the one offending server based on 
> > > >> your feedback (typically we have this disabled on lab systems 
> > > >> that are only on the lab net) the 10 pages in my test pool all 
> > > >> went to 'active+clean' almost immediately. Unfortunately the 3 
> > > >> default pools still remain in the creating states and are not health_ok.
> > > >> The OSDs all stayed UP/IN after the selinux change for the rest 
> > > >> of the day until I made the mistake of creating a RBD image on 
> > > >> demo-pool and its 10 'active+clean' pages. I created the rbd, 
> > > >> but when I attempted to look at it with 'rbd info' the cluster 
> > > >> went into an endless loop  trying to read a placement group and 
> > > >> loop that I left running overnight. This morning
> > > > 
> > > > What do you mean by "went into an endless loop"?
> > > > 
> > > >> ceph-mon had crashed again. I'll probably start all over from 
> > > >> scratch once again on Monday.
> > > > 
> > > > Was there a stack dump in the mon log?
> > > > 
> > > > It is possible that there is a bug with pool creation that 
> > > > surfaced by having selinux in place for so long, but otherwise 
> > > > this scenario doesn't make much sense to me.  :/  Very 
> > > > interested in hearing more, and/or whether you can reproduce it.
> > > > 
> > > > Thanks!
> > > > sage
> > > > 
> > > > 
> > > >> 
> > > >>  
> > > >> 
> > > >> I deleted ceph-mds and got rid of the 'laggy' comments from 'ceph health'.
> > > >> The 'official' online Ceph docs on that are 'coming soon' and most 
> > > >> references I could find were pre-firefly, so it was a little 
> > > >> trial and error to figure out to use the pool number and not 
> > > >> its name to get the removal to work. Same with 'ceph mds newfs' to get rid of the 'laggy-ness' in the 'ceph health'
> > > >> output.
> > > >> 
> > > >>  
> > > >> 
> > > >> [root at essperf3 Ceph]# ceph mds rm 0  mds.essperf3
> > > >> 
> > > >> mds gid 0 dne
> > > >> 
> > > >> [root at essperf3 Ceph]# ceph health
> > > >> 
> > > >> HEALTH_WARN 96 pgs incomplete; 96 pgs peering; 192 pgs stuck 
> > > >> inactive; 192 pgs stuck unclean mds essperf3 is laggy
> > > >> 
> > > >> [root at essperf3 Ceph]# ceph mds newfs 1 0  
> > > >> --yes-i-really-mean-it
> > > >> 
> > > >> new fs with metadata pool 1 and data pool 0
> > > >> 
> > > >> [root at essperf3 Ceph]# ceph health
> > > >> 
> > > >> HEALTH_WARN 96 pgs incomplete; 96 pgs peering; 192 pgs stuck 
> > > >> inactive; 192 pgs stuck unclean
> > > >> 
> > > >> [root at essperf3 Ceph]#
> > > >> 
> > > >>  
> > > >> 
> > > >>  
> > > >> 
> > > >>  
> > > >> 
> > > >> From: Brian Rak [mailto:brak at gameservers.com]
> > > >> Sent: Friday, August 01, 2014 6:14 PM
> > > >> To: Bruce McFarland; ceph-users at lists.ceph.com
> > > >> Subject: Re: [ceph-users] Firefly OSDs stuck in creating state 
> > > >> forever
> > > >> 
> > > >>  
> > > >> 
> > > >> What happens if you remove nodown?  I'd be interested to see 
> > > >> what OSDs it thinks are down. My next thought would be tcpdump on the private interface.
> > > >> See if the OSDs are actually managing to connect to each other.
> > > >> 
> > > >> For comparison, when I bring up a cluster of 3 OSDs it goes to 
> > > >> HEALTH_OK nearly instantly (definitely under a minute!), so 
> > > >> it's probably not just taking awhile.
> > > >> 
> > > >> Does 'ceph osd dump' show the proper public and private IPs?
> > > >> 
> > > >> On 8/1/2014 6:13 PM, Bruce McFarland wrote:
> > > >> 
> > > >>      MDS: I assumed that I'd need to bring up a ceph-mds for my
> > > >>      cluster at initial bringup. We also intended to modify the CRUSH
> > > >>      map such that its pool is resident to SSD(s). It is one of the
> > > >>      areas of the online docs where there doesn't seem to be a lot of info,
> > > >>      and I haven't spent a lot of time researching it. I'll stop it.
> > > >> 
> > > >>       
> > > >> 
> > > >>      OSD connectivity:  The connectivity is good for both 1GE and
> > > >>      10GE. I thought moving to 10GE with nothing else on that net
> > > >>      might help with group placement etc and bring up the pages
> > > >>      quicker. I've checked 'tcpdump' output on all boxes.
> > > >> 
> > > >>      Firewall: Thanks for that one - it's the 'basic' I overlooked
> > > >>      in my ceph learning curve. One of the OSDs had selinux=enforcing
> > > >>      - all others were disabled. After changing that box, the 10 pages
> > > >>      in my demo-pool (kept the page count very small for sanity) are now
> > > >>      'active+clean'. The pages for the default pools - data,
> > > >>      metadata, rbd - are still stuck in creating+peering or
> > > >>      creating+incomplete. I did have to manually set 'osd pool
> > > >>      default min size = 1' from its default of 2 for these 3 pools
> > > >>      to eliminate a bunch of warnings in the 'ceph health detail'
> > > >>      output.
> > > >> 
> > > >>      I'm adding the [mon] setting you suggested below and stopping
> > > >>      ceph-mds and bringing everything up now.
> > > >> 
> > > >>      [root at essperf3 Ceph]# ceph -s
> > > >> 
> > > >>          cluster 4b3ffe60-73f4-4512-b7da-b04e4775dd73
> > > >> 
> > > >>           health HEALTH_WARN 96 pgs incomplete; 96 pgs peering; 192
> > > >>      pgs stuck inactive; 192 pgs stuck unclean; 28 requests are
> > > >>      blocked > 32 sec; nodown,noscrub flag(s) set
> > > >> 
> > > >>           monmap e1: 1 mons at {essperf3=209.243.160.35:6789/0},
> > > >>      election epoch 1, quorum 0 essperf3
> > > >> 
> > > >>           mdsmap e43: 1/1/1 up {0=essperf3=up:creating}
> > > >> 
> > > >>           osdmap e752: 3 osds: 3 up, 3 in
> > > >> 
> > > >>                  flags nodown,noscrub
> > > >> 
> > > >>            pgmap v1483: 202 pgs, 4 pools, 0 bytes data, 0 
> > > >> objects
> > > >> 
> > > >>                  134 MB used, 1158 GB / 1158 GB avail
> > > >> 
> > > >>                        96 creating+peering
> > > >> 
> > > >>                        10 active+clean
> > > >>      <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<!!!!!!!!
> > > >> 
> > > >>                        96 creating+incomplete
> > > >> 
> > > >>      [root at essperf3 Ceph]#
> > > >> 
> > > >>       
> > > >> 
> > > >>      From: Brian Rak [mailto:brak at gameservers.com]
> > > >>      Sent: Friday, August 01, 2014 2:54 PM
> > > >>      To: Bruce McFarland; ceph-users at lists.ceph.com
> > > >>      Subject: Re: [ceph-users] Firefly OSDs stuck in creating state
> > > >>      forever
> > > >> 
> > > >>  
> > > >> 
> > > >> Why do you have a MDS active?  I'd suggest getting rid of that 
> > > >> at least until you have everything else working.
> > > >> 
> > > >> I see you've set nodown on the OSDs, did you have problems with 
> > > >> the OSDs flapping?  Do the OSDs have broken connectivity 
> > > >> between themselves?  Do you have some kind of firewall interfering here?
> > > >> 
> > > >> I've seen odd issues when the OSDs have broken private 
> > > >> networking, you'll get one OSD marking all the other ones down.
> > > >> Adding this to my config helped:
> > > >> 
> > > >> [mon]
> > > >> mon osd min down reporters = 2
> > > >> 
> > > >> 
> > > >> On 8/1/2014 5:41 PM, Bruce McFarland wrote:
> > > >> 
> > > >>      Hello,
> > > >> 
> > > >>      I've run out of ideas and assume I've overlooked something
> > > >>      very basic. I've created 2 ceph clusters in the last 2
> > > >>      weeks with different OSD HW and private network fabrics -
> > > >>      1GE and 10GE. I have never been able to get the OSDs to
> > > >>      come up to the 'active+clean' state. I have followed your
> > > >>      online documentation and at this point the only thing I
> > > >>      don't think I've done is modifying the CRUSH map (although
> > > >>      I have been looking into that). These are new clusters
> > > >>      with no data and only 1 HDD and 1 SSD per OSD (24 2.5 GHz
> > > >>      cores with 64GB RAM).
> > > >> 
> > > >>       
> > > >> 
> > > >>      Since the disks are being recycled, is there something I
> > > >>      need to flag to let ceph just create its mappings, but
> > > >>      not scrub for data compatibility? I've tried setting the
> > > >>      noscrub flag to no effect.
> > > >> 
> > > >>       
> > > >> 
> > > >>      I also have constant OSD flapping. I've set nodown, but
> > > >>      assume that is just masking a problem that is still
> > > >>      occurring.
> > > >> 
> > > >>       
> > > >> 
> > > >>      Besides never reaching the 'active+clean' state,
> > > >>      ceph-mon always crashes after I leave it running
> > > >>      overnight. The OSDs all eventually fill /root with
> > > >>      ceph logs, so I regularly have to bring everything down,
> > > >>      delete the logs, and restart.
> > > >> 
> > > >>       
> > > >> 
> > > >>      I have all sorts of output: the ceph.conf; osd boot
> > > >>      output with 'debug osd = 20' and 'debug ms = 1'; ceph -w
> > > >>      output; and pretty much all of the debug/monitoring
> > > >>      suggestions from the online docs and 2 weeks of google
> > > >>      searches from online references in blogs, mailing lists
> > > >>      etc.
> > > >> 
> > > >>       
> > > >> 
> > > >>      [root at essperf3 Ceph]# ceph -v
> > > >> 
> > > >>      ceph version 0.80.1
> > > >>      (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
> > > >> 
> > > >>      [root at essperf3 Ceph]# ceph -s
> > > >> 
> > > >>          cluster 4b3ffe60-73f4-4512-b7da-b04e4775dd73
> > > >> 
> > > >>           health HEALTH_WARN 96 pgs incomplete; 106 pgs
> > > >>      peering; 202 pgs stuck inactive; 202 pgs stuck unclean;
> > > >>      nodown,noscrub flag(s) set
> > > >> 
> > > >>           monmap e1: 1 mons at
> > > >>      {essperf3=209.243.160.35:6789/0}, election epoch 1, quorum
> > > >>      0 essperf3
> > > >> 
> > > >>           mdsmap e43: 1/1/1 up {0=essperf3=up:creating}
> > > >> 
> > > >>           osdmap e752: 3 osds: 3 up, 3 in
> > > >> 
> > > >>                  flags nodown,noscrub
> > > >> 
> > > >>            pgmap v1476: 202 pgs, 4 pools, 0 bytes data, 0
> > > >>      objects
> > > >> 
> > > >>                  134 MB used, 1158 GB / 1158 GB avail
> > > >> 
> > > >>                       106 creating+peering
> > > >> 
> > > >>                        96 creating+incomplete
> > > >> 
> > > >>      [root at essperf3 Ceph]#
> > > >> 
> > > >>       
> > > >> 
> > > >>      Suggestions?
> > > >> 
> > > >>      Thanks,
> > > >> 
> > > >>      Bruce
> > > >> 
> > > >> 
> > > >> 
> > > >> 
> > > >> 
> > > >> _______________________________________________
> > > >> 
> > > >> ceph-users mailing list
> > > >> 
> > > >> ceph-users at lists.ceph.com
> > > >> 
> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >> 
> > > >>  
> > > >> 
> > > >>  
> > > >> 
> > > >> 
> > > 
> > > 
> > 
> > 
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 

