I purged Ceph on the kitt node, rebooted, then used ceph-deploy to reinstall the packages and redeploy the monitor and OSDs. Everything seemed to work like a charm, and the node can now reboot with the monitor and OSDs coming back up quickly. The sequence I ran is sketched below.
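Roughly (from memory, so treat the exact subcommands as a sketch and check ceph-deploy --help for your version):

    ceph-deploy purge kitt              # uninstall the ceph packages on the node
    ceph-deploy purgedata kitt          # wipe the node's ceph data directories
    ceph-deploy install kitt            # reinstall the packages
    ceph-deploy mon create kitt         # redeploy the monitor
    ceph-deploy osd create kitt:/dev/sdb:/var/lib/ceph/osd/ceph-sdb.journal   # recreate the OSD(s)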
Not sure why, but everything is back up and running...
Now if I could figure out the exact same issue on my other host...
Thanks,
Jon A
On Thu, Jun 27, 2013 at 10:34 AM, Jon <three18ti@xxxxxxxxx> wrote:
Hello All,

I've made some progress, but I'm still having a bit of difficulty.

All of my monitors are responding now. I had to break my Open vSwitch configuration: Ceph tries to start before OVS has brought networking up, so the monitors couldn't reach each other and were failing to start. Assigning the cluster IP to the physical NIC mostly resolved this, except that kitt is now listening on port 6800 instead of 6789 like all the other nodes:

root@kitt:~# ceph -s
   health HEALTH_OK
   monmap e8: 3 mons at {kitt=192.168.0.35:6800/0,red6=192.168.0.40:6789/0,shepard=192.168.0.2:6789/0}, election epoch 136, quorum 0,1,2 kitt,red6,shepard
   osdmap e939: 4 osds: 4 up, 4 in
   pgmap v69300: 224 pgs: 224 active+clean; 571 GB data, 1143 GB used, 1648 GB / 2792 GB avail
   mdsmap e1: 0/0/1 up

root@kitt:~# netstat -anp | grep ceph
tcp        0      0 192.168.0.35:6800     0.0.0.0:*           LISTEN      1181/ceph-mon
tcp        0      0 192.168.0.35:58259    192.168.0.2:6789    ESTABLISHED 1181/ceph-mon
tcp        0      0 192.168.0.35:58324    192.168.0.40:6789   ESTABLISHED 1181/ceph-mon
unix  2      [ ACC ]     STREAM     LISTENING     12620    1181/ceph-mon   /var/run/ceph/ceph-mon.kitt.asok

This usually seems to be OK; however, I intermittently get an error that the daemon cannot be contacted on 6789:

2013-06-17 11:30:10.773333 7fd1170b9700  0 -- :/29711 >> 192.168.0.35:6789/0 pipe(0x236f510 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
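To double-check where the monitor actually registered itself, I've been comparing the monmap with what the daemon reports over its admin socket; roughly (a sketch, run on kitt, using the admin socket path from the netstat output above):

    ceph mon dump                                                     # the monmap should show kitt at 192.168.0.35:6800
    ceph --admin-daemon /var/run/ceph/ceph-mon.kitt.asok mon_status   # the daemon's own view of its address and rank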
How do I tell ceph to either run on port 6789, or tell it that this particular monitor is running on 6800?

That said, I am unable to deploy OSDs on the kitt host. There were three running OSDs before all this trouble; however, when my cluster crashed, they would not come back up.

root@kitt:~# ceph osd tree
# id    weight  type name      up/down reweight
-1      2.72    root default
-2      0.91            host red6
0       0.91                    osd.0   up      1
-3      0               host kitt
-4      1.81            host shepard
3       0.45                    osd.3   up      1
4       0.45                    osd.4   up      1
6       0.91                    osd.6   up      1

What had previously worked was removing the OSD completely and then re-adding it; however, when I attempt to re-add the OSD now, the journal is created but the mount point is not, and the disk is never mounted:

root@shepard:~/ceph-conf# ceph-deploy osd --zap-disk create kitt:/dev/sdb:/var/lib/ceph/osd/ceph-sdb.journal

root@kitt:~# ls -lah /var/lib/ceph/osd/
total 8.0K
drwxr-xr-x 2 root root 4.0K Jun 16 23:13 .
drwxr-xr-x 8 root root 4.0K Jun  1 23:23 ..
-rw-r--r-- 1 root root 1.0G Jun 12 20:14 ceph-osd-journal.sdb
-rw-r--r-- 1 root root 1.0G Jun 16 23:13 ceph-sdb.journal

The last recorded error was an "unable to open superblock" error. That was several weeks ago, but I think it coincides with the initial trouble I experienced. I have tested these disks and can confirm that they have not failed: I am able to format and mount them using either a GPT or an MSDOS partition table and can read/write to them. I also ran the smartmontools long test and a full read/write test with another tool (the name escapes me... badpart or similar), so I don't think the actual disk is at fault here. (I actually might have another disk that I can test with, but ideally I would like to use the disks that were previously in use in this host.)

Relevant logs:

root@kitt:~# tail /var/log/ceph/ceph-osd.1.log.5
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
   max_recent     10000
   max_new        1000
   log_file /var/log/ceph/ceph-osd.1.log
--- end dump of recent events ---
2013-06-12 17:59:27.811593 7fb1d15817c0  0 ceph version 0.61.3 (92b1e398576d55df8e5888dd1a9545ed3fd99532), process ceph-osd, pid 5400
2013-06-12 17:59:27.811620 7fb1d15817c0 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-1: (5) Input/output error
2013-06-12 18:57:27.689176 7f245f9957c0  0 ceph version 0.61.3 (92b1e398576d55df8e5888dd1a9545ed3fd99532), process ceph-osd, pid 4953
2013-06-12 18:57:27.702866 7f245f9957c0 -1  ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-1: (2) No such file or directory

==> /var/log/ceph/ceph-osd.5.log.5 <==
2013-06-12 18:20:04.369439 7f984a5e77c0 -1  ** ERROR: error converting store /var/lib/ceph/osd/ceph-5: (1) Operation not permitted
2013-06-12 18:24:20.252341 7f88625bc7c0  0 ceph version 0.61.3 (92b1e398576d55df8e5888dd1a9545ed3fd99532), process ceph-osd, pid 4955
2013-06-12 18:24:20.394257 7f88625bc7c0  0 filestore(/var/lib/ceph/osd/ceph-5) mount FIEMAP ioctl is supported and appears to work
2013-06-12 18:24:20.394284 7f88625bc7c0  0 filestore(/var/lib/ceph/osd/ceph-5) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-06-12 18:24:20.394829 7f88625bc7c0  0 filestore(/var/lib/ceph/osd/ceph-5) mount did NOT detect btrfs
2013-06-12 18:24:20.402567 7f88625bc7c0  0 filestore(/var/lib/ceph/osd/ceph-5) mount syncfs(2) syscall fully supported (by glibc and kernel)
2013-06-12 18:24:20.402781 7f88625bc7c0  0 filestore(/var/lib/ceph/osd/ceph-5) mount found snaps <>
2013-06-12 18:24:20.457881 7f88625bc7c0 -1 filestore(/var/lib/ceph/osd/ceph-5) Error initializing leveldb: Corruption: checksum mismatch
2013-06-12 18:24:20.457944 7f88625bc7c0 -1  ** ERROR: error converting store /var/lib/ceph/osd/ceph-5: (1) Operation not permitted
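Since the journal file gets created but the data disk is never partitioned or mounted, my next step is probably to run the steps ceph-deploy drives by hand on kitt and see where it stops. Roughly (just a sketch: it assumes the ceph-disk helper that ceph-deploy calls, flags vary a bit between versions, and the partition number is a guess):

    parted /dev/sdb print                 # did a partition table and a data partition actually get written?
    mount | grep /var/lib/ceph/osd        # is anything mounted where the OSD data should live?
    ceph-disk prepare --zap-disk /dev/sdb /var/lib/ceph/osd/ceph-sdb.journal   # redo the prepare step directly
    ceph-disk activate /dev/sdb1          # then try to activate the data partition by hand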
Any help is greatly appreciated, as I am really stumped.

I think my biggest frustration is the init scripts not working as described in the docs. After I use ceph-deploy, do I need to write a config file? Based on my interpretation of the docs and the upstart scripts, I don't think so; the respective daemons should start on boot...

Thanks for your time,
Jon A

---------- Forwarded message ----------
From: Jon <three18ti@xxxxxxxxx>
Date: Sun, Jun 9, 2013 at 12:36 PM
Subject: Help Recovering Ceph cluster
To: ceph-users <ceph-users@xxxxxxxx>
Hello All,
I have deployed Ceph with ceph-deploy on three nodes. Each node is currently playing double duty as both a monitor node and an OSD node. This is what my cluster looked like before I rebooted the one node:

>> root@red6:~# ceph -s
>>    health HEALTH_OK
>>    monmap e2: 3 mons at {kitt=192.168.0.35:6789/0,red6=192.168.0.40:6789/0,shepard=192.168.0.2:6789/0}, election epoch 10, quorum 0,1,2 kitt,red6,shepard
>>    osdmap e29: 7 osds: 7 up, 7 in
>>    pgmap v1692: 192 pgs: 192 active+clean; 19935 MB data, 40441 MB used, 4539 GB / 4580 GB avail; 73B/s rd, 0op/s
>>    mdsmap e1: 0/0/1 up

Last week I rebooted one of my nodes (kitt), and after it came back up I was unable to get it to rejoin the cluster. Typically I've been able to use ceph-deploy to delete the monitor, deploy it again, and everything goes back to normal. That didn't work this time, so that monitor and those two OSDs were out of the cluster. I think when I rebooted the node I got errors that the mon was unreachable and it was taken out of the monmap. When I tried to delete the monitor, ceph-deploy told me there was no monitor to delete. When I ran ceph-deploy to deploy the monitor, I received no errors, but the monitor never started.

I've tried a number of things from the docs, but something seems amiss, because when I try to restart monitors or OSDs the init script tells me they are not found. (I've copied my ceph.conf at the end of this e-mail.)

>> root@shepard:~# ls /var/lib/ceph/mon/
>> ceph-shepard
>> root@shepard:~# /etc/init.d/ceph restart mon.shepard
>> /etc/init.d/ceph: mon.shepard not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )

(I've also tried mon.0 .. mon.3.)

Same thing with OSDs:

>> root@shepard:~# ls /var/lib/ceph/osd/
>> ceph-3  ceph-4  ceph-6  sdb.journal  sdc.journal  sdd.journal
>> root@shepard:~# /etc/init.d/ceph restart osd.4
>> /etc/init.d/ceph: osd.4 not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )

(I've tried every iteration I can think of: `service ceph restart`, the same with the -a flag, e.g. `service ceph -a restart`, and stop/start instead of restart, also with and without the -a flag.)

That's about where I left it last week, before my work week started. Other than occasionally reporting something to the effect of:

>> 2013-06-09 11:46:14.978350 7f8098e9a700  0 -- :/18915 >> 192.168.0.35:6789/0 pipe(0x7f8090002ca0 sd=4 :0 s=1 pgs=0 cs=0 l=1).fault

when I ran ceph -s, ceph health reported OK, so I figured it was OK to leave until I was off shift and had the cycles to devote to it... I really do know better.

Last night I experienced a power outage and all of my nodes hard rebooted. Now when I run ceph health (or use any of the ceph or ceph-deploy tools), I get the following error messages:

>> 2013-06-09 12:12:14.692484 7f4506e0e700  0 -- :/24787 >> 192.168.0.40:6789/0 pipe(0x7f44fc000c00 sd=4 :0 s=1 pgs=0 cs=0 l=1).fault
>> 2013-06-09 12:12:17.692509 7f450d5a0700  0 -- :/24787 >> 192.168.0.35:6789/0 pipe(0x7f44fc003010 sd=4 :0 s=1 pgs=0 cs=0 l=1).fault
>> 2013-06-09 12:12:20.692469 7f4506e0e700  0 -- :/24787 >> 192.168.0.2:6789/0 pipe(0x7f44fc0038c0 sd=4 :0 s=1 pgs=0 cs=0 l=1).fault

This was run on the shepard node (192.168.0.2), but I get the same error from all of the nodes.

I do see a number of ceph processes:

>> root@shepard:~# ps aux | grep ceph
>> root      1041  0.0  0.0  33680  7100 ?        Ss   06:11   0:18 /usr/bin/python /usr/sbin/ceph-create-keys --cluster=ceph -i shepard
>> root      1538  0.0  0.0 378904  4284 ?        Sl   06:11   0:04 ceph --cluster=ceph --name=osd.3 --keyring=/var/lib/ceph/osd/ceph-3/keyring osd crush create-or-move -- 3 0.45 root=default host=shepard
>> root      1581  0.0  0.0 378904  4300 ?        Sl   06:11   0:04 ceph --cluster=ceph --name=osd.6 --keyring=/var/lib/ceph/osd/ceph-6/keyring osd crush create-or-move -- 6 0.91 root=default host=shepard
>> root      1628  0.0  0.0 378904  4296 ?        Sl   06:11   0:04 ceph --cluster=ceph --name=osd.4 --keyring=/var/lib/ceph/osd/ceph-4/keyring osd crush create-or-move -- 4 0.45 root=default host=shepard

However, I don't see the monitor process actually running.

My first question is: how can I restart ceph processes?

My second question is: am I on the right track here? My hypothesis is that starting the monitors will get things going again, because nothing is listening on those ports:

>> root@shepard:~# telnet 192.168.0.2 6789
>> Trying 192.168.0.2...
>> telnet: Unable to connect to remote host: Connection refused
>> root@shepard:~# netstat -anp | grep ceph
>> root@shepard:~# netstat -anp | grep 6789

Actually, I just rebooted this node again and now the OSDs are not running either; only ceph-create-keys is.

My last question is: where does ceph-deploy create the configs? I have the original files in the directory where I ran ceph-deploy, and I know about the /etc/ceph/ceph.conf file, but there seems to be some other config that the cluster is pulling from. Maybe I'm mistaken, but there are no OSDs in my ceph.conf.

Thanks for all your help.
Jon A

ceph.conf file:

[global]
fsid = 86f293d0-bec7-4694-ace7-af3fbaa98736
mon_initial_members = red6, kitt, shepard
mon_host = 192.168.0.40,192.168.0.35,192.168.0.2
auth_supported = cephx
osd_journal_size = 1024
filestore_xattr_use_omap = true
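P.S. If I'm reading things right, the sysvinit /etc/init.d/ceph script only knows about daemons that have [mon.*] / [osd.*] sections in ceph.conf, which ceph-deploy doesn't write; on Ubuntu the ceph-deploy-created daemons are supposed to be driven by upstart instead. So I'm guessing restarting them should look roughly like this (just a sketch; stock upstart job names, ids taken from the /var/lib/ceph directory names above):

    initctl list | grep ceph    # which ceph upstart jobs exist and whether they're running
    start ceph-mon id=shepard   # start a single monitor; id = the /var/lib/ceph/mon directory name minus "ceph-"
    start ceph-osd id=4         # start a single OSD by id
    start ceph-all              # or start every ceph daemon configured on this host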
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com