Re: OSDs stuck in booting state on CentOS 7.2.1511 and ceph infernalis 9.2.0

Bryan,

Once the rest of the cluster was updated to v0.94.5, it appears the OSDs on the one host running infernalis v9.2.0 are now booting.

Bob

On Fri, Dec 18, 2015 at 3:44 PM, Bob R <bobr@xxxxxxxxxxxxxx> wrote:
Bryan,

I rebooted another host which wasn't updated to CentOS 7.2, and those OSDs also failed to come out of the booting state. I thought I'd restarted each OSD host after upgrading them to infernalis, but I must have been mistaken: after running 'ceph tell osd.* version' I saw we were on a mix of v0.94.1, v0.94.2, v0.94.4, and v0.94.5. I've downgraded the two hosts we were having problems with to hammer v0.94.5, and once the cluster is happy again we will try upgrading again.
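For what it's worth, the version mix can be summarized with a short pipeline. This is just a sketch: the sample lines below are illustrative stand-ins, not output from this cluster; on a live cluster you would pipe the real command output instead.

```shell
# Sketch: summarize the version mix reported by 'ceph tell osd.* version'.
# On a live cluster you would run the real command:
#   ceph tell osd.* version | grep '"version"' | sort | uniq -c
# Note the sort before uniq -c: uniq only counts adjacent duplicates.
sample='    "version": "ceph version 0.94.1 (sha1)"
    "version": "ceph version 0.94.5 (sha1)"
    "version": "ceph version 0.94.5 (sha1)"'
printf '%s\n' "$sample" | sort | uniq -c | sort -rn
```

The trailing 'sort -rn' puts the most common version first, which makes stragglers easy to spot.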

Good luck.

Bob

On Fri, Dec 18, 2015 at 3:21 PM, Stillwell, Bryan <bryan.stillwell@xxxxxxxxxxx> wrote:
I ran into a similar problem while in the middle of upgrading from Hammer (0.94.5) to Infernalis (9.2.0).  I decided to try rebuilding one of the OSDs using 'ceph-disk prepare /dev/sdb', but it never comes up:

root@b3:~# ceph daemon osd.10 status
{
    "cluster_fsid": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "osd_fsid": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
    "whoami": 10,
    "state": "booting",
    "oldest_map": 25804,
    "newest_map": 25904,
    "num_pgs": 0
}

Here's what is written to /var/log/ceph/osd/ceph-osd.10.log:

2015-12-18 16:09:48.928462 7fd5e2bec940  0 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299), process ceph-osd, pid 6866
2015-12-18 16:09:48.931387 7fd5e2bec940  1 filestore(/var/lib/ceph/tmp/mnt.IOnlxY) mkfs in /var/lib/ceph/tmp/mnt.IOnlxY
2015-12-18 16:09:48.931417 7fd5e2bec940  1 filestore(/var/lib/ceph/tmp/mnt.IOnlxY) mkfs fsid is already set to xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2015-12-18 16:09:48.931422 7fd5e2bec940  1 filestore(/var/lib/ceph/tmp/mnt.IOnlxY) write_version_stamp 4
2015-12-18 16:09:48.932671 7fd5e2bec940  0 filestore(/var/lib/ceph/tmp/mnt.IOnlxY) backend xfs (magic 0x58465342)
2015-12-18 16:09:48.934953 7fd5e2bec940  1 filestore(/var/lib/ceph/tmp/mnt.IOnlxY) leveldb db exists/created
2015-12-18 16:09:48.935082 7fd5e2bec940  1 journal _open /var/lib/ceph/tmp/mnt.IOnlxY/journal fd 11: 1072693248 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-12-18 16:09:48.935218 7fd5e2bec940 -1 journal check: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, invalid (someone else's?) journal
2015-12-18 16:09:48.935227 7fd5e2bec940  1 journal close /var/lib/ceph/tmp/mnt.IOnlxY/journal
2015-12-18 16:09:48.935452 7fd5e2bec940  1 journal _open /var/lib/ceph/tmp/mnt.IOnlxY/journal fd 11: 1072693248 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-12-18 16:09:48.935771 7fd5e2bec940  0 filestore(/var/lib/ceph/tmp/mnt.IOnlxY) mkjournal created journal on /var/lib/ceph/tmp/mnt.IOnlxY/journal
2015-12-18 16:09:48.935803 7fd5e2bec940  1 filestore(/var/lib/ceph/tmp/mnt.IOnlxY) mkfs done in /var/lib/ceph/tmp/mnt.IOnlxY
2015-12-18 16:09:48.935919 7fd5e2bec940  0 filestore(/var/lib/ceph/tmp/mnt.IOnlxY) backend xfs (magic 0x58465342)
2015-12-18 16:09:48.936548 7fd5e2bec940  0 genericfilestorebackend(/var/lib/ceph/tmp/mnt.IOnlxY) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-12-18 16:09:48.936559 7fd5e2bec940  0 genericfilestorebackend(/var/lib/ceph/tmp/mnt.IOnlxY) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2015-12-18 16:09:48.936588 7fd5e2bec940  0 genericfilestorebackend(/var/lib/ceph/tmp/mnt.IOnlxY) detect_features: splice is supported
2015-12-18 16:09:48.938319 7fd5e2bec940  0 genericfilestorebackend(/var/lib/ceph/tmp/mnt.IOnlxY) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2015-12-18 16:09:48.938394 7fd5e2bec940  0 xfsfilestorebackend(/var/lib/ceph/tmp/mnt.IOnlxY) detect_features: extsize is supported and your kernel >= 3.5
2015-12-18 16:09:48.940420 7fd5e2bec940  0 filestore(/var/lib/ceph/tmp/mnt.IOnlxY) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-12-18 16:09:48.940646 7fd5e2bec940  1 journal _open /var/lib/ceph/tmp/mnt.IOnlxY/journal fd 17: 1072693248 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-12-18 16:09:48.940865 7fd5e2bec940  1 journal _open /var/lib/ceph/tmp/mnt.IOnlxY/journal fd 17: 1072693248 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-12-18 16:09:48.941270 7fd5e2bec940  1 filestore(/var/lib/ceph/tmp/mnt.IOnlxY) upgrade
2015-12-18 16:09:48.941389 7fd5e2bec940 -1 filestore(/var/lib/ceph/tmp/mnt.IOnlxY) could not find -1/23c2fcde/osd_superblock/0 in index: (2) No such file or directory
2015-12-18 16:09:48.945392 7fd5e2bec940  1 journal close /var/lib/ceph/tmp/mnt.IOnlxY/journal
2015-12-18 16:09:48.946175 7fd5e2bec940 -1 created object store /var/lib/ceph/tmp/mnt.IOnlxY journal /var/lib/ceph/tmp/mnt.IOnlxY/journal for osd.10 fsid xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
2015-12-18 16:09:48.946269 7fd5e2bec940 -1 auth: error reading file: /var/lib/ceph/tmp/mnt.IOnlxY/keyring: can't open /var/lib/ceph/tmp/mnt.IOnlxY/keyring: (2) No such file or directory
2015-12-18 16:09:48.946623 7fd5e2bec940 -1 created new key in keyring /var/lib/ceph/tmp/mnt.IOnlxY/keyring
2015-12-18 16:09:50.698753 7fb5db130940  0 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299), process ceph-osd, pid 7045
2015-12-18 16:09:50.745427 7fb5db130940  0 filestore(/var/lib/ceph/osd/ceph-10) backend xfs (magic 0x58465342)
2015-12-18 16:09:50.745978 7fb5db130940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-10) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-12-18 16:09:50.745987 7fb5db130940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-10) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2015-12-18 16:09:50.746012 7fb5db130940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-10) detect_features: splice is supported
2015-12-18 16:09:50.746517 7fb5db130940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-10) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2015-12-18 16:09:50.746616 7fb5db130940  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-10) detect_features: extsize is supported and your kernel >= 3.5
2015-12-18 16:09:50.748775 7fb5db130940  0 filestore(/var/lib/ceph/osd/ceph-10) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-12-18 16:09:50.749005 7fb5db130940  1 journal _open /var/lib/ceph/osd/ceph-10/journal fd 19: 1072693248 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-12-18 16:09:50.749256 7fb5db130940  1 journal _open /var/lib/ceph/osd/ceph-10/journal fd 19: 1072693248 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-12-18 16:09:50.749632 7fb5db130940  1 filestore(/var/lib/ceph/osd/ceph-10) upgrade
2015-12-18 16:09:50.783188 7fb5db130940  0 <cls> cls/cephfs/cls_cephfs.cc:136: loading cephfs_size_scan
2015-12-18 16:09:50.851735 7fb5db130940  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
2015-12-18 16:09:50.851807 7fb5db130940  0 osd.10 0 crush map has features 33816576, adjusting msgr requires for clients
2015-12-18 16:09:50.851818 7fb5db130940  0 osd.10 0 crush map has features 33816576 was 8705, adjusting msgr requires for mons
2015-12-18 16:09:50.851821 7fb5db130940  0 osd.10 0 crush map has features 33816576, adjusting msgr requires for osds
2015-12-18 16:09:50.851965 7fb5db130940  0 osd.10 0 load_pgs
2015-12-18 16:09:50.851988 7fb5db130940  0 osd.10 0 load_pgs opened 0 pgs
2015-12-18 16:09:50.852822 7fb5db130940 -1 osd.10 0 log_to_monitors {default=true}
2015-12-18 16:09:50.870133 7fb5c7f39700  0 osd.10 0 ignoring osdmap until we have initialized
2015-12-18 16:09:50.870409 7fb5db130940  0 osd.10 0 done with init, starting boot process
2015-12-18 16:09:50.873357 7fb5c7f39700  0 osd.10 25804 crush map has features 104186773504, adjusting msgr requires for clients
2015-12-18 16:09:50.873368 7fb5c7f39700  0 osd.10 25804 crush map has features 379064680448 was 33825281, adjusting msgr requires for mons
2015-12-18 16:09:50.873374 7fb5c7f39700  0 osd.10 25804 crush map has features 379064680448, adjusting msgr requires for osds
2015-12-18 16:09:50.873377 7fb5c7f39700  0 osd.10 25804 check_osdmap_features enabling on-disk ERASURE CODES compat feature
2015-12-18 16:09:50.876187 7fb5c7f39700  0 log_channel(cluster) log [WRN] : failed to encode map e25805 with expected crc
2015-12-18 16:09:50.879534 7fb5c7f39700  0 log_channel(cluster) log [WRN] : failed to encode map e25805 with expected crc
2015-12-18 16:09:50.950405 7fb5c7f39700  0 log_channel(cluster) log [WRN] : failed to encode map e25905 with expected crc
2015-12-18 16:09:50.983355 7fb5c7f39700  0 log_channel(cluster) log [WRN] : failed to encode map e25905 with expected crc

I'm running this on Ubuntu 14.04.3 with the linux-image-generic-lts-wily kernel (4.2.0-21.25~14.04.1).

Are you running a mixed-version cluster right now too?  For example, this is my cluster at the moment:

root@b1:~# ceph tell osd.* version | grep version | uniq -c
osd.10: Error ENXIO: problem getting command descriptions from osd.10
osd.10: problem getting command descriptions from osd.10
     11     "version": "ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)"
     15     "version": "ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)"

Bryan

From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Bob R <bobr@xxxxxxxxxxxxxx>
Date: Wednesday, December 16, 2015 at 11:45 AM
To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject: OSDs stuck in booting state on CentOS 7.2.1511 and ceph infernalis 9.2.0

We've been operating a cluster relatively incident-free since 0.86. On Monday I did a yum update on one node, ceph00, and after rebooting we're seeing every OSD on it stuck in the 'booting' state. I've tried removing all of the OSDs and recreating them with ceph-deploy (ceph-disk required a modification to use partx -a rather than partprobe), but we see the same status. I'm not sure how to troubleshoot this further. The OSDs on this host are now running as the ceph user, which may be related, as the other three hosts are still running as root (although I followed the steps listed to upgrade from hammer to infernalis and did chown -R ceph:ceph /var/lib/ceph on each node).
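One quick sanity check after the chown step is to look for leftover files not owned by the expected user. The helper below is only a sketch (the function name and parameters are mine, not from any Ceph tooling); on an infernalis node you would run it as root with the directory /var/lib/ceph and owner ceph.

```shell
# Sketch: list files under a Ceph state directory that are NOT owned by
# the expected user. After 'chown -R ceph:ceph /var/lib/ceph' this should
# print nothing; any output points at files the OSD daemon may be unable
# to open when running as the ceph user.
check_ownership() {
    dir=$1    # e.g. /var/lib/ceph
    owner=$2  # e.g. ceph
    find "$dir" ! -user "$owner"
}
```

Usage on a node would look like: check_ownership /var/lib/ceph ceph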

[root@ceph00 ceph]# lsb_release -idrc
Distributor ID: CentOS
Description:    CentOS Linux release 7.2.1511 (Core)
Release:        7.2.1511
Codename:       Core

[root@ceph00 ceph]# ceph --version
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)

[root@ceph00 ceph]# ceph daemon osd.0 status
{
    "cluster_fsid": "2e4ea2c0-fb62-41fa-b7b7-e34d759b851e",
    "osd_fsid": "ddf659ad-a3db-4094-b4d0-7d50f34b8f75",
    "whoami": 0,
    "state": "booting",
    "oldest_map": 25243,
    "newest_map": 26610,
    "num_pgs": 0
}

[root@ceph00 ceph]# ceph daemon osd.3 status
{
    "cluster_fsid": "2e4ea2c0-fb62-41fa-b7b7-e34d759b851e",
    "osd_fsid": "8b1acd8a-645d-4dc2-8c1d-6dbb1715265f",
    "whoami": 3,
    "state": "booting",
    "oldest_map": 25243,
    "newest_map": 26612,
    "num_pgs": 0
}

[root@ceph00 ceph]# ceph osd tree
ID  WEIGHT    TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
-23   1.43999 root ssd
-19         0     host ceph00_ssd
-20   0.48000     host ceph01_ssd
 40   0.48000         osd.40           up  1.00000          1.00000
-21   0.48000     host ceph02_ssd
 43   0.48000         osd.43           up  1.00000          1.00000
-22   0.48000     host ceph03_ssd
 41   0.48000         osd.41           up  1.00000          1.00000
 -1 120.00000 root default
-17  80.00000     room b1
-14  40.00000         host ceph01
  1   4.00000             osd.1        up  1.00000          1.00000
  4   4.00000             osd.4        up  1.00000          1.00000
 18   4.00000             osd.18       up  1.00000          1.00000
 19   4.00000             osd.19       up  1.00000          1.00000
 20   4.00000             osd.20       up  1.00000          1.00000
 21   4.00000             osd.21       up  1.00000          1.00000
 22   4.00000             osd.22       up  1.00000          1.00000
 23   4.00000             osd.23       up  1.00000          1.00000
 24   4.00000             osd.24       up  1.00000          1.00000
 25   4.00000             osd.25       up  1.00000          1.00000
-16  40.00000         host ceph03
 30   4.00000             osd.30       up  1.00000          1.00000
 31   4.00000             osd.31       up  1.00000          1.00000
 32   4.00000             osd.32       up  1.00000          1.00000
 33   4.00000             osd.33       up  1.00000          1.00000
 34   4.00000             osd.34       up  1.00000          1.00000
 35   4.00000             osd.35       up  1.00000          1.00000
 36   4.00000             osd.36       up  1.00000          1.00000
 37   4.00000             osd.37       up  1.00000          1.00000
 38   4.00000             osd.38       up  1.00000          1.00000
 39   4.00000             osd.39       up  1.00000          1.00000
-18  40.00000     room b2
-13         0         host ceph00
-15  40.00000         host ceph02
  2   4.00000             osd.2        up  1.00000          1.00000
  5   4.00000             osd.5        up  1.00000          1.00000
 14   4.00000             osd.14       up  1.00000          1.00000
 15   4.00000             osd.15       up  1.00000          1.00000
 16   4.00000             osd.16       up  1.00000          1.00000
 17   4.00000             osd.17       up  1.00000          1.00000
 26   4.00000             osd.26       up  1.00000          1.00000
 27   4.00000             osd.27       up  1.00000          1.00000
 28   4.00000             osd.28       up  1.00000          1.00000
 29   4.00000             osd.29       up  1.00000          1.00000
  0         0 osd.0                  down        0          1.00000
  3         0 osd.3                  down        0          1.00000
  6         0 osd.6                  down        0          1.00000
  7         0 osd.7                  down        0          1.00000
  8         0 osd.8                  down        0          1.00000
  9         0 osd.9                  down        0          1.00000
 10         0 osd.10                 down        0          1.00000
 11         0 osd.11                 down        0          1.00000
 12         0 osd.12                 down        0          1.00000
 13         0 osd.13                 down        0          1.00000


Any assistance is greatly appreciated.

Bob





_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
