Hi,
We are encountering a strange issue on one of our Ceph clusters.
The cluster has been running without any concern for many years, but
recently, after an upgrade from Octopus 15.2.1 to 15.2.4, everything
is fine except that one of the 3 MONs refuses to start!
journalctl shows problems in the RocksDBStore, in particular a "Bad
table magic number" error.
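If it helps the diagnosis, I suppose the store could be inspected
offline with ceph-kvstore-tool (the path is the one from the logs
below; I have not tried this yet, so please correct me if it is not
the right tool):

# systemctl stop ceph-mon@cluster-r820-1
# ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-cluster-r820-1/store.db list

If listing aborts with the same "Bad table magic number", the man page
also mentions a destructive-repair command, but since it makes no
consistency guarantees I would only run it on a copy of store.db:

# ceph-kvstore-tool rocksdb /path/to/copy/of/store.db destructive-repair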
On the other nodes, the MONs start correctly, and so do all the other
services (MDS, MGR and OSDs; we just have 2 OSDs stopped due to a
hardware disk problem).
Some OSDs are nearfull, but I don't think that is linked to the MON
issue, is it?
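As a side note, to unblock the pgs stuck in backfill_toofull while we
replace the failed disks, I was considering temporarily raising the
backfill threshold (0.90 by default, if I read the docs correctly) and
lowering it back once backfill finishes, e.g.:

# ceph osd set-backfillfull-ratio 0.92

But I would rather sort out the MON first.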
Could someone help me with this issue, please?
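Since the two remaining MONs still form a quorum, my current idea would
be to discard the corrupted store and let this MON re-sync from its
peers, along the lines of the documented remove/re-add procedure (the
hostname and paths are from our setup; this is only a sketch, I have
not run it yet):

# systemctl stop ceph-mon@cluster-r820-1
# mv /var/lib/ceph/mon/ceph-cluster-r820-1 /root/mon-cluster-r820-1.broken
(then from a node that is in quorum)
# ceph mon remove cluster-r820-1
# ceph mon getmap -o /tmp/monmap
# ceph auth get mon. -o /tmp/mon.keyring
(then back on cluster-r820-1)
# mkdir /var/lib/ceph/mon/ceph-cluster-r820-1
# ceph-mon -i cluster-r820-1 --mkfs --monmap /tmp/monmap --keyring /tmp/mon.keyring
# chown -R ceph:ceph /var/lib/ceph/mon/ceph-cluster-r820-1
# systemctl start ceph-mon@cluster-r820-1

Would that be a safe approach here, or is there something gentler to
try first on the existing store?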
Thanks in advance,
Hervé
Here are a few logs:
# ceph versions
{
"mon": {
"ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c)
octopus (stable)": 3
},
"mgr": {
"ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c)
octopus (stable)": 3
},
"osd": {
"ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c)
octopus (stable)": 131
},
"mds": {
"ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c)
octopus (stable)": 3
},
"overall": {
"ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c)
octopus (stable)": 140
}
}
# ceph -s
cluster:
id: 00ff6861-f95c-4740-a570-da6ad6a261cd
health: HEALTH_WARN
1/3 mons down, quorum inf-ceph-mds,cluster-r730-k80-1
8 backfillfull osd(s)
18 nearfull osd(s)
Low space hindering backfill (add storage if this doesn't
resolve itself): 21 pgs backfill_toofull
Degraded data redundancy: 5570/50534523 objects degraded
(0.011%), 2 pgs degraded, 2 pgs undersized
2 pool(s) backfillfull
1 pool(s) nearfull
25 daemons have recently crashed
services:
mon: 3 daemons, quorum inf-ceph-mds,cluster-r730-k80-1 (age 44h),
out of quorum: cluster-r820-1
mgr: cluster-r730-k80-1(active, since 44h), standbys: inf-ceph-mds,
cluster-r820-1
mds: cephfs_cluster:2
{0=cluster-r730-k80-1=up:active,1=inf-ceph-mds=up:active} 1 up:standby
osd: 133 osds: 131 up (since 42h), 131 in (since 42h); 21 remapped pgs
task status:
scrub status:
mds.cluster-r730-k80-1: idle
mds.inf-ceph-mds: idle
data:
pools: 3 pools, 4129 pgs
objects: 16.84M objects, 47 TiB
usage: 143 TiB used, 33 TiB / 176 TiB avail
pgs: 5570/50534523 objects degraded (0.011%)
75039/50534523 objects misplaced (0.148%)
4104 active+clean
19 active+remapped+backfill_toofull
3 active+clean+scrubbing+deep
2 active+undersized+degraded+remapped+backfill_toofull
1 active+clean+scrubbing
io:
client: 19 MiB/s rd, 26 MiB/s wr, 5 op/s rd, 11 op/s wr
# systemctl status ceph-mon@cluster-r820-1.service
● ceph-mon@cluster-r820-1.service - Ceph cluster monitor daemon
Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled;
vendor preset: enabled)
Active: failed (Result: signal) since Fri 2020-07-10 11:44:33 CEST;
24min ago
Process: 201002 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER}
--id cluster-r820-1 --setuser ceph --setgroup ceph (code=killed,
signal=ABRT)
Main PID: 201002 (code=killed, signal=ABRT)
Jul 10 11:44:33 cluster-r820-1 systemd[1]:
ceph-mon@cluster-r820-1.service: Service RestartSec=10s expired,
scheduling restart.
Jul 10 11:44:33 cluster-r820-1 systemd[1]:
ceph-mon@cluster-r820-1.service: Scheduled restart job, restart counter
is at 5.
Jul 10 11:44:33 cluster-r820-1 systemd[1]: Stopped Ceph cluster monitor
daemon.
Jul 10 11:44:33 cluster-r820-1 systemd[1]:
ceph-mon@cluster-r820-1.service: Start request repeated too quickly.
Jul 10 11:44:33 cluster-r820-1 systemd[1]:
ceph-mon@cluster-r820-1.service: Failed with result 'signal'.
Jul 10 11:44:33 cluster-r820-1 systemd[1]: Failed to start Ceph cluster
monitor daemon.
# journalctl --follow
Jul 10 11:43:38 cluster-r820-1 systemd[1]: Started Ceph cluster monitor
daemon.
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:
/build/ceph-15.2.4/src/kv/RocksDBStore.cc: In function 'virtual int
RocksDBStore::get(const string&, const string&, ceph::bufferlist*)'
thread 7fe5331a65c0 time 2020-07-10T11:43:39.192838+0200
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:
/build/ceph-15.2.4/src/kv/RocksDBStore.cc: 1152: ceph_abort_msg("Bad
table magic number: expected 9863518390377041911, found
5836668660487291005 in
/var/lib/ceph/mon/ceph-cluster-r820-1/store.db/716151.sst")
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: ceph version 15.2.4
(7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 1:
(ceph::__ceph_abort(char const*, int, char const*,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&)+0xe1) [0x7fe534aea7c6]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 2:
(RocksDBStore::get(std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3ce)
[0x559f6081b2ae]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 3: (main()+0x108d)
[0x559f605a85fd]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 4:
(__libc_start_main()+0xeb) [0x7fe5336ba09b]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 5: (_start()+0x2a)
[0x559f605b93da]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: *** Caught signal
(Aborted) **
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: in thread 7fe5331a65c0
thread_name:ceph-mon
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:
2020-07-10T11:43:39.192+0200 7fe5331a65c0 -1
/build/ceph-15.2.4/src/kv/RocksDBStore.cc: In function 'virtual int
RocksDBStore::get(const string&, const string&, ceph::bufferlist*)'
thread 7fe5331a65c0 time 2020-07-10T11:43:39.192838+0200
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:
/build/ceph-15.2.4/src/kv/RocksDBStore.cc: 1152: ceph_abort_msg("Bad
table magic number: expected 9863518390377041911, found
5836668660487291005 in
/var/lib/ceph/mon/ceph-cluster-r820-1/store.db/716151.sst")
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: ceph version 15.2.4
(7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 1:
(ceph::__ceph_abort(char const*, int, char const*,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&)+0xe1) [0x7fe534aea7c6]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 2:
(RocksDBStore::get(std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3ce)
[0x559f6081b2ae]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 3: (main()+0x108d)
[0x559f605a85fd]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 4:
(__libc_start_main()+0xeb) [0x7fe5336ba09b]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 5: (_start()+0x2a)
[0x559f605b93da]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: ceph version 15.2.4
(7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 1: (()+0x12730)
[0x7fe533bea730]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 2: (gsignal()+0x10b)
[0x7fe5336cd7bb]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 3: (abort()+0x121)
[0x7fe5336b8535]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 4:
(ceph::__ceph_abort(char const*, int, char const*,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&)+0x1b2) [0x7fe534aea897]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 5:
(RocksDBStore::get(std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3ce)
[0x559f6081b2ae]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 6: (main()+0x108d)
[0x559f605a85fd]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 7:
(__libc_start_main()+0xeb) [0x7fe5336ba09b]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 8: (_start()+0x2a)
[0x559f605b93da]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:
2020-07-10T11:43:39.196+0200 7fe5331a65c0 -1 *** Caught signal (Aborted) **
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: in thread 7fe5331a65c0
thread_name:ceph-mon
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: ceph version 15.2.4
(7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 1: (()+0x12730)
[0x7fe533bea730]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 2: (gsignal()+0x10b)
[0x7fe5336cd7bb]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 3: (abort()+0x121)
[0x7fe5336b8535]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 4:
(ceph::__ceph_abort(char const*, int, char const*,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&)+0x1b2) [0x7fe534aea897]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 5:
(RocksDBStore::get(std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3ce)
[0x559f6081b2ae]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 6: (main()+0x108d)
[0x559f605a85fd]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 7:
(__libc_start_main()+0xeb) [0x7fe5336ba09b]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 8: (_start()+0x2a)
[0x559f605b93da]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: NOTE: a copy of the
executable, or `objdump -rdS <executable>` is needed to interpret this.
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: -1>
2020-07-10T11:43:39.192+0200 7fe5331a65c0 -1
/build/ceph-15.2.4/src/kv/RocksDBStore.cc: In function 'virtual int
RocksDBStore::get(const string&, const string&, ceph::bufferlist*)'
thread 7fe5331a65c0 time 2020-07-10T11:43:39.192838+0200
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]:
/build/ceph-15.2.4/src/kv/RocksDBStore.cc: 1152: ceph_abort_msg("Bad
table magic number: expected 9863518390377041911, found
5836668660487291005 in
/var/lib/ceph/mon/ceph-cluster-r820-1/store.db/716151.sst")
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: ceph version 15.2.4
(7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 1:
(ceph::__ceph_abort(char const*, int, char const*,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&)+0xe1) [0x7fe534aea7c6]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 2:
(RocksDBStore::get(std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3ce)
[0x559f6081b2ae]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 3: (main()+0x108d)
[0x559f605a85fd]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 4:
(__libc_start_main()+0xeb) [0x7fe5336ba09b]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 5: (_start()+0x2a)
[0x559f605b93da]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 0>
2020-07-10T11:43:39.196+0200 7fe5331a65c0 -1 *** Caught signal (Aborted) **
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: in thread 7fe5331a65c0
thread_name:ceph-mon
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: ceph version 15.2.4
(7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 1: (()+0x12730)
[0x7fe533bea730]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 2: (gsignal()+0x10b)
[0x7fe5336cd7bb]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 3: (abort()+0x121)
[0x7fe5336b8535]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 4:
(ceph::__ceph_abort(char const*, int, char const*,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&)+0x1b2) [0x7fe534aea897]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 5:
(RocksDBStore::get(std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&,
std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3ce)
[0x559f6081b2ae]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 6: (main()+0x108d)
[0x559f605a85fd]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 7:
(__libc_start_main()+0xeb) [0x7fe5336ba09b]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 8: (_start()+0x2a)
[0x559f605b93da]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: NOTE: a copy of the
executable, or `objdump -rdS <executable>` is needed to interpret this.
Jul 10 11:43:39 cluster-r820-1 systemd[1]:
ceph-mon@cluster-r820-1.service: Main process exited, code=killed,
status=6/ABRT
Jul 10 11:43:39 cluster-r820-1 systemd[1]:
ceph-mon@cluster-r820-1.service: Failed with result 'signal'.
Jul 10 11:43:49 cluster-r820-1 systemd[1]:
ceph-mon@cluster-r820-1.service: Service RestartSec=10s expired,
scheduling restart.
...