I would re-create the MON; that's usually faster than debugging the root cause.
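The "Bad table magic number" abort means the MON's local RocksDB store is corrupted on disk. Since the other two MONs still hold quorum, the broken one can be wiped and rebuilt, and it will resync from its peers. A rough sketch, assuming default paths and an admin keyring available on that node (double-check against the Octopus docs on adding/removing monitors before running anything):

# systemctl stop ceph-mon@cluster-r820-1
# mv /var/lib/ceph/mon/ceph-cluster-r820-1 /var/lib/ceph/mon/ceph-cluster-r820-1.bak
# ceph auth get mon. -o /tmp/mon.keyring
# ceph mon getmap -o /tmp/monmap
# ceph-mon --mkfs -i cluster-r820-1 --monmap /tmp/monmap --keyring /tmp/mon.keyring
# chown -R ceph:ceph /var/lib/ceph/mon/ceph-cluster-r820-1
# systemctl start ceph-mon@cluster-r820-1

Since cluster-r820-1 is still in the monmap it should sync up and rejoin quorum on start; if it doesn't, remove and re-add it (ceph mon remove / ceph mon add) as described in the docs. Keep the .bak copy until the MON is back in quorum, and check dmesg/SMART on that disk first: a corrupted SST file often points at failing hardware.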
Quoting Hervé Ballans <herve.ballans@xxxxxxxxxxxxx>:
Hi,
We're encountering a strange issue on one of our Ceph clusters.
The cluster has been running without any problems for many years. But
recently, after an upgrade from Octopus 15.2.1 to 15.2.4, everything
is OK except that one of the 3 MONs refuses to start!
journalctl shows problems with the RocksDBStore, specifically a
"Bad table magic number" error.
On the other nodes the MONs start correctly, and so do all the other
services: MDS, MGR and OSD (we just have 2 OSDs stopped due to a
hardware disk problem).
Some OSDs are nearfull, but I don't think that's related to the MON
issue?
Could someone help me with this issue, please?
Thanks in advance,
Hervé
Here are a few logs:
# ceph versions
{
    "mon": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 3
    },
    "mgr": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 3
    },
    "osd": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 131
    },
    "mds": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 3
    },
    "overall": {
        "ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)": 140
    }
}
# ceph -s
  cluster:
    id:     00ff6861-f95c-4740-a570-da6ad6a261cd
    health: HEALTH_WARN
            1/3 mons down, quorum inf-ceph-mds,cluster-r730-k80-1
            8 backfillfull osd(s)
            18 nearfull osd(s)
            Low space hindering backfill (add storage if this doesn't resolve itself): 21 pgs backfill_toofull
            Degraded data redundancy: 5570/50534523 objects degraded (0.011%), 2 pgs degraded, 2 pgs undersized
            2 pool(s) backfillfull
            1 pool(s) nearfull
            25 daemons have recently crashed

  services:
    mon: 3 daemons, quorum inf-ceph-mds,cluster-r730-k80-1 (age 44h), out of quorum: cluster-r820-1
    mgr: cluster-r730-k80-1(active, since 44h), standbys: inf-ceph-mds, cluster-r820-1
    mds: cephfs_cluster:2 {0=cluster-r730-k80-1=up:active,1=inf-ceph-mds=up:active} 1 up:standby
    osd: 133 osds: 131 up (since 42h), 131 in (since 42h); 21 remapped pgs

  task status:
    scrub status:
        mds.cluster-r730-k80-1: idle
        mds.inf-ceph-mds: idle

  data:
    pools:   3 pools, 4129 pgs
    objects: 16.84M objects, 47 TiB
    usage:   143 TiB used, 33 TiB / 176 TiB avail
    pgs:     5570/50534523 objects degraded (0.011%)
             75039/50534523 objects misplaced (0.148%)
             4104 active+clean
             19   active+remapped+backfill_toofull
             3    active+clean+scrubbing+deep
             2    active+undersized+degraded+remapped+backfill_toofull
             1    active+clean+scrubbing

  io:
    client: 19 MiB/s rd, 26 MiB/s wr, 5 op/s rd, 11 op/s wr
# systemctl status ceph-mon@cluster-r820-1.service
● ceph-mon@cluster-r820-1.service - Ceph cluster monitor daemon
   Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Fri 2020-07-10 11:44:33 CEST; 24min ago
  Process: 201002 ExecStart=/usr/bin/ceph-mon -f --cluster ${CLUSTER} --id cluster-r820-1 --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
 Main PID: 201002 (code=killed, signal=ABRT)

Jul 10 11:44:33 cluster-r820-1 systemd[1]: ceph-mon@cluster-r820-1.service: Service RestartSec=10s expired, scheduling restart.
Jul 10 11:44:33 cluster-r820-1 systemd[1]: ceph-mon@cluster-r820-1.service: Scheduled restart job, restart counter is at 5.
Jul 10 11:44:33 cluster-r820-1 systemd[1]: Stopped Ceph cluster monitor daemon.
Jul 10 11:44:33 cluster-r820-1 systemd[1]: ceph-mon@cluster-r820-1.service: Start request repeated too quickly.
Jul 10 11:44:33 cluster-r820-1 systemd[1]: ceph-mon@cluster-r820-1.service: Failed with result 'signal'.
Jul 10 11:44:33 cluster-r820-1 systemd[1]: Failed to start Ceph cluster monitor daemon.
# journalctl --follow
Jul 10 11:43:38 cluster-r820-1 systemd[1]: Started Ceph cluster monitor daemon.
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: /build/ceph-15.2.4/src/kv/RocksDBStore.cc: In function 'virtual int RocksDBStore::get(const string&, const string&, ceph::bufferlist*)' thread 7fe5331a65c0 time 2020-07-10T11:43:39.192838+0200
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: /build/ceph-15.2.4/src/kv/RocksDBStore.cc: 1152: ceph_abort_msg("Bad table magic number: expected 9863518390377041911, found 5836668660487291005 in /var/lib/ceph/mon/ceph-cluster-r820-1/store.db/716151.sst")
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe1) [0x7fe534aea7c6]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 2: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3ce) [0x559f6081b2ae]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 3: (main()+0x108d) [0x559f605a85fd]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 4: (__libc_start_main()+0xeb) [0x7fe5336ba09b]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 5: (_start()+0x2a) [0x559f605b93da]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: *** Caught signal (Aborted) **
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: in thread 7fe5331a65c0 thread_name:ceph-mon
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable)
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 1: (()+0x12730) [0x7fe533bea730]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 2: (gsignal()+0x10b) [0x7fe5336cd7bb]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 3: (abort()+0x121) [0x7fe5336b8535]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b2) [0x7fe534aea897]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 5: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::v15_2_0::list*)+0x3ce) [0x559f6081b2ae]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 6: (main()+0x108d) [0x559f605a85fd]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 7: (__libc_start_main()+0xeb) [0x7fe5336ba09b]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: 8: (_start()+0x2a) [0x559f605b93da]
Jul 10 11:43:39 cluster-r820-1 ceph-mon[200827]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
[... the same abort message and stack traces repeated ...]
Jul 10 11:43:39 cluster-r820-1 systemd[1]: ceph-mon@cluster-r820-1.service: Main process exited, code=killed, status=6/ABRT
Jul 10 11:43:39 cluster-r820-1 systemd[1]: ceph-mon@cluster-r820-1.service: Failed with result 'signal'.
Jul 10 11:43:49 cluster-r820-1 systemd[1]: ceph-mon@cluster-r820-1.service: Service RestartSec=10s expired, scheduling restart.
...
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx