We have had a major problem with our Ceph installation and have been
unable to restore the cluster. The problem started when the 6 OSDs of a
specific node began to be marked down by their peers, while the
marked-down OSDs kept asking the mons to mark them up again. This
flapping went on for some hours (2016-09-09 ~20:00 to 2016-09-10
~03:00), two days ago, and eventually the OSDs were marked down for
good. Following that, possibly due to the load of the cluster recovery
(we are not sure), the master mon was marked down (2016-09-10 ~06:15).
The cluster was not in the best shape, with unfound objects, at the
time we tried to restart the mons.
Since then, the 6 OSDs of that node have been down, failing even to
start as a service, and the cluster has been stuck in recovery.
The specific question is this: of the 6 OSDs, it seems that most, if
not all, of the problematic PGs are on osd.117 (mounted at
/rados/rd0-19-03). When we try to restart it, we get this in the logs:
2016-09-11 01:45:51.071186 7f2c1ae2b880 0 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-osd, pid 9893
2016-09-11 01:45:51.111890 7f2c1ae2b880 0 filestore(/rados/rd0-19-03) backend generic (magic 0xef53)
2016-09-11 01:45:51.112727 7f2c1ae2b880 0 genericfilestorebackend(/rados/rd0-19-03) detect_features: FIEMAP ioctl is supported and appears to work
2016-09-11 01:45:51.112737 7f2c1ae2b880 0 genericfilestorebackend(/rados/rd0-19-03) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2016-09-11 01:45:51.119644 7f2c1ae2b880 0 genericfilestorebackend(/rados/rd0-19-03) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2016-09-11 01:45:51.121748 7f2c1ae2b880 0 filestore(/rados/rd0-19-03) limited size xattrs
2016-09-11 01:45:51.266109 7f2c1ae2b880 0 filestore(/rados/rd0-19-03) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2016-09-11 01:45:51.314951 7f2c1ae2b880 1 journal _open /dev/sde fd 19: 600092704768 bytes, block size 4096 bytes, directio = 1, aio = 1
2016-09-11 01:45:51.353130 7f2c1ae2b880 1 journal _open /dev/sde fd 19: 600092704768 bytes, block size 4096 bytes, directio = 1, aio = 1
2016-09-11 01:45:51.379729 7f2c1ae2b880 0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2016-09-11 01:45:51.400596 7f2c1ae2b880 0 osd.117 284624 load_pgs
2016-09-11 01:46:13.978924 7f2c1ae2b880 -1 osd/PGLog.cc: In function 'static void PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, const pg_info_t&, std::map<eversion_t, hobject_t>&, PGLog::IndexedLog&, pg_missing_t&, std::ostringstream&, std::set<std::basic_string<char> >*)' thread 7f2c1ae2b880 time 2016-09-11 01:46:13.976744
osd/PGLog.cc: 908: FAILED assert(last_e.version.version < e.version.version)
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x76) [0xc0e746]
2: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)+0x1a04) [0x76fe14]
3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x1e2) [0x7f6a02]
4: (OSD::load_pgs()+0xac0) [0x6ab780]
5: (OSD::init()+0x14da) [0x6aefca]
6: (main()+0x2848) [0x633b38]
7: (__libc_start_main()+0xf5) [0x7f2c1814cb45]
8: /usr/bin/ceph-osd() [0x64d7c7]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Judging by the failed assertion, it seems that the version of some
entry in a PG log is creating the problem. We have been unable to
identify which object (or PG) is involved, or how to recover or
isolate it.
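For what it is worth, our reading of the assertion can be sketched as
follows. This is a minimal Python illustration, not Ceph code: we assume
that read_log walks the stored PG log entries in order and requires each
entry's version counter to be strictly greater than that of the previous
entry. EVersion and first_out_of_order are hypothetical names we made up
for the sketch, and the sample entries are invented, not taken from our
cluster.

```python
from typing import List, NamedTuple, Optional, Tuple


class EVersion(NamedTuple):
    """Simplified, hypothetical stand-in for Ceph's eversion_t."""
    epoch: int
    version: int


def first_out_of_order(
    entries: List[EVersion],
) -> Optional[Tuple[int, EVersion, EVersion]]:
    """Return (index, previous, current) for the first entry whose version
    counter is not strictly greater than the previous one, else None."""
    for i in range(1, len(entries)):
        last_e, e = entries[i - 1], entries[i]
        # Same condition as the failed check in the log above:
        #   last_e.version.version < e.version.version
        if not (last_e.version < e.version):
            return i, last_e, e
    return None


# Invented sample data: the third entry repeats version counter 102.
pg_log = [
    EVersion(284600, 101),
    EVersion(284600, 102),
    EVersion(284610, 102),
    EVersion(284624, 103),
]
print(first_out_of_order(pg_log))
```

If this reading is right, a single duplicated or out-of-order entry in
the log of one PG would be enough to abort load_pgs for the whole OSD,
which is why we would like to find the PG involved.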
Any thoughts?
--
------------------------------------------------------------
Panayiotis Gotsis pgotsis@xxxxxxxxxxxx
System Engineer
Network Operations Center
Greek Research & Technology Network - http://www.grnet.gr