Problem with OSDs that do not start

We had a major problem with our Ceph installation and have been unable to restore the cluster. It started two days ago, when the 6 OSDs of a specific node began being marked down by their peers while the same OSDs kept asking the mons to mark them up again. This flapping lasted for some hours (2016-09-09 ~20:00 to 2016-09-10 ~03:00), and eventually the OSDs were definitively marked down. Following that, probably due to the load of the cluster recovery (??), the master mon was marked down (2016-09-10 ~06:15). By the time we tried to restart the mons, the cluster was not in the best shape and had unfound objects.

Since then, the 6 OSDs of that node have been down, failing even to start as a service, and the cluster is stuck in recovery.

The specific question is this: of the 6 OSDs, it seems that most, if not all, of the problematic PGs are on osd.117 (mounted at /rados/rd0-19-03). When we try to restart it, we get this in the logs:

2016-09-11 01:45:51.071186 7f2c1ae2b880  0 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-osd, pid 9893
2016-09-11 01:45:51.111890 7f2c1ae2b880  0 filestore(/rados/rd0-19-03) backend generic (magic 0xef53)
2016-09-11 01:45:51.112727 7f2c1ae2b880  0 genericfilestorebackend(/rados/rd0-19-03) detect_features: FIEMAP ioctl is supported and appears to work
2016-09-11 01:45:51.112737 7f2c1ae2b880  0 genericfilestorebackend(/rados/rd0-19-03) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2016-09-11 01:45:51.119644 7f2c1ae2b880  0 genericfilestorebackend(/rados/rd0-19-03) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2016-09-11 01:45:51.121748 7f2c1ae2b880  0 filestore(/rados/rd0-19-03) limited size xattrs
2016-09-11 01:45:51.266109 7f2c1ae2b880  0 filestore(/rados/rd0-19-03) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2016-09-11 01:45:51.314951 7f2c1ae2b880  1 journal _open /dev/sde fd 19: 600092704768 bytes, block size 4096 bytes, directio = 1, aio = 1
2016-09-11 01:45:51.353130 7f2c1ae2b880  1 journal _open /dev/sde fd 19: 600092704768 bytes, block size 4096 bytes, directio = 1, aio = 1
2016-09-11 01:45:51.379729 7f2c1ae2b880  0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2016-09-11 01:45:51.400596 7f2c1ae2b880  0 osd.117 284624 load_pgs
2016-09-11 01:46:13.978924 7f2c1ae2b880 -1 osd/PGLog.cc: In function 'static void PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, const pg_info_t&, std::map<eversion_t, hobject_t>&, PGLog::IndexedLog&, pg_missing_t&, std::ostringstream&, std::set<std::basic_string<char> >*)' thread 7f2c1ae2b880 time 2016-09-11 01:46:13.976744
osd/PGLog.cc: 908: FAILED assert(last_e.version.version < e.version.version)

 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x76) [0xc0e746]
 2: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t, hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*)+0x1a04) [0x76fe14]
 3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x1e2) [0x7f6a02]
 4: (OSD::load_pgs()+0xac0) [0x6ab780]
 5: (OSD::init()+0x14da) [0x6aefca]
 6: (main()+0x2848) [0x633b38]
 7: (__libc_start_main()+0xf5) [0x7f2c1814cb45]
 8: /usr/bin/ceph-osd() [0x64d7c7]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
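
When several OSDs fail to start, it can help to pull the failed assert out of each OSD log mechanically and compare them. The following is only a minimal sketch: the function name `find_failed_asserts` and the regular expression are illustrative assumptions, written against the shape of the assert line quoted above, and are not part of any Ceph tooling.

```python
import re

# Matches lines shaped like the one in the log above:
#   osd/PGLog.cc: 908: FAILED assert(last_e.version.version < e.version.version)
# The pattern is an assumption based on that single example.
ASSERT_RE = re.compile(
    r"^\s*(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<cond>.*)\)\s*$"
)


def find_failed_asserts(log_text):
    """Return (source_file, line_number, condition) for each FAILED assert line."""
    hits = []
    for raw in log_text.splitlines():
        m = ASSERT_RE.match(raw)
        if m:
            hits.append((m.group("file"), int(m.group("line")), m.group("cond")))
    return hits


# Two lines copied from the log excerpt above.
sample = """\
2016-09-11 01:45:51.400596 7f2c1ae2b880  0 osd.117 284624 load_pgs
osd/PGLog.cc: 908: FAILED assert(last_e.version.version < e.version.version)
"""

for source_file, line_number, condition in find_failed_asserts(sample):
    print(source_file, line_number, condition)
```

Running the same function over the log of each of the 6 OSDs (wherever those logs are stored on the node) would show whether they all hit the same assert or fail for different reasons.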


It seems that the version of some object's log entry is causing the problem: the failed assert expects each entry's version to be greater than that of the entry before it. We have been unable to identify which object is involved, or how to recover or isolate it.
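
To make the failed condition concrete: `assert(last_e.version.version < e.version.version)` compares one entry against the entry read just before it. The sketch below only illustrates that ordering check on a plain list of integers. The function name `first_out_of_order` and the sample numbers are made up for illustration; the code does not read an OSD's on-disk PG log, and it is not how Ceph implements the check.

```python
def first_out_of_order(versions):
    """Return the index of the first entry whose version is not strictly
    greater than the previous entry's version, or None if all are ordered."""
    for i in range(1, len(versions)):
        # Same comparison as the assert: previous < current must hold.
        if not versions[i - 1] < versions[i]:
            return i
    return None


# Hypothetical version numbers, not taken from the cluster.
entries = [101, 102, 103, 103, 104]
print(first_out_of_order(entries))  # prints 3: the second 103 repeats a version
```

If the entries of the affected PG log could be dumped in order, a check of this kind would point at the first offending entry, and through it at the object that entry refers to.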

Any thoughts?

--
------------------------------------------------------------
Παναγιώτης Γκότσης                      pgotsis@xxxxxxxxxxxx
Μηχανικός Συστημάτων
Κέντρο Διαχείρισης Δικτύου
Εθνικό Δίκτυο Έρευνας και Τεχνολογίας - http://www.grnet.gr

Panayiotis Gotsis                       pgotsis@xxxxxxxxxxxx
System Engineer
Network Operations Center
Greek Research & Technology Network   - http://www.grnet.gr
