Hi,
I'm seeing many issues with my Ceph installation. The cluster health is degraded and 11 of the 320 "in" OSDs are down.
# ceph -v
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
# ceph health
HEALTH_ERR 2002 pgs degraded; 14 pgs down; 180 pgs inconsistent; 14 pgs peering; 1 pgs stale; 2002 pgs stuck degraded; 14 pgs stuck inactive; 1 pgs stuck stale; 2320 pgs stuck unclean; 2002 pgs stuck undersized; 2002 pgs undersized; 100 requests are blocked > 32 sec; recovery 38033332/531925830 objects degraded (7.150%); recovery 48881596/531925830 objects misplaced (9.190%); 12623 scrub errors; 11/320 in osds are down; noout flag(s) set
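For readers who want the per-state PG counts out of a summary line like the one above, here is a minimal Python sketch. The function name `parse_health` and the shortened sample string are illustrative assumptions and not part of any Ceph tooling. It only handles the "N pgs <state>" items and ignores the rest.

```python
import re

def parse_health(summary: str) -> dict:
    """Split a `ceph health` summary line into its status and per-state PG counts."""
    status, _, rest = summary.partition(" ")
    pg_counts = {}
    for item in rest.split(";"):
        # Only items of the form "<count> pgs <state>" are kept;
        # entries such as "11/320 in osds are down" are skipped.
        m = re.match(r"\s*(\d+) pgs (.+)", item)
        if m:
            pg_counts[m.group(2).strip()] = int(m.group(1))
    return {"status": status, "pgs": pg_counts}

# Shortened sample taken from the output above.
sample = ("HEALTH_ERR 2002 pgs degraded; 14 pgs down; 180 pgs inconsistent; "
          "11/320 in osds are down; noout flag(s) set")
result = parse_health(sample)
print(result["status"], result["pgs"])
```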
The log for one of the down OSDs (osd.299) shows:
-5> 2016-02-05 19:10:45.294873 7fd4d58e4700 1 -- 10.31.0.3:6835/157558 --> 10.31.0.5:0/3796 -- osd_ping(ping_reply e144138 stamp 2016-02-05 19:10:45.286934) v2 -- ?+0 0x4359a00 con 0x2bc9ac60
-4> 2016-02-05 19:10:45.294915 7fd4d70e7700 1 -- 10.31.0.67:6835/157558 --> 10.31.0.5:0/3796 -- osd_ping(ping_reply e144138 stamp 2016-02-05 19:10:45.286934) v2 -- ?+0 0x27e21800 con 0x2bacd700
-3> 2016-02-05 19:10:45.341383 7fd4e2ea8700 0 filestore(/var/lib/ceph/osd/ceph-299) error (39) Directory not empty not handled on operation 0x12c88178 (6494115.0.1, or op 1, counting from 0)
-2> 2016-02-05 19:10:45.341477 7fd4e2ea8700 0 filestore(/var/lib/ceph/osd/ceph-299) ENOTEMPTY suggests garbage data in osd data dir
-1> 2016-02-05 19:10:45.341493 7fd4e2ea8700 0 filestore(/var/lib/ceph/osd/ceph-299) transaction dump:
{
"ops": [
{
"op_num": 0,
"op_name": "remove",
"collection": "70.532s3_head",
"oid": "532\/\/head\/\/70\/18446744073709551615\/3"
},
{
"op_num": 1,
"op_name": "rmcoll",
"collection": "70.532s3_head"
}
]
}
0> 2016-02-05 19:10:45.343794 7fd4e2ea8700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7fd4e2ea8700 time 2016-02-05 19:10:45.341673
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc60eb]
2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xa52) [0x923d12]
3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x92a3a4]
4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x16a) [0x92a52a]
5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb6b4e]
6: (ThreadPool::WorkThread::entry()+0x10) [0xbb7bf0]
7: (()+0x8182) [0x7fd4ef916182]
8: (clone()+0x6d) [0x7fd4ede8147d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.299.log
--- end dump of recent events ---
2016-02-05 19:10:45.441428 7fd4e2ea8700 -1 *** Caught signal (Aborted) **
in thread 7fd4e2ea8700
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: /usr/bin/ceph-osd() [0xacd7ba]
2: (()+0x10340) [0x7fd4ef91e340]
3: (gsignal()+0x39) [0x7fd4eddbdcc9]
4: (abort()+0x148) [0x7fd4eddc10d8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fd4ee6c8535]
6: (()+0x5e6d6) [0x7fd4ee6c66d6]
7: (()+0x5e703) [0x7fd4ee6c6703]
8: (()+0x5e922) [0x7fd4ee6c6922]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xbc62d8]
10: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xa52) [0x923d12]
11: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x92a3a4]
12: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x16a) [0x92a52a]
13: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb6b4e]
14: (ThreadPool::WorkThread::entry()+0x10) [0xbb7bf0]
15: (()+0x8182) [0x7fd4ef916182]
16: (clone()+0x6d) [0x7fd4ede8147d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- begin dump of recent events ---
-4> 2016-02-05 19:10:45.355813 7fd4d58e4700 1 -- 10.31.0.3:6835/157558 <== osd.1 10.31.0.101:0/197780 23431 ==== osd_ping(ping e144138 stamp 2016-02-05 19:10:45.344020) v2 ==== 47+0+0 (1893056775 0 0) 0x36782a00 con 0x2c6c8580
-3> 2016-02-05 19:10:45.355853 7fd4d58e4700 1 -- 10.31.0.3:6835/157558 --> 10.31.0.101:0/197780 -- osd_ping(ping_reply e144138 stamp 2016-02-05 19:10:45.344020) v2 -- ?+0 0x29702800 con 0x2c6c8580
-2> 2016-02-05 19:10:45.356076 7fd4d70e7700 1 -- 10.31.0.67:6835/157558 <== osd.1 10.31.0.101:0/197780 23431 ==== osd_ping(ping e144138 stamp 2016-02-05 19:10:45.344020) v2 ==== 47+0+0 (1893056775 0 0) 0x2cf84200 con 0x2bc9c260
-1> 2016-02-05 19:10:45.356627 7fd4d70e7700 1 -- 10.31.0.67:6835/157558 --> 10.31.0.101:0/197780 -- osd_ping(ping_reply e144138 stamp 2016-02-05 19:10:45.344020) v2 -- ?+0 0x2f5cae00 con 0x2bc9c260
0> 2016-02-05 19:10:45.441428 7fd4e2ea8700 -1 *** Caught signal (Aborted) **
in thread 7fd4e2ea8700
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: /usr/bin/ceph-osd() [0xacd7ba]
2: (()+0x10340) [0x7fd4ef91e340]
3: (gsignal()+0x39) [0x7fd4eddbdcc9]
4: (abort()+0x148) [0x7fd4eddc10d8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fd4ee6c8535]
6: (()+0x5e6d6) [0x7fd4ee6c66d6]
7: (()+0x5e703) [0x7fd4ee6c6703]
8: (()+0x5e922) [0x7fd4ee6c6922]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xbc62d8]
10: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xa52) [0x923d12]
11: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x92a3a4]
12: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x16a) [0x92a52a]
13: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb6b4e]
14: (ThreadPool::WorkThread::entry()+0x10) [0xbb7bf0]
15: (()+0x8182) [0x7fd4ef916182]
16: (clone()+0x6d) [0x7fd4ede8147d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.299.log
-------------------------
The logs on the other down OSDs are similar. Would the procedure described in http://tracker.ceph.com/issues/12428 be the best way to repair these OSDs?
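To check which OSDs actually show the same failure, something like the following Python sketch could scan the OSD logs for the ENOTEMPTY filestore message. The helper names `osds_with_enotempty` and `scan_log_dir` are hypothetical, and the log directory is assumed from the `log_file` path in the dump above. It only reads logs and does not attempt any repair.

```python
import re
from pathlib import Path

# Matches the filestore message from the log above and captures the OSD id.
PATTERN = re.compile(
    r"filestore\(/var/lib/ceph/osd/ceph-(\d+)\) ENOTEMPTY suggests garbage data"
)

def osds_with_enotempty(lines):
    """Return the sorted OSD ids whose log lines contain the ENOTEMPTY message."""
    ids = set()
    for line in lines:
        m = PATTERN.search(line)
        if m:
            ids.add(int(m.group(1)))
    return sorted(ids)

def scan_log_dir(log_dir="/var/log/ceph"):
    """Scan ceph-osd.*.log files under log_dir (assumed location), one file at a time."""
    ids = set()
    for path in Path(log_dir).glob("ceph-osd.*.log"):
        with open(path, errors="replace") as fh:
            ids.update(osds_with_enotempty(fh))
    return sorted(ids)

# Two lines shaped like the ones in the log above (the second is abbreviated).
sample = [
    "-2> 2016-02-05 19:10:45.341477 7fd4e2ea8700 0 "
    "filestore(/var/lib/ceph/osd/ceph-299) ENOTEMPTY suggests garbage data in osd data dir",
    "-4> 2016-02-05 19:10:45.294915 7fd4d70e7700 1 -- osd_ping(ping_reply e144138) v2",
]
print(osds_with_enotempty(sample))
```

Running `scan_log_dir()` on each storage node would list the affected OSD ids on that node.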
Thanks,
Jeff
Jeffrey McDonald, PhD
Assistant Director for HPC Operations
Minnesota Supercomputing Institute
University of Minnesota Twin Cities
599 Walter Library
117 Pleasant St SE
Minneapolis, MN 55455
email: jeffrey.mcdonald@xxxxxxxxxxx
phone: +1 612 625-6905
fax: +1 612 624-8861