Re: CEPH health issues

You need to get your OSDs back online.

 


From: "Jeffrey McDonald" <jmcdonal@xxxxxxx>
To: ceph-users@xxxxxxxxxxxxxx
Sent: Saturday, February 6, 2016 8:18:06 AM
Subject: CEPH health issues

Hi, 
I'm seeing lots of issues with my CEPH installation. The health of the system is degraded and many of the OSDs are down.

# ceph -v
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)

# ceph health
HEALTH_ERR 2002 pgs degraded; 14 pgs down; 180 pgs inconsistent; 14 pgs peering; 1 pgs stale; 2002 pgs stuck degraded; 14 pgs stuck inactive; 1 pgs stuck stale; 2320 pgs stuck unclean; 2002 pgs stuck undersized; 2002 pgs undersized; 100 requests are blocked > 32 sec; recovery 38033332/531925830 objects degraded (7.150%); recovery 48881596/531925830 objects misplaced (9.190%); 12623 scrub errors; 11/320 in osds are down; noout flag(s) set
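A one-line summary like this is easier to read once it is split into its items. The following is a minimal Python sketch, assuming the semicolon-separated format shown above for this release; the function names and the shortened SAMPLE string are illustrative and not part of any Ceph tool.

```python
import re

# Shortened copy of the summary above, used only as sample input.
SAMPLE = (
    "HEALTH_ERR 2002 pgs degraded; 14 pgs down; 180 pgs inconsistent; "
    "12623 scrub errors; 11/320 in osds are down; noout flag(s) set"
)


def parse_health(summary):
    """Split a one-line `ceph health` summary into (status, list of items)."""
    status, _, rest = summary.strip().partition(" ")
    items = [part.strip() for part in rest.split(";") if part.strip()]
    return status, items


def pg_state_counts(items):
    """Return {state: count} for items shaped like '<N> pgs <state>'."""
    counts = {}
    for item in items:
        match = re.fullmatch(r"(\d+) pgs (.+)", item)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def down_osds(items):
    """Return (down, total_in) from an '<a>/<b> in osds are down' item, or None."""
    for item in items:
        match = re.fullmatch(r"(\d+)/(\d+) in osds are down", item)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


if __name__ == "__main__":
    status, items = parse_health(SAMPLE)
    print(status, pg_state_counts(items), down_osds(items))
```

Run against the full summary, this separates the PG state counts from the "in osds are down" item, which is the one to address first.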

The log for one of the down OSDs (osd.299) shows:

    -5> 2016-02-05 19:10:45.294873 7fd4d58e4700  1 -- 10.31.0.3:6835/157558 --> 10.31.0.5:0/3796 -- osd_ping(ping_reply e144138 stamp 2016-02-05 19:10:45.286934) v2 -- ?+0 0x4359a00 con 0x2bc9ac60
    -4> 2016-02-05 19:10:45.294915 7fd4d70e7700  1 -- 10.31.0.67:6835/157558 --> 10.31.0.5:0/3796 -- osd_ping(ping_reply e144138 stamp 2016-02-05 19:10:45.286934) v2 -- ?+0 0x27e21800 con 0x2bacd700
    -3> 2016-02-05 19:10:45.341383 7fd4e2ea8700  0 filestore(/var/lib/ceph/osd/ceph-299)  error (39) Directory not empty not handled on operation 0x12c88178 (6494115.0.1, or op 1, counting from 0)
    -2> 2016-02-05 19:10:45.341477 7fd4e2ea8700  0 filestore(/var/lib/ceph/osd/ceph-299) ENOTEMPTY suggests garbage data in osd data dir
    -1> 2016-02-05 19:10:45.341493 7fd4e2ea8700  0 filestore(/var/lib/ceph/osd/ceph-299)  transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "remove",
            "collection": "70.532s3_head",
            "oid": "532\/\/head\/\/70\/18446744073709551615\/3"
        },
        {
            "op_num": 1,
            "op_name": "rmcoll",
            "collection": "70.532s3_head"
        }
    ]
}

     0> 2016-02-05 19:10:45.343794 7fd4e2ea8700 -1 os/FileStore.cc: In function 'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7fd4e2ea8700 time 2016-02-05 19:10:45.341673
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc60eb]
 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xa52) [0x923d12]
 3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x92a3a4]
 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x16a) [0x92a52a]
 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb6b4e]
 6: (ThreadPool::WorkThread::entry()+0x10) [0xbb7bf0]
 7: (()+0x8182) [0x7fd4ef916182]
 8: (clone()+0x6d) [0x7fd4ede8147d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.299.log
--- end dump of recent events ---
2016-02-05 19:10:45.441428 7fd4e2ea8700 -1 *** Caught signal (Aborted) **
 in thread 7fd4e2ea8700

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: /usr/bin/ceph-osd() [0xacd7ba]
 2: (()+0x10340) [0x7fd4ef91e340]
 3: (gsignal()+0x39) [0x7fd4eddbdcc9]
 4: (abort()+0x148) [0x7fd4eddc10d8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fd4ee6c8535]
 6: (()+0x5e6d6) [0x7fd4ee6c66d6]
 7: (()+0x5e703) [0x7fd4ee6c6703]
 8: (()+0x5e922) [0x7fd4ee6c6922]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xbc62d8]
 10: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xa52) [0x923d12]
 11: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x92a3a4]
 12: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x16a) [0x92a52a]
 13: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb6b4e]
 14: (ThreadPool::WorkThread::entry()+0x10) [0xbb7bf0]
 15: (()+0x8182) [0x7fd4ef916182]
 16: (clone()+0x6d) [0x7fd4ede8147d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
    -4> 2016-02-05 19:10:45.355813 7fd4d58e4700  1 -- 10.31.0.3:6835/157558 <== osd.1 10.31.0.101:0/197780 23431 ==== osd_ping(ping e144138 stamp 2016-02-05 19:10:45.344020) v2 ==== 47+0+0 (1893056775 0 0) 0x36782a00 con 0x2c6c8580
    -3> 2016-02-05 19:10:45.355853 7fd4d58e4700  1 -- 10.31.0.3:6835/157558 --> 10.31.0.101:0/197780 -- osd_ping(ping_reply e144138 stamp 2016-02-05 19:10:45.344020) v2 -- ?+0 0x29702800 con 0x2c6c8580
    -2> 2016-02-05 19:10:45.356076 7fd4d70e7700  1 -- 10.31.0.67:6835/157558 <== osd.1 10.31.0.101:0/197780 23431 ==== osd_ping(ping e144138 stamp 2016-02-05 19:10:45.344020) v2 ==== 47+0+0 (1893056775 0 0) 0x2cf84200 con 0x2bc9c260
    -1> 2016-02-05 19:10:45.356627 7fd4d70e7700  1 -- 10.31.0.67:6835/157558 --> 10.31.0.101:0/197780 -- osd_ping(ping_reply e144138 stamp 2016-02-05 19:10:45.344020) v2 -- ?+0 0x2f5cae00 con 0x2bc9c260
     0> 2016-02-05 19:10:45.441428 7fd4e2ea8700 -1 *** Caught signal (Aborted) **
 in thread 7fd4e2ea8700

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: /usr/bin/ceph-osd() [0xacd7ba]
 2: (()+0x10340) [0x7fd4ef91e340]
 3: (gsignal()+0x39) [0x7fd4eddbdcc9]
 4: (abort()+0x148) [0x7fd4eddc10d8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fd4ee6c8535]
 6: (()+0x5e6d6) [0x7fd4ee6c66d6]
 7: (()+0x5e703) [0x7fd4ee6c6703]
 8: (()+0x5e922) [0x7fd4ee6c6922]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xbc62d8]
 10: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xa52) [0x923d12]
 11: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x92a3a4]
 12: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x16a) [0x92a52a]
 13: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb6b4e]
 14: (ThreadPool::WorkThread::entry()+0x10) [0xbb7bf0]
 15: (()+0x8182) [0x7fd4ef916182]
 16: (clone()+0x6d) [0x7fd4ede8147d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.299.log

-------------------------


The logs on the other down OSDs look similar. Would the procedure in http://tracker.ceph.com/issues/12428 be the best way to repair them?
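To check how many OSD logs show the same failure, one option is to scan each log for the three markers in the dump above: the ENOTEMPTY filestore message, the "unexpected error" assert, and the collection named in the rmcoll op. The following is a rough Python sketch under that assumption. The function name and regular expressions are illustrative, written against the log text pasted here, and a match only means the log looks alike; it does not confirm the cause described in the tracker issue.

```python
import re

# Patterns taken from the log excerpt in this message (assumed to be stable
# across the affected OSDs; adjust if the wording differs on another release).
ENOTEMPTY_RE = re.compile(
    r"filestore\((?P<path>[^)]+)\)\s+ENOTEMPTY suggests garbage data in osd data dir"
)
ASSERT_RE = re.compile(r'FAILED assert\(0 == "unexpected error"\)')
RMCOLL_RE = re.compile(
    r'"op_name":\s*"rmcoll",\s*"collection":\s*"(?P<coll>[^"]+)"'
)


def classify_osd_log(text):
    """Report which crash markers appear in the text of one OSD log."""
    enotempty = ENOTEMPTY_RE.search(text)
    rmcoll = RMCOLL_RE.search(text)
    return {
        "data_dir": enotempty.group("path") if enotempty else None,
        "assert_hit": ASSERT_RE.search(text) is not None,
        "rmcoll_collection": rmcoll.group("coll") if rmcoll else None,
    }


if __name__ == "__main__":
    import sys

    # Usage: python scan_osd_logs.py /var/log/ceph/ceph-osd.*.log
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            print(path, classify_osd_log(handle.read()))
```

For the log above, the fields of interest would be the data directory of osd.299 and the collection 70.532s3_head that the failed rmcoll was trying to remove.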

Thanks,
Jeff




--
Jeffrey McDonald, PhD
Assistant Director for HPC Operations
Minnesota Supercomputing Institute
University of Minnesota Twin Cities
599 Walter Library           email: jeffrey.mcdonald@xxxxxxxxxxx
117 Pleasant St SE           phone: +1 612 625-6905
Minneapolis, MN 55455        fax:   +1 612 624-8861


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
