Hello Cephers!

Trying to repair an inconsistent PG results in the OSD dying with an assertion failure:

     0> 2015-12-01 07:22:13.398006 7f76d6594700 -1 osd/SnapMapper.cc: In function 'int SnapMapper::get_snaps(const hobject_t&, SnapMapper::object_snaps*)' thread 7f76d6594700 time 2015-12-01 07:22:13.394900
osd/SnapMapper.cc: 153: FAILED assert(!out->snaps.empty())

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc60eb]
 2: (SnapMapper::get_snaps(hobject_t const&, SnapMapper::object_snaps*)+0x40c) [0x72aecc]
 3: (SnapMapper::get_snaps(hobject_t const&, std::set<snapid_t, std::less<snapid_t>, std::allocator<snapid_t> >*)+0xa2) [0x72b062]
 4: (PG::_scan_snaps(ScrubMap&)+0x454) [0x7f2f84]
 5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x218) [0x7f3ba8]
 6: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x480) [0x7f9da0]
 7: (PG::scrub(ThreadPool::TPHandle&)+0x2ee) [0x7fb48e]
 8: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x6cdbf9]
 9: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb6b4e]
 10: (ThreadPool::WorkThread::entry()+0x10) [0xbb7bf0]
 11: (()+0x8182) [0x7f76fe072182]
 12: (clone()+0x6d) [0x7f76fc5dd47d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.339.log
--- end dump of recent events ---

2015-12-01 07:22:13.476525 7f76d6594700 -1 *** Caught signal (Aborted) **
 in thread 7f76d6594700

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: /usr/bin/ceph-osd() [0xacd7ba]
 2: (()+0x10340) [0x7f76fe07a340]
 3: (gsignal()+0x39) [0x7f76fc519cc9]
 4: (abort()+0x148) [0x7f76fc51d0d8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f76fce24535]
 6: (()+0x5e6d6) [0x7f76fce226d6]
 7: (()+0x5e703) [0x7f76fce22703]
 8: (()+0x5e922) [0x7f76fce22922]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xbc62d8]
 10: (SnapMapper::get_snaps(hobject_t const&, SnapMapper::object_snaps*)+0x40c) [0x72aecc]
 11: (SnapMapper::get_snaps(hobject_t const&, std::set<snapid_t, std::less<snapid_t>, std::allocator<snapid_t> >*)+0xa2) [0x72b062]
 12: (PG::_scan_snaps(ScrubMap&)+0x454) [0x7f2f84]
 13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x218) [0x7f3ba8]
 14: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x480) [0x7f9da0]
 15: (PG::scrub(ThreadPool::TPHandle&)+0x2ee) [0x7fb48e]
 16: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x6cdbf9]
 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb6b4e]
 18: (ThreadPool::WorkThread::entry()+0x10) [0xbb7bf0]
 19: (()+0x8182) [0x7f76fe072182]
 20: (clone()+0x6d) [0x7f76fc5dd47d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
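For completeness, the repair is kicked off the usual way and the OSD aborts a few seconds after the scrub starts (the pg id below is a placeholder; the log file matches the log_file shown above):

    ceph health detail | grep inconsistent   # find the inconsistent pg
    ceph pg repair <pgid>                    # schedules a repair scrub; the primary then dies as above
    tail -f /var/log/ceph/ceph-osd.339.log   # watch the assert fire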
--- begin dump of recent events ---
    -4> 2015-12-01 07:22:13.403280 7f76e4db1700  1 -- 10.9.246.104:6887/8548 <== osd.109 10.9.245.204:0/3407 13 ==== osd_ping(ping e320057 stamp 2015-12-01 07:22:13.399779) v2 ==== 47+0+0 (1340520147 0 0) 0x22456800 con 0x22340b00
    -3> 2015-12-01 07:22:13.403313 7f76e4db1700  1 -- 10.9.246.104:6887/8548 --> 10.9.245.204:0/3407 -- osd_ping(ping_reply e320057 stamp 2015-12-01 07:22:13.399779) v2 -- ?+0 0x23e3be00 con 0x22340b00
    -2> 2015-12-01 07:22:13.403365 7f76e35ae700  1 -- 10.9.246.104:6883/8548 <== osd.109 10.9.245.204:0/3407 13 ==== osd_ping(ping e320057 stamp 2015-12-01 07:22:13.399779) v2 ==== 47+0+0 (1340520147 0 0) 0x22457600 con 0x22570d60
    -1> 2015-12-01 07:22:13.403405 7f76e35ae700  1 -- 10.9.246.104:6883/8548 --> 10.9.245.204:0/3407 -- osd_ping(ping_reply e320057 stamp 2015-12-01 07:22:13.399779) v2 -- ?+0 0x23e3fe00 con 0x22570d60
     0> 2015-12-01 07:22:13.476525 7f76d6594700 -1 *** Caught signal (Aborted) **
 in thread 7f76d6594700
 [... same backtrace and logging levels as above, snipped ...]
--- end dump of recent events ---

2015-12-01 07:22:13.889279 7f0be9daf900  0 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43), process ceph-osd, pid 12810
2015-12-01 07:22:13.904298 7f0be9daf900  0 filestore(/var/lib/ceph/osd/ceph-339) backend xfs (magic 0x58465342)

Since the assert mentioned snapshots, I generously deleted some and re-ran the repair. This took two other boxes out of operation, with kernel hung-task warnings for ceph-osd processes stuck in xfs_fs_sync and a load of ~10000. Thankfully, power-cycling them was enough.
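The deletions themselves were nothing fancy, along these lines (pool/image/snapshot names are placeholders, and I'm assuming rbd-style snapshots for the example):

    rbd -p <pool> snap ls <image>          # list snapshots of a suspect image
    rbd -p <pool> snap rm <image>@<snap>   # delete one
    ceph pg repair <pgid>                  # then kick the repair again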
osd.339 is now much more chatty before dying: http://www.traced.net/u/toasta/tmp/ceph-osd.339.log.txt

How do I get this PG to cooperate again? Is it safe to just delete it from the filesystem and let it recover from one of the replicas? (A sketch of what I have in mind is in the P.S. below.)

Thanks in advance,
Benedikt
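P.S. To be concrete about what I mean by deleting it from the filesystem: if I understand ceph-objectstore-tool correctly, it would be something like the following on the primary (pgid is a placeholder; data/journal paths match our filestore layout; exporting first as a safety net):

    stop ceph-osd id=339                                          # upstart here; adjust for your init system
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-339 \
        --journal-path /var/lib/ceph/osd/ceph-339/journal \
        --pgid <pgid> --op export --file /root/pg.<pgid>.export   # keep a copy, just in case
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-339 \
        --journal-path /var/lib/ceph/osd/ceph-339/journal \
        --pgid <pgid> --op remove                                 # drop the local copy of the pg
    start ceph-osd id=339                                         # let it backfill from a replica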