Re: OSD crashed while repairing inconsistent PG luminous

On Tue, Oct 17, 2017 at 9:51 AM Ana Aviles <ana@xxxxxxxxxxxx> wrote:
Hello all,

We had an inconsistent PG on our cluster. While performing a PG repair
operation, the OSD crashed. The OSD was not able to start again, and
there was no hardware failure on the disk itself. This is the log output:

2017-10-17 17:48:55.771384 7f234930d700 -1 log_channel(cluster) log [ERR] : 2.2fc repair 1 missing, 0 inconsistent objects
2017-10-17 17:48:55.771417 7f234930d700 -1 log_channel(cluster) log [ERR] : 2.2fc repair 3 errors, 1 fixed
2017-10-17 17:48:56.047896 7f234930d700 -1 /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: In function 'virtual void PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&, ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7f234930d700 time 2017-10-17 17:48:55.924115
/build/ceph-12.2.1/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())

Hmm. The OSD got a push op containing a snapshot it doesn't think should exist. I also see that there's a comment "// hmm, should we warn?" on that assert.
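For anyone following along, here is a minimal, self-contained sketch of the check that assert is guarding. This is not the actual Ceph source; a plain std::map stands in for the SnapSet's clone_snaps, and the clone ids are made up. The point is just that recovery of a clone object looks up the clone's snap id in clone_snaps, and the process aborts if the entry is missing:

#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using snapid_t = uint64_t;

int main() {
    // clone_snaps: clone id -> snaps covered by that clone (per the object's SnapSet)
    std::map<snapid_t, std::vector<snapid_t>> clone_snaps = {{4, {3, 4}}};

    // clone id carried by the incoming push op; 7 is not in the map
    snapid_t pushed_clone = 7;

    auto p = clone_snaps.find(pushed_clone);
    assert(p != clone_snaps.end());  // analogous to PrimaryLogPG.cc:358 firing
    return 0;
}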

Can you take a full log with "debug osd = 20" set, post it with ceph-post-file, and create a ticket on tracker.ceph.com?
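For reference, a rough sketch of how to capture that (the OSD id and log path are placeholders and depend on your deployment): raise the debug level in ceph.conf on the affected host, restart the OSD so it logs the crash verbosely, then upload the log.

# /etc/ceph/ceph.conf on the host with the crashing OSD
[osd]
debug osd = 20

# restart the OSD (it will likely crash again, but with verbose logging),
# then upload the resulting log:
ceph-post-file /var/log/ceph/ceph-osd.<id>.log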

Are all your OSDs running that same version?
-Greg
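(If it helps, one quick way to check daemon versions on a Luminous cluster is something like the following; down OSDs won't answer the tell, and the exact output format may vary.)

ceph versions            # per-daemon version summary for mons, mgrs, osds
ceph tell osd.* version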
 

 ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x56236c8ff3f2]
 2: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0xd63) [0x56236c476213]
 3: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp const&, PullOp*, std::__cxx11::list<ReplicatedBackend::pull_complete_info, std::allocator<ReplicatedBackend::pull_complete_info> >*, ObjectStore::Transaction*)+0x693) [0x56236c60d4d3]
 4: (ReplicatedBackend::_do_pull_response(boost::intrusive_ptr<OpRequest>)+0x2b5) [0x56236c60dd75]
 5: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x20c) [0x56236c61196c]
 6: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x56236c521aa0]
 7: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x55d) [0x56236c48662d]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3a9) [0x56236c3091a9]
 9: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x56236c5a2ae7]
 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x130e) [0x56236c3307de]
 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x884) [0x56236c9041e4]
 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x56236c907220]
 13: (()+0x76ba) [0x7f2366be96ba]
 14: (clone()+0x6d) [0x7f2365c603dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Thanks!

Ana


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com