Hi all,

I've got a strange situation that I hope someone can help with. We have a backfill occurring that never completes: the destination OSD of the recovery predictably crashes. Outing the destination OSD so that another OSD takes over the backfill just causes a different OSD in the cluster to crash instead; boot, rinse and repeat.
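In case it's useful, this is roughly what I mean by "outing" the destination OSD and how I've been watching the backfill move on to its next victim (just a sketch; osd.35 is the current crasher, taken from the log_file path in the dump below, and the debug bump is only an example of how I could pull more detail out of it):

    # mark the crashing backfill destination out so another OSD is chosen
    ceph osd out 35

    # see which PG/OSD picks the backfill up next
    ceph health detail

    # example only: raise logging on the crashing OSD to capture the full assert
    ceph tell osd.35 injectargs '--debug_osd 20 --debug_ms 1'

The OSD id changes each time, of course, since whichever OSD becomes the new backfill target is the one that crashes next.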
The logs show:

--- begin dump of recent events ---
    -2> 2019-08-02 06:26:16.133337 7ff9fadf6700  5 -- 10.1.100.22:6808/3657777 >> 10.1.100.6:0/3789781062 conn(0x55d272342000 :6808 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=4238 cs=1 l=1). rx client.352388821 seq 74 0x55d2723ad740 osd_op(client.352388821.0:46698064 0.142e 0.b4eab42e (undecoded) ondisk+write+known_if_redirected e174744) v8
    -1> 2019-08-02 06:26:16.133367 7ff9fadf6700  1 -- 10.1.100.22:6808/3657777 <== client.352388821 10.1.100.6:0/3789781062 74 ==== osd_op(client.352388821.0:46698064 0.142e 0.b4eab42e (undecoded) ondisk+write+known_if_redirected e174744) v8 ==== 248+0+16384 (881189615 0 2173568771) 0x55d2723ad740 con 0x55d272342000
     0> 2019-08-02 06:26:16.185021 7ff9df594700 -1 *** Caught signal (Aborted) **
 in thread 7ff9df594700 thread_name:tp_osd_tp

 ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)
 1: (()+0xa59c94) [0x55d25900ec94]
 2: (()+0x110e0) [0x7ff9fe9a10e0]
 3: (gsignal()+0xcf) [0x7ff9fd968fff]
 4: (abort()+0x16a) [0x7ff9fd96a42a]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x55d2590573ee]
 6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo const&, std::shared_ptr<ObjectContext>, bool, ObjectStore::Transaction*)+0x1287) [0x55d258bad597]
 7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*, ObjectStore::Transaction*)+0x305) [0x55d258d3d6e5]
 8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x12e) [0x55d258d3d8fe]
 9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2e3) [0x55d258d4d723]
 10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x55d258c50ce0]
 11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x4f1) [0x55d258bb44a1]
 12: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ab) [0x55d258a21dcb]
 13: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x5a) [0x55d258cda97a]
 14: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x102d) [0x55d258a4fdbd]
 15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ef) [0x55d25905c0cf]
 16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d25905f3d0]
 17: (()+0x74a4) [0x7ff9fe9974a4]
 18: (clone()+0x3f) [0x7ff9fda1ed0f]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.35.log
--- end dump of recent events ---

Any help would be very much appreciated.

All the best,
Kevin

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com