Hi,

Yes, I am aware that Jewel is EOL :-)

On a Jewel cluster I'm seeing OSDs crash shortly after they start with
something similar to this issue: http://tracker.ceph.com/issues/15017

This cluster is running Jewel 10.2.11 and I'm seeing exactly the same
crash happening on Placement Groups which belong to cache tiering pools
(57 and 65).

Looking at it with GDB it crashes in osd/ReplicatedPG.cc on this line:

  last_clone_oid.snap = ctx->new_snapset.clone_overlap.rbegin()->first;

I am not very familiar with the snapshotting and PG mechanisms, and
before I attempt an upgrade to Luminous I would rather debug this first.
(My rough guess at what is going on is in the P.S. below.)

    -3> 2019-01-09 16:37:13.002666 7f1dcb7b3700 10 osd.6 pg_epoch: 612210 pg[65.243( v 612202'563066013 lc 611312'563065968 (609463'563062792,612202'563066013] local-les=612210 n=11511 ec=149533 les/c/f 612210/611807/0 612208/612209/612209) [6,786]/[6,928] r=0 lpr=612209 pi=609825-612208/51 bft=786 crt=612202'563066013 lcod 0'0 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped NIBBLEWISE m=17] do_osd_op delete
    -2> 2019-01-09 16:37:13.002684 7f1dcb7b3700 20 osd.6 pg_epoch: 612210 pg[65.243( v 612202'563066013 lc 611312'563065968 (609463'563062792,612202'563066013] local-les=612210 n=11511 ec=149533 les/c/f 612210/611807/0 612208/612209/612209) [6,786]/[6,928] r=0 lpr=612209 pi=609825-612208/51 bft=786 crt=612202'563066013 lcod 0'0 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped NIBBLEWISE m=17] _delete_oid setting whiteout on 65:c2607a60:::rbd_data.e53c3c27c0089c.000000000000009e:head
    -1> 2019-01-09 16:37:13.002705 7f1dcb7b3700 20 osd.6 pg_epoch: 612210 pg[65.243( v 612202'563066013 lc 611312'563065968 (609463'563062792,612202'563066013] local-les=612210 n=11511 ec=149533 les/c/f 612210/611807/0 612208/612209/612209) [6,786]/[6,928] r=0 lpr=612209 pi=609825-612208/51 bft=786 crt=612202'563066013 lcod 0'0 mlcod 0'0 active+recovery_wait+undersized+degraded+remapped NIBBLEWISE m=17] make_writeable 65:c2607a60:::rbd_data.e53c3c27c0089c.000000000000009e:head snapset=0x7f1e06349bb8 snapc=a4=[]
     0> 2019-01-09 16:37:13.005749 7f1dcb7b3700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f1dcb7b3700 thread_name:tp_osd_tp

 ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e)
 1: (()+0x9f1c2a) [0x7f1deed72c2a]
 2: (()+0xf100) [0x7f1decab5100]
 3: (()+0x7537a) [0x7f1deb99137a]
 4: (ReplicatedPG::make_writeable(ReplicatedPG::OpContext*)+0x138) [0x7f1dee90ec28]
 5: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x56b) [0x7f1dee9106db]
 6: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x920) [0x7f1dee911110]
 7: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x2843) [0x7f1dee915083]
 8: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x747) [0x7f1dee8d0b67]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x41d) [0x7f1dee780cdd]
 10: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x6d) [0x7f1dee780f2d]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x869) [0x7f1dee784a09]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x887) [0x7f1deee61b07]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f1deee63a70]
 14: (()+0x7dc5) [0x7f1decaaddc5]
 15: (clone()+0x6d) [0x7f1deb138ced]

Has anybody seen this before and does anybody know how to fix it? Right
now there are multiple PGs in the cache tier which are down and I'm
trying to get them back up.

Thanks,

Wido
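
P.S.: For what it's worth, my guess (and it is only a guess) is that
new_snapset.clone_overlap is empty for this head object. rbegin() on an
empty map is the same as rend(), and dereferencing it is undefined
behaviour, which would fit a segfault on exactly that line. A minimal
stand-alone sketch of that assumption, with a plain std::map standing in
for SnapSet::clone_overlap:

#include <cstdio>
#include <map>

int main() {
    // Stand-in for SnapSet::clone_overlap; the real type is more involved,
    // but only the empty-map behaviour matters for this sketch.
    std::map<long, int> clone_overlap;

    if (clone_overlap.empty()) {
        // rbegin() on an empty map equals rend(); dereferencing it is
        // undefined behaviour and can surface as a segfault like the one above.
        std::printf("clone_overlap empty: rbegin() must not be dereferenced\n");
    } else {
        long last_clone = clone_overlap.rbegin()->first;  // safe only when non-empty
        std::printf("last clone: %ld\n", last_clone);
    }
    return 0;
}

If that is what is happening, the real question is how the SnapSet for
that object ended up without any clone_overlap entries in the first
place, so please treat this as a hypothesis rather than a diagnosis.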