Update: When repairing the PG I get a different error:

osd.14 80.69.45.76:6813/4059849 27 : cluster [INF] 7.374 repair starts
osd.14 80.69.45.76:6813/4059849 28 : cluster [ERR] 7.374 recorded data digest 0xebbbfb83 != on disk 0x43d61c5d on 7/a29aab74/rbd_data.59cb9c679e2a9e3.0000000000003096/29c44
osd.14 80.69.45.76:6813/4059849 29 : cluster [ERR] repair 7.374 7/a29aab74/rbd_data.59cb9c679e2a9e3.0000000000003096/29c44 is an unexpected clone
osd.14 80.69.45.76:6813/4059849 30 : cluster [ERR] 7.374 repair stat mismatch, got 2110/2111 objects, 131/132 clones, 2110/2111 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 8304141312/8304264192 bytes, 0/0 hit_set_archive bytes.
osd.14 80.69.45.76:6813/4059849 31 : cluster [ERR] 7.374 repair 3 errors, 1 fixed
osd.14 80.69.45.76:6813/4059849 32 : cluster [INF] 7.374 deep-scrub starts
osd.14 80.69.45.76:6813/4059849 33 : cluster [ERR] deep-scrub 7.374 7/a29aab74/rbd_data.59cb9c679e2a9e3.0000000000003096/29c44 is an unexpected clone
osd.14 80.69.45.76:6813/4059849 34 : cluster [ERR] 7.374 deep-scrub 1 errors
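
Unless someone has a better idea, I would try to map the rbd prefix from the
object name back to an image by matching it against the block_name_prefix
shown by "rbd info", roughly like this (untested sketch; "mypool" is only a
placeholder for whatever pool sits behind pool id 7, and I assume the snap id
in the log is hex while "rbd snap ls" prints decimal):

=== 8< ===

# which image owns the prefix 59cb9c679e2a9e3?
for img in $(rbd -p mypool ls); do
    rbd -p mypool info "$img" | grep -q 'block_name_prefix: rbd_data.59cb9c679e2a9e3' \
        && echo "$img"
done

# convert the snap id from the log (hex) for comparison with rbd snap ls
printf '%d\n' 0x29c44
rbd -p mypool snap ls <image-found-above>

=========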
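
And if it really comes to the manual cleanup with ceph-objectstore-tool that I
asked about below, my rough plan would be the following (also untested; I am
not sure the hammer version of the tool already knows remove-clone-metadata,
or whether it wants the clone id in hex or decimal, and the data/journal paths
are just the defaults on this box):

=== 8< ===

# with osd.14 stopped:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-14 \
    --journal-path /var/lib/ceph/osd/ceph-14/journal \
    --pgid 7.374 --op list | grep 59cb9c679e2a9e3

# then feed the JSON object spec from that listing back in:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-14 \
    --journal-path /var/lib/ceph/osd/ceph-14/journal \
    '<json-from-listing>' remove-clone-metadata 29c44

=========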
Sorry for being so noisy on the list, but maybe someone can now recognize what
to do and give me a hint.

rgds.,
j

On 01.02.20 10:20, Andreas John wrote:
> Hello,
>
> for those stumbling upon a similar issue: I was able to mitigate the
> issue by setting
>
> === 8< ===
>
> [osd.14]
> osd_pg_max_concurrent_snap_trims = 0
>
> =========
>
> in ceph.conf. You don't need to restart the osd, the osd crash +
> systemd will do it for you :)
>
> Now the osd in question does no trimming anymore and thus stays up.
>
> Now I let the deep-scrubber run and keep my fingers crossed that it
> will clean up the mess.
>
> In case I need to clean up manually, could anyone give a hint how to
> find the rbd with that snap? The log says:
>
> 7faf8f716700 -1 log_channel(cluster) log [ERR] : trim_object Snap 29c44
> not in clones
>
> 1.) What is the 7faf8f716700 at the beginning of the log line? Is it a
> daemon id?
>
> 2.) About the snap "ID" 29c44: In the filesystem I see
>
> ...ceph-14/current/7.374_head/DIR_4/DIR_7/DIR_B/DIR_A/rbd\udata.59cb9c679e2a9e3.0000000000003096__29c44_A29AAB74__7
>
> Do I read it correctly that in PG 7.374 there is, with rbd prefix
> 59cb9c679e2a9e3, an object that ends with ..3096 and has snap ID
> 29c44? What does the part A29AAB74__7 mean?
>
> I was not able to find in the docs how the directory / filename is
> structured.
>
> Best Regards,
>
> j.
>
> On 31.01.20 16:04, Andreas John wrote:
>> Hello,
>>
>> in my cluster one OSD after the other dies until I recognized that it
>> was simply an "abort" in the daemon, probably caused by
>>
>> 2020-01-31 15:54:42.535930 7faf8f716700 -1 log_channel(cluster) log
>> [ERR] : trim_object Snap 29c44 not in clones
>>
>> Close to this msg I get a stacktrace:
>>
>> ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
>> 1: /usr/bin/ceph-osd() [0xb35f7d]
>> 2: (()+0x11390) [0x7f0fec74b390]
>> 3: (gsignal()+0x38) [0x7f0feab43428]
>> 4: (abort()+0x16a) [0x7f0feab4502a]
>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7f0feb48684d]
>> 6: (()+0x8d6b6) [0x7f0feb4846b6]
>> 7: (()+0x8d701) [0x7f0feb484701]
>> 8: (()+0x8d919) [0x7f0feb484919]
>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27e) [0xc3776e]
>> 10: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x10dd) [0x868cfd]
>> 11: (ReplicatedPG::repop_all_committed(ReplicatedPG::RepGather*)+0x80) [0x8690e0]
>> 12: (Context::complete(int)+0x9) [0x6c8799]
>> 13: (void ReplicatedBackend::sub_op_modify_reply<MOSDRepOpReply, 113>(std::tr1::shared_ptr<OpRequest>)+0x21b) [0xa5ae0b]
>> 14: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x15b) [0xa53edb]
>> 15: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x1cb) [0x84c78b]
>> 16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ef) [0x6966ff]
>> 17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x4e4) [0x696e14]
>> 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x71e) [0xc264fe]
>> 19: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc29950]
>> 20: (()+0x76ba) [0x7f0fec7416ba]
>> 21: (clone()+0x6d) [0x7f0feac1541d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.
>>
>> Yes, I know it's still hammer, I want to upgrade soon, but I want to
>> resolve that issue first. If I lose that PG, I don't worry.
>>
>> So: What is the best approach? Can I use something like
>> ceph-objectstore-tool ... <object> remove-clone-metadata <cloneid>? I
>> assume 29c44 is my object, but what's the clone id?
>>
>> Best regards,
>>
>> derjohn
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Andreas John
net-lab GmbH | Frankfurter Str. 99 | 63067 Offenbach
Geschaeftsfuehrer: Andreas John | AG Offenbach, HRB40832
Tel: +49 69 8570033-1 | Fax: -2 | http://www.net-lab.net
Facebook: https://www.facebook.com/netlabdotnet
Twitter: https://twitter.com/netlabdotnet
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx