Hi,

Looks like one of my colleagues increased the PG number before it finished. I was flushing the whole cache tier and it's
currently stuck at ~80 GB of data because of the OSD crashes. I will look into the hitset counts and check what can be done.
Will provide an update if I find anything or fix the issue.

On Wed, Nov 23, 2016 at 12:04 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> Hi Daznis,
>
> I'm not sure how much help I can be, but I will try my best.
>
> I think the post-split stats error is probably benign, although I think this suggests you also increased the number of PGs in
> your cache pool? If so, did you do this before or after you added the extra OSDs? This may have been the cause.
>
> On to the actual assert: this looks like it's part of the code which trims the tiering hit sets. I don't understand why it's
> crashing out, but I would imagine it's related to an invalid or missing hitset.
>
> https://github.com/ceph/ceph/blob/v0.94.9/src/osd/ReplicatedPG.cc#L10485
>
> The only thing I can think of from looking at the code is that the function loops through all hitsets above the maximum
> number (hit_set_count). I wonder whether setting this number higher would stop it from trying to trim any hitsets and let
> things recover?
>
> DISCLAIMER
> This is a hunch; it might not work or could possibly even make things worse. Otherwise, wait for someone with a better idea
> to comment.
>
> Nick
>
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Daznis
>> Sent: 23 November 2016 05:57
>> To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>> Subject: Ceph strange issue after adding a cache OSD.
>>
>> Hello,
>>
>> The story goes like this.
>> I have added another 3 drives to the caching layer. OSDs were added to the crush map one by one after each successful
>> rebalance. When I added the last OSD and went away for about an hour, I noticed that it still hadn't finished rebalancing.
>> Further investigation showed that one of the older cache SSDs was restarting like crazy before it could fully boot. So I
>> shut it down and waited for a rebalance without that OSD. Less than an hour later I had another 2 OSDs restarting like
>> crazy. I tried running scrubs on the PGs the logs asked me to, but that did not help. I'm currently stuck with "8 scrub
>> errors" and a completely dead cluster.
>>
>> log_channel(cluster) log [WRN] : pg 15.8d has invalid (post-split) stats; must scrub before tier agent can activate
>>
>> I need help to stop the OSDs from crashing.
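
(For reference, the scrubs mentioned above were run per that warning, with something along these lines for each PG it
complained about; I'm quoting these from memory, so treat them as a sketch of what was run rather than the exact commands:

ceph pg scrub 15.8d
ceph pg deep-scrub 15.8d

As noted, they did not help.)
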
>> Crash log:
>>
>>      0> 2016-11-23 06:41:43.365602 7f935b4eb700 -1
>> osd/ReplicatedPG.cc: In function 'void
>> ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
>> thread 7f935b4eb700 time 2016-11-23 06:41:43.363067
>> osd/ReplicatedPG.cc: 10521: FAILED assert(obc)
>>
>> ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbde2c5]
>> 2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)+0x75f) [0x87e89f]
>> 3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f8bb]
>> 4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a11aa]
>> 5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x83c37a]
>> 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x69af05]
>> 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x333) [0x69b473]
>> 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbcd9cf]
>> 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcfb00]
>> 10: (()+0x7dc5) [0x7f93b9df4dc5]
>> 11: (clone()+0x6d) [0x7f93b88d5ced]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> I have tried looking with full debug enabled, but those logs didn't help me much. I have tried to evict the cache layer,
>> but some objects are stuck and can't be removed. Any suggestions would be greatly appreciated.
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
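
P.S. The hitset check mentioned at the top will start with something along these lines; the pool name is a placeholder, and I
haven't verified yet whether raising the count is actually safe, so treat this as a rough sketch of Nick's hunch rather than a
tested fix:

ceph osd pool get <cache-pool> hit_set_count
ceph osd pool get <cache-pool> hit_set_period
# only if raising it still looks sensible after checking:
ceph osd pool set <cache-pool> hit_set_count <higher-value>

The eviction attempt mentioned earlier was along the lines of "rados -p <cache-pool> cache-flush-evict-all"; that is where the
stuck objects showed up.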