Re: Ceph strange issue after adding a cache OSD.

I think it's because of these errors:

2016-11-25 14:51:25.644495 7fb73eef8700 -1 log_channel(cluster) log
[ERR] : 14.28 deep-scrub stat mismatch, got 145/144 objects, 0/0
clones, 57/57 dirty, 0/0 omap, 54/53 hit_set_archive, 0/0 whiteouts,
365399477/365399252 bytes,51328/51103 hit_set_archive bytes.

2016-11-25 14:55:56.529405 7f89bae5a700 -1 log_channel(cluster) log
[ERR] : 13.dd deep-scrub stat mismatch, got 149/148 objects, 0/0
clones, 55/55 dirty, 0/0 omap, 63/61 hit_set_archive, 0/0 whiteouts,
360765725/360765503 bytes,55581/54097 hit_set_archive bytes.

I have no clue why they appeared. The cluster had been running fine for
months, so I have no logs showing how it happened; I only enabled
detailed logging after the "shit hit the fan".


On Fri, Nov 25, 2016 at 12:26 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> Possibly. Do you know the exact steps to reproduce? I'm guessing the PG splitting was the cause, but it's unclear whether that on its own triggers the problem or whether it also needs new OSDs introduced at the same time, which might make tracing the cause hard.
>
>> -----Original Message-----
>> From: Daznis [mailto:daznis@xxxxxxxxx]
>> Sent: 24 November 2016 19:44
>> To: Nick Fisk <nick@xxxxxxxxxx>
>> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>> Subject: Re:  Ceph strange issue after adding a cache OSD.
>>
>> I will try it, but I want to see if it stays stable for a few days first. Not sure if I should report this bug or not.
>>
>> On Thu, Nov 24, 2016 at 6:05 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> > Can you add them with different IDs? It won't look pretty, but it might get you out of this situation.
>> >
>> >> -----Original Message-----
>> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
>> >> Of Daznis
>> >> Sent: 24 November 2016 15:43
>> >> To: Nick Fisk <nick@xxxxxxxxxx>
>> >> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>> >> Subject: Re:  Ceph strange issue after adding a cache OSD.
>> >>
>> >> Yes, unfortunately, it is. And the story still continues. I have
>> >> noticed that only 4 OSDs are doing this, and zapping and re-adding
>> >> them does not solve the issue. Removing them completely from the
>> >> cluster does solve it, but I can't reuse their IDs. If I add another
>> >> OSD with the same ID, it starts the same "funky" crashes. For now
>> >> the cluster remains "stable" without those OSDs.
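>> 
>> For reference, by "removing them completely" I mean roughly the usual
>> sequence below, with osd.X standing in for the actual IDs:
>> 
>>     ceph osd out X
>>     ceph osd crush remove osd.X
>>     ceph auth del osd.X
>>     ceph osd rm X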
>> >>
>> >>
>> >>
>> >>
>> >> On Wed, Nov 23, 2016 at 4:00 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >> > I take it you have size=2 or min_size=1 or something like that for the cache pool? A single OSD shouldn't prevent PGs from recovering.
>> >> >
>> >> > Your best bet would be to see if the PG that is causing the assert
>> >> > can be removed and let the OSD start up. If you are lucky, the PG
>> >> > causing the problems might not be one which also has unfound objects;
>> >> > otherwise you will likely have to get heavily involved in recovering
>> >> > objects with the object store tool.
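>> >> >
>> >> > A rough sketch of what that would look like with the object store
>> >> > tool, run while the OSD is stopped (paths and the PG ID are
>> >> > placeholders; exporting first gives you a way back if the removal
>> >> > makes things worse):
>> >> >
>> >> >     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-X \
>> >> >         --journal-path /var/lib/ceph/osd/ceph-X/journal \
>> >> >         --pgid <pgid> --op export --file /root/<pgid>.export
>> >> >     ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-X \
>> >> >         --journal-path /var/lib/ceph/osd/ceph-X/journal \
>> >> >         --pgid <pgid> --op remove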
>> >> >
>> >> >> -----Original Message-----
>> >> >> From: Daznis [mailto:daznis@xxxxxxxxx]
>> >> >> Sent: 23 November 2016 13:56
>> >> >> To: Nick Fisk <nick@xxxxxxxxxx>
>> >> >> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>> >> >> Subject: Re:  Ceph strange issue after adding a cache OSD.
>> >> >>
>> >> >> No, it's still missing some PGs and objects and can't recover, as
>> >> >> it's blocked by that OSD. I can boot the OSD up by removing all
>> >> >> the PG-related files from the current directory, but that doesn't
>> >> >> solve the missing objects problem. I'm not really sure whether I
>> >> >> can move the objects back into place manually, but I will try it.
>> >> >>
>> >> >> On Wed, Nov 23, 2016 at 3:08 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >> >> > Sorry, I'm afraid I'm out of ideas about that one; that error
>> >> >> > doesn't mean very much to me. The code suggests the OSD is trying
>> >> >> > to get an attr from the disk/filesystem, but for some reason it
>> >> >> > doesn't like what it finds. You could maybe whack the debug logging
>> >> >> > for OSD and filestore up to max and try to see what PG/file is
>> >> >> > accessed just before the crash, but I'm not sure what the fix would
>> >> >> > be, even if you manage to locate the dodgy PG.
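>> >> >> >
>> >> >> > Something along these lines would give you the verbose startup
>> >> >> > logging (the OSD id X is a placeholder; run it in the foreground
>> >> >> > since the daemon dies during startup):
>> >> >> >
>> >> >> >     ceph-osd -i X -f --debug_osd 20/20 --debug_filestore 20/20 --debug_ms 1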
>> >> >> >
>> >> >> > Does the cluster have all PGs recovered now? Unless anyone else
>> >> >> > can comment, you might be best off removing/wiping and then
>> >> >> > re-adding the OSD.
>> >> >> >
>> >> >> >> -----Original Message-----
>> >> >> >> From: Daznis [mailto:daznis@xxxxxxxxx]
>> >> >> >> Sent: 23 November 2016 12:55
>> >> >> >> To: Nick Fisk <nick@xxxxxxxxxx>
>> >> >> >> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>> >> >> >> Subject: Re:  Ceph strange issue after adding a cache OSD.
>> >> >> >>
>> >> >> >> Thank you. That helped quite a lot. Now I'm just stuck with one OSD crashing with:
>> >> >> >>
>> >> >> >> osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*,
>> >> >> >> spg_t, epoch_t*, ceph::bufferlist*)' thread 7f36bbdd6880 time
>> >> >> >> 2016-11-23 13:42:43.278539
>> >> >> >> osd/PG.cc: 2911: FAILED assert(r > 0)
>> >> >> >>
>> >> >> >>  ceph version 0.94.9 (fe6d859066244b97b24f09d46552afc2071e6f90)
>> >> >> >>  1: (ceph::__ceph_assert_fail(char const*, char const*, int,
>> >> >> >> char const*)+0x85) [0xbde2c5]
>> >> >> >>  2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
>> >> >> >> ceph::buffer::list*)+0x8ba) [0x7cf4da]
>> >> >> >>  3: (OSD::load_pgs()+0x9ef) [0x6bd31f]
>> >> >> >>  4: (OSD::init()+0x181a) [0x6c0e8a]
>> >> >> >>  5: (main()+0x29dd) [0x6484bd]
>> >> >> >>  6: (__libc_start_main()+0xf5) [0x7f36b916bb15]
>> >> >> >>  7: /usr/bin/ceph-osd() [0x661ea9]
>> >> >> >>
>> >> >> >> On Wed, Nov 23, 2016 at 12:31 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >> >> >> >> -----Original Message-----
>> >> >> >> >> From: Daznis [mailto:daznis@xxxxxxxxx]
>> >> >> >> >> Sent: 23 November 2016 10:17
>> >> >> >> >> To: nick@xxxxxxxxxx
>> >> >> >> >> Cc: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>> >> >> >> >> Subject: Re:  Ceph strange issue after adding a cache OSD.
>> >> >> >> >>
>> >> >> >> >> Hi,
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> Looks like one of my colleagues increased the PG number
>> >> >> >> >> before the rebalance finished. I was flushing the whole cache
>> >> >> >> >> tier and it's currently stuck on ~80 GB of data because of the
>> >> >> >> >> OSD crashes. I will look into the hitset counts and check what
>> >> >> >> >> can be done, and will provide an update if I find anything or
>> >> >> >> >> fix the issue.
>> >> >> >> >
>> >> >> >> > So I'm guessing that when the PGs split, the stats/hit_sets are
>> >> >> >> > not how the OSD expects them to be, which causes the crash. I
>> >> >> >> > would expect this was caused by the PG splitting rather than by
>> >> >> >> > introducing extra OSDs. If you manage to get things stable by
>> >> >> >> > bumping up the hitset count, then you probably want to try a
>> >> >> >> > scrub to clean up the stats, which may then stop this happening
>> >> >> >> > when the hitset comes round to being trimmed again.
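>> >> >> >> >
>> >> >> >> > i.e. once it stays up, something like the following for each PG
>> >> >> >> > the warning names (15.8d is just the one from your earlier log):
>> >> >> >> >
>> >> >> >> >     ceph pg scrub 15.8d
>> >> >> >> >     ceph pg deep-scrub 15.8d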
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On Wed, Nov 23, 2016 at 12:04 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >> >> >> >> > Hi Daznis,
>> >> >> >> >> >
>> >> >> >> >> > I'm not sure how much help I can be, but I will try my best.
>> >> >> >> >> >
>> >> >> >> >> > I think the post-split stats error is probably benign,
>> >> >> >> >> > although it suggests you also increased the number of PGs
>> >> >> >> >> > in your cache pool? If so, did you do this before or after
>> >> >> >> >> > you added the extra OSDs? This may have been the cause.
>> >> >> >> >> >
>> >> >> >> >> > On to the actual assert: this looks like it's part of the
>> >> >> >> >> > code which trims the tiering hit sets. I don't understand
>> >> >> >> >> > why it's crashing out, but I would imagine it must be
>> >> >> >> >> > related to an invalid or missing hitset.
>> >> >> >> >> >
>> >> >> >> >> > https://github.com/ceph/ceph/blob/v0.94.9/src/osd/ReplicatedPG.cc#L10485
>> >> >> >> >> >
>> >> >> >> >> > The only thing I could think of from looking at the code is
>> >> >> >> >> > that the function loops through all hitsets above the max
>> >> >> >> >> > number (hit_set_count). I wonder if setting this number
>> >> >> >> >> > higher would mean it won't try to trim any hitsets, and let
>> >> >> >> >> > things recover?
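>> >> >> >> >> >
>> >> >> >> >> > Checking and raising it would look something like this, with
>> >> >> >> >> > <cache-pool> standing in for your cache pool name and 32 as
>> >> >> >> >> > an arbitrary higher value:
>> >> >> >> >> >
>> >> >> >> >> >     ceph osd pool get <cache-pool> hit_set_count
>> >> >> >> >> >     ceph osd pool set <cache-pool> hit_set_count 32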
>> >> >> >> >> >
>> >> >> >> >> > DISCLAIMER
>> >> >> >> >> > This is a hunch, it might not work or could possibly even
>> >> >> >> >> > make things worse. Otherwise wait for someone who has a better idea to comment.
>> >> >> >> >> >
>> >> >> >> >> > Nick
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >> -----Original Message-----
>> >> >> >> >> >> From: ceph-users
>> >> >> >> >> >> [mailto:ceph-users-bounces@xxxxxxxxxxxxxx]
>> >> >> >> >> >> On Behalf Of Daznis
>> >> >> >> >> >> Sent: 23 November 2016 05:57
>> >> >> >> >> >> To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
>> >> >> >> >> >> Subject:  Ceph strange issue after adding a cache OSD.
>> >> >> >> >> >>
>> >> >> >> >> >> Hello,
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> The story goes like this.
>> >> >> >> >> >> I have added another 3 drives to the caching layer. OSDs
>> >> >> >> >> >> were added to the crush map one by one after each successful
>> >> >> >> >> >> rebalance. When I added the last OSD and went away for about
>> >> >> >> >> >> an hour, I noticed that it still had not finished rebalancing.
>> >> >> >> >> >> Further investigation showed me that one of the older cache
>> >> >> >> >> >> SSDs was restarting like crazy before it fully booted. So I
>> >> >> >> >> >> shut it down and waited for a rebalance without that OSD.
>> >> >> >> >> >> Less than an hour later I had another 2 OSDs restarting like
>> >> >> >> >> >> crazy. I tried running scrubs on the PGs the logs asked me
>> >> >> >> >> >> to, but that did not help. I'm currently stuck with "8 scrub
>> >> >> >> >> >> errors" and a completely dead cluster.
>> >> >> >> >> >>
>> >> >> >> >> >> log_channel(cluster) log [WRN] : pg 15.8d has invalid
>> >> >> >> >> >> (post-split) stats; must scrub before tier agent can
>> >> >> >> >> >> activate
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> I need help to stop the OSD from crashing. Crash log:
>> >> >> >> >> >>      0> 2016-11-23 06:41:43.365602 7f935b4eb700 -1
>> >> >> >> >> >> osd/ReplicatedPG.cc: In function 'void
>> >> >> >> >> >> ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
>> >> >> >> >> >> thread 7f935b4eb700 time 2016-11-23 06:41:43.363067
>> >> >> >> >> >> osd/ReplicatedPG.cc: 10521: FAILED assert(obc)
>> >> >> >> >> >>
>> >> >> >> >> >>  ceph version 0.94.9
>> >> >> >> >> >> (fe6d859066244b97b24f09d46552afc2071e6f90)
>> >> >> >> >> >>  1: (ceph::__ceph_assert_fail(char const*, char const*, int,
>> >> >> >> >> >> char const*)+0x85) [0xbde2c5]
>> >> >> >> >> >>  2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*,
>> >> >> >> >> >> unsigned
>> >> >> >> >> >> int)+0x75f) [0x87e89f]
>> >> >> >> >> >>  3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f8bb]
>> >> >> >> >> >>  4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a)
>> >> >> >> >> >> [0x8a11aa]
>> >> >> >> >> >>  5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&,
>> >> >> >> >> >> ThreadPool::TPHandle&)+0x68a) [0x83c37a]
>> >> >> >> >> >>  6: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
>> >> >> >> >> >> std::tr1::shared_ptr<OpRequest>,
>> >> >> >> >> >> ThreadPool::TPHandle&)+0x405) [0x69af05]
>> >> >> >> >> >>  7: (OSD::ShardedOpWQ::_process(unsigned int,
>> >> >> >> >> >> ceph::heartbeat_handle_d*)+0x333) [0x69b473]
>> >> >> >> >> >>  8: (ShardedThreadPool::shardedthreadpool_worker(unsigned
>> >> >> >> >> >> int)+0x86f) [0xbcd9cf]
>> >> >> >> >> >>  9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10)
>> >> >> >> >> >> [0xbcfb00]
>> >> >> >> >> >>  10: (()+0x7dc5) [0x7f93b9df4dc5]
>> >> >> >> >> >>  11: (clone()+0x6d) [0x7f93b88d5ced]
>> >> >> >> >> >>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >> >> I have tried looking with full debug enabled, but those
>> >> >> >> >> >> logs didn't help me much. I have tried to evict the cache
>> >> >> >> >> >> layer, but some objects are stuck and can't be removed.
>> >> >> >> >> >> Any suggestions would be greatly appreciated.
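>> >> >> >> >> >>
>> >> >> >> >> >> (By evicting the cache layer I mean roughly the standard
>> >> >> >> >> >> flush/evict, i.e. something like the following, with
>> >> >> >> >> >> <cache-pool> as the cache pool name:)
>> >> >> >> >> >>
>> >> >> >> >> >>     ceph osd tier cache-mode <cache-pool> forward
>> >> >> >> >> >>     rados -p <cache-pool> cache-flush-evict-all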
>> >> >> >> >> >> _______________________________________________
>> >> >> >> >> >> ceph-users mailing list
>> >> >> >> >> >> ceph-users@xxxxxxxxxxxxxx
>> >> >> >> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >> >> >> >
>> >> >> >> >
>> >> >> >
>> >> >
>> >> _______________________________________________
>> >> ceph-users mailing list
>> >> ceph-users@xxxxxxxxxxxxxx
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



