Re: Cascading failure on a placement group

Hi HP

My 2 cents again.

In

> http://tracker.ceph.com/issues/9732

There is a comment from Samuel saying "This...is not resolved! The utime_t->hobject_t mapping is timezone dependent. Needs to be not timezone dependent when generating the archive object names."

The way I read it is that you will get problems if, at some point, your timezone was different (since it is used to generate the archive object names), even if everything is in the same timezone now. So I guess it could be worthwhile to check whether, around the time of the first failures, the timezone wasn't different, even if it is OK now.

In other words, keep in mind that it is not only the current timezone that matters, but also whether it was ever different in the past.
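
To illustrate what I mean, a minimal standalone sketch (not the actual Ceph code): formatting the same timestamp through localtime() instead of gmtime() yields a different string depending on the host's TZ, which is the kind of mismatch a timezone-dependent utime_t->hobject_t mapping could produce for the hit-set archive object names.

// Sketch only, not Ceph code: shows how localtime() vs gmtime() formatting
// of the same timestamp diverges when TZ differs (or has changed over time).
#include <cstdio>
#include <ctime>

int main() {
    std::time_t t = 1471100000;              // a fixed epoch timestamp
    std::tm local_tm = *std::localtime(&t);  // depends on the host's TZ
    std::tm utc_tm   = *std::gmtime(&t);     // timezone independent

    char local_buf[64], utc_buf[64];
    std::strftime(local_buf, sizeof(local_buf), "%Y-%m-%d_%H:%M:%S", &local_tm);
    std::strftime(utc_buf,   sizeof(utc_buf),   "%Y-%m-%d_%H:%M:%S", &utc_tm);

    // If TZ ever differed between hosts (or changed on one host), the
    // "local" string changes while the UTC one stays stable.
    std::printf("local: %s\nutc:   %s\n", local_buf, utc_buf);
    return 0;
}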

Cheers
________________________________________
From: Hein-Pieter van Braam [hp@xxxxxx]
Sent: 13 August 2016 22:42
To: Goncalo Borges; ceph-users
Subject: Re:  Cascading failure on a placement group

Hi,

The timezones on all my systems appear to be the same; I just verified
it by running 'date' on all my boxes.
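
(For what it's worth, a quick standalone check like the sketch below, compiled and run on each box, prints just the UTC offset and zone abbreviation, which is easy to diff across hosts. It only shows the current setting, of course, not what it may have been in the past.)

// Sketch only: print the host's current UTC offset and timezone abbreviation.
#include <cstdio>
#include <ctime>

int main() {
    std::time_t now = std::time(nullptr);
    std::tm local_tm = *std::localtime(&now);   // reflects the host's TZ setting

    char buf[64];
    std::strftime(buf, sizeof(buf), "%z %Z", &local_tm);  // e.g. "+0000 UTC"
    std::puts(buf);
    return 0;
}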

- HP

On Sat, 2016-08-13 at 12:36 +0000, Goncalo Borges wrote:
> The ticket I mentioned earlier was marked as a duplicate of
>
> http://tracker.ceph.com/issues/9732
>
> Cheers
> Goncalo
>
> From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of
> Goncalo Borges [goncalo.borges@xxxxxxxxxxxxx]
> Sent: 13 August 2016 22:23
> To: Hein-Pieter van Braam; ceph-users
> Subject: Re:  Cascading failure on a placement group
>
> Hi HP.
>
> I am just a site admin, so my opinion should be validated by proper
> support staff.
>
> Seems really similar to
> http://tracker.ceph.com/issues/14399
>
> The ticket speaks about timezone differences between OSDs. Maybe it is
> something worthwhile to check?
>
> Cheers
> Goncalo
>
> ________________________________________
> From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of
> Hein-Pieter van Braam [hp@xxxxxx]
> Sent: 13 August 2016 21:48
> To: ceph-users
> Subject:  Cascading failure on a placement group
>
> Hello all,
>
> My cluster started to lose OSDs without any warning, whenever an OSD
> becomes the primary for a particular PG it crashes with the following
> stacktrace:
>
>  ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>  1: /usr/bin/ceph-osd() [0xada722]
>  2: (()+0xf100) [0x7fc28bca5100]
>  3: (gsignal()+0x37) [0x7fc28a6bd5f7]
>  4: (abort()+0x148) [0x7fc28a6bece8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fc28afc29d5]
>  6: (()+0x5e946) [0x7fc28afc0946]
>  7: (()+0x5e973) [0x7fc28afc0973]
>  8: (()+0x5eb93) [0x7fc28afc0b93]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x27a) [0xbddcba]
>  10: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
> int)+0x75f) [0x87e48f]
>  11: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
>  12: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a)
> [0x8a0d1a]
>  13: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&,
> ThreadPool::TPHandle&)+0x68a) [0x83be4a]
>  14: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405)
> [0x69a5c5]
>  15: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
>  16: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> int)+0x86f)
> [0xbcd1cf]
>  17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
>  18: (()+0x7dc5) [0x7fc28bc9ddc5]
>  19: (clone()+0x6d) [0x7fc28a77eced]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
> Has anyone ever seen this? Is there a way to fix this? My cluster is
> in
> rather large disarray at the moment. I have one of the OSDs now in a
> restart loop and that is at least preventing other OSDs from going
> down, but obviously not all other PGs can peer now.
>
> I'm not sure what else to do at the moment.
>
> Thank you so much,
>
> - HP
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


