Re: Cascading failure on a placement group

Hein-Pieter van Braam <hp@xxxxxx> · Sat, 13 Aug 2016 14:42:23 +0200

Hi,

The timezones on all my systems appear to be the same; I just verified
this by running 'date' on all my boxes.

- HP

On Sat, 2016-08-13 at 12:36 +0000, Goncalo Borges wrote:
> The ticket I mentioned earlier was marked as a duplicate of
> 
> http://tracker.ceph.com/issues/9732
> 
> Cheers
> Goncalo
> 
> From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of
> Goncalo Borges [goncalo.borges@xxxxxxxxxxxxx]
> Sent: 13 August 2016 22:23
> To: Hein-Pieter van Braam; ceph-users
> Subject: Re:  Cascading failure on a placement group
> 
> Hi HP.
> 
> I am just a site admin, so my opinion should be validated by proper
> support staff.
> 
> Seems really similar to
> http://tracker.ceph.com/issues/14399
> 
> The ticket speaks about timezone difference between osds. Maybe it is
> something worthwhile to check?
> 
> Cheers
> Goncalo
> 
> ________________________________________
> From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of
> Hein-Pieter van Braam [hp@xxxxxx]
> Sent: 13 August 2016 21:48
> To: ceph-users
> Subject:  Cascading failure on a placement group
> 
> Hello all,
> 
> My cluster started to lose OSDs without any warning. Whenever an OSD
> becomes the primary for a particular PG, it crashes with the following
> stack trace:
> 
>  ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>  1: /usr/bin/ceph-osd() [0xada722]
>  2: (()+0xf100) [0x7fc28bca5100]
>  3: (gsignal()+0x37) [0x7fc28a6bd5f7]
>  4: (abort()+0x148) [0x7fc28a6bece8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fc28afc29d5]
>  6: (()+0x5e946) [0x7fc28afc0946]
>  7: (()+0x5e973) [0x7fc28afc0973]
>  8: (()+0x5eb93) [0x7fc28afc0b93]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbddcba]
>  10: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)+0x75f) [0x87e48f]
>  11: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
>  12: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a0d1a]
>  13: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x83be4a]
>  14: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x69a5c5]
>  15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
>  16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbcd1cf]
>  17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
>  18: (()+0x7dc5) [0x7fc28bc9ddc5]
>  19: (clone()+0x6d) [0x7fc28a77eced]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> Has anyone ever seen this? Is there a way to fix this? My cluster is in
> rather large disarray at the moment. I have one of the OSDs now in a
> restart loop and that is at least preventing other OSDs from going
> down, but obviously not all other PGs can peer now.
> 
> I'm not sure what else to do at the moment.
> 
> Thank you so much,
> 
> - HP
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com