Re: 7915 is not resolved

Fuel 8.0 will support Hammer; you can grab the packages from:
http://mirror.fuel-infra.org/mos-repos/ubuntu/8.0/pool/main/c/ceph/

or, if you build your own packages with the extra patches, grab the
Debian build scripts from:
https://review.fuel-infra.org/#/c/13879/

That will ensure that your packages work with Fuel.

-- 
Dmitry Borodaenko

On Mon, Jan 11, 2016 at 01:15:50PM -0500, Boris Lukashev wrote:
> Thank you, pulling those into my branch now and kicking off a build.
> In terms of upgrading to Hammer, the documentation looks
> straightforward enough, but given that this is a Fuel-based OpenStack
> deployment, I'm wondering if you've heard of any potential
> compatibility issues from doing so.
> 
> -Boris
> 
> On Mon, Jan 11, 2016 at 12:25 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > On Mon, 11 Jan 2016, Boris Lukashev wrote:
> >> I ran into an incredibly unpleasant loss of a 5-node, 10-OSD Ceph
> >> cluster backing our OpenStack Glance and Cinder services, just by
> >> asking RBD to snapshot one of the volumes.
> >> The conditions under which this occurred are as follows: a bash
> >> script asked Cinder to snapshot two RBD volumes in rapid succession,
> >> which either caused a Nova host (also a Ceph OSD holder) to crash,
> >> or coincided with the crash. On reboot of the host, RBD started
> >> throwing errors, and once all OSDs were restarted they all failed,
> >> crashing with the following:
> >>
> >>     -1> 2016-01-11 16:37:35.401002 7f16f8449700  5 osd.6 pg_epoch:
> >> 84269 pg[2.2c( empty local-les=84219 n=0 ec=1 les/c 84219/84219
> >> 84218/84218/84193) [6,8] r=0 lpr=84261 crt=0'0 mlcod 0'0 peering]
> >> enter Started/Primary/Peering/GetInfo
> >>      0> 2016-01-11 16:37:35.401057 7f16f7c48700 -1
> >> ./include/interval_set.h: In function 'void interval_set<T>::erase(T,
> >> T) [with T = snapid_t]' thread 7f16f7c48700 time 2016-01-11
> >> 16:37:35.398335
> >> ./include/interval_set.h: 386: FAILED assert(_size >= 0)
> >>
> >>  ceph version 0.80.11-19-g130b0f7 (130b0f748332851eb2e3789e2b2fa4d3d08f3006)
> >>  1: (interval_set<snapid_t>::subtract(interval_set<snapid_t>
> >> const&)+0xb0) [0x79d140]
> >>  2: (PGPool::update(std::tr1::shared_ptr<OSDMap const>)+0x656) [0x772856]
> >>  3: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap const>,
> >> std::tr1::shared_ptr<OSDMap const>, std::vector<int,
> >> std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&,
> >> int, PG::RecoveryCtx*)+0x282) [0x772c22]
> >>  4: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&,
> >> PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>,
> >> std::less<boost::intrusive_ptr<PG> >,
> >> std::allocator<boost::intrusive_ptr<PG> > >*)+0x292) [0x6548e2]
> >>  5: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> >
> >> const&, ThreadPool::TPHandle&)+0x20c) [0x6553cc]
> >>  6: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> >
> >> const&, ThreadPool::TPHandle&)+0x18) [0x69c858]
> >>  7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb01) [0xa5ac71]
> >>  8: (ThreadPool::WorkThread::entry()+0x10) [0xa5bb60]
> >>  9: (()+0x8182) [0x7f170def5182]
> >>  10: (clone()+0x6d) [0x7f170c51447d]
> >>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> >> needed to interpret this.
> >>
> >> To me, this looks like the snapshot that was being created when the
> >> Nova host died is causing the assert to fail, since the snap was
> >> never completed and is broken.
> >>
> >> http://tracker.ceph.com/issues/11493, which appears very similar,
> >> is marked as resolved, but on current Firefly (deployed via Fuel
> >> and updated in place with the 0.80.11 debs) this issue hit us on
> >> Saturday.
> >
> > You can try cherry-picking the two commits in wip-11493-b, which make
> > the OSD semi-gracefully tolerate this situation.  This is a bug that
> > has been fixed in hammer, but since the inconsistency has already been
> > introduced, simply upgrading probably won't resolve it.  Nevertheless,
> > after working around this, I'd encourage you to move to hammer, as
> > firefly is at end of life.
> >
> > sage
> >
> >>
> >> What's the way around this? I imagine commenting out that assert may
> >> cause more damage, but we need to get our OSDs and the RBD data in
> >> them back online. Is there a permanent fix in any branch we can
> >> backport? We built this cluster using Fuel, so this affects every
> >> Mirantis user, if not every Ceph user out there, and the vector into
> >> this catastrophic bug is normal daily operations (snapshotting,
> >> apparently)...
> >>
> >> Thank you all for looking over this, advice would be greatly appreciated.
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >>


