7915 is not resolved

I ran into an incredibly unpleasant loss of a 5-node, 10-OSD Ceph cluster
backing our OpenStack Glance and Cinder services, simply by asking RBD to
snapshot one of the volumes.

The conditions under which this occurred are as follows: a bash script asked
Cinder to snapshot RBD volumes in rapid succession (2 of them), which either
caused a nova host (also holding Ceph OSDs) to crash, or coincided with that
crash. After the host rebooted, RBD started throwing errors, and once all
OSDs were restarted, they all fail, crashing with the following:

    -1> 2016-01-11 16:37:35.401002 7f16f8449700  5 osd.6 pg_epoch:
84269 pg[2.2c( empty local-les=84219 n=0 ec=1 les/c 84219/84219
84218/84218/84193) [6,8] r=0 lpr=84261 crt=0'0 mlcod 0'0 peering]
enter Started/Primary/Peering/GetInfo
     0> 2016-01-11 16:37:35.401057 7f16f7c48700 -1
./include/interval_set.h: In function 'void interval_set<T>::erase(T,
T) [with T = snapid_t]' thread 7f16f7c48700 time 2016-01-11
16:37:35.398335
./include/interval_set.h: 386: FAILED assert(_size >= 0)

 ceph version 0.80.11-19-g130b0f7 (130b0f748332851eb2e3789e2b2fa4d3d08f3006)
 1: (interval_set<snapid_t>::subtract(interval_set<snapid_t>
const&)+0xb0) [0x79d140]
 2: (PGPool::update(std::tr1::shared_ptr<OSDMap const>)+0x656) [0x772856]
 3: (PG::handle_advance_map(std::tr1::shared_ptr<OSDMap const>,
std::tr1::shared_ptr<OSDMap const>, std::vector<int,
std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&,
int, PG::RecoveryCtx*)+0x282) [0x772c22]
 4: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&,
PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>,
std::less<boost::intrusive_ptr<PG> >,
std::allocator<boost::intrusive_ptr<PG> > >*)+0x292) [0x6548e2]
 5: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> >
const&, ThreadPool::TPHandle&)+0x20c) [0x6553cc]
 6: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> >
const&, ThreadPool::TPHandle&)+0x18) [0x69c858]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb01) [0xa5ac71]
 8: (ThreadPool::WorkThread::entry()+0x10) [0xa5bb60]
 9: (()+0x8182) [0x7f170def5182]
 10: (clone()+0x6d) [0x7f170c51447d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

To me, this looks like the snapshot that was being created when the nova
host died is causing the assert to fail, since the snap was never
completed and is broken.
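To make the failure mode concrete, here is a tiny standalone C++ sketch
(a hypothetical toy class, not the real Ceph interval_set, just illustrative
bookkeeping) of how subtracting a snap interval that was never tracked can
drive the tracked size negative and trip an assert like the one above:

#include <cassert>
#include <cstdint>
#include <map>

// toy_interval_set: hypothetical stand-in for an interval set that keeps a
// running total of how many ids it contains.
struct toy_interval_set {
    std::map<uint64_t, uint64_t> m;  // start -> length
    int64_t size = 0;

    void insert(uint64_t start, uint64_t len) {
        m[start] = len;
        size += len;
    }

    // erase() assumes the caller only removes ranges that are present; the
    // size bookkeeping is unconditional, mirroring the kind of precondition
    // the real code asserts on.
    void erase(uint64_t start, uint64_t len) {
        auto it = m.find(start);
        if (it != m.end() && it->second == len)
            m.erase(it);
        size -= len;
        assert(size >= 0);  // analogous to FAILED assert(_size >= 0)
    }
};

int main() {
    toy_interval_set removed_snaps;
    removed_snaps.insert(1, 1);   // snap 1 recorded as removed
    removed_snaps.erase(1, 1);    // fine: size goes 1 -> 0
    removed_snaps.erase(2, 1);    // snap 2 was never tracked: size -> -1, assert fires
    return 0;
}

From the backtrace, the subtraction that hits this is the one PGPool::update()
performs on removed snaps while processing a new OSDMap, which is why every
OSD dies again as soon as it peers.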

http://tracker.ceph.com/issues/11493, which appears very similar, is
marked as resolved, but with current Firefly (deployed via Fuel and
updated in place with 0.80.11 debs) this issue hit us on Saturday.

What's the way around this? I imagine commenting out that assert may
cause more damage, but we need to get our OSDs, and the RBD data in
them, back online. Is there a permanent fix in any branch we can
backport? We built this cluster using Fuel, so this affects every
Mirantis user, if not every Ceph user out there, and the vector into
this catastrophic bug is normal daily operation (a snapshot,
apparently)....
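For the sake of discussion, the alternative to commenting the assert out
entirely would presumably be a guard that only erases what is actually
present and logs the mismatch. A hypothetical standalone sketch of that idea
(plain std::set of snap ids, not Ceph code and not a proposed patch):

#include <cstdint>
#include <iostream>
#include <set>

using snapid = uint64_t;

// subtract_tolerant: hypothetical helper that drops only the ids the local
// set actually contains and warns about the rest, instead of asserting.
void subtract_tolerant(std::set<snapid>& local,
                       const std::set<snapid>& to_remove) {
    for (snapid s : to_remove) {
        if (local.erase(s) == 0)
            std::cerr << "warning: snap " << s
                      << " already absent locally, skipping\n";
    }
}

int main() {
    std::set<snapid> local = {1, 3};
    std::set<snapid> incoming = {1, 2, 3};  // 2 was never tracked locally
    subtract_tolerant(local, incoming);     // warns about 2 instead of crashing
    return 0;
}

Whether silently skipping the missing interval is actually safe for the PG's
snap bookkeeping is exactly what I don't know, and presumably why the assert
is there in the first place.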

Thank you all for looking over this, advice would be greatly appreciated.