Re: OSD crashing over and over, taking cluster down

You appear to be using pool snapshots with radosgw; I suspect that's what is causing the issue. Can you post a longer log? Preferably with

debug osd = 20
debug filestore = 20
debug ms = 1

from startup to crash on an osd?
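
(For reference, a minimal sketch of how these could be applied, assuming a stock Hammer deployment: put them in the [osd] section of ceph.conf and restart the daemon, which is what capturing from startup requires, e.g.

[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1

or, for an OSD that is already running, inject them on the fly with something like

ceph tell osd.5 injectargs '--debug_osd 20 --debug_filestore 20 --debug_ms 1'

where osd.5 is just an example id; injected settings only apply while that daemon stays up, so they will not cover the startup phase.)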
-Sam

----- Original Message -----
From: "Daniel Schneller" <daniel.schneller@xxxxxxxxxxxxxxxx>
To: ceph-users@xxxxxxxxxxxxxx
Sent: Tuesday, May 19, 2015 4:29:42 AM
Subject:  OSD crashing over and over, taking cluster down



Last night our Hammer cluster suffered a series of OSD crashes on all cluster nodes. 
We were running Hammer (0.94.1-98-g7df3eb5, built because we had a major problem a week ago that we suspected was related to bugs we had found in the tracker whose fixes were not yet in 0.94.1). In the meantime we downgraded to the official 0.94.1 to rule out that we had introduced any instability by running a HEAD build. No change. 

Around 22:00 our users started reporting that the web application was down, and at the same time we found lots of OSD crashes and restarts. 
The stack trace in the log looks like this on all of them (taken at a later time, when I had increased debug_osd=20): 
2015-05-19 03:38:35.449476 7f7aed260700 -1 osd/osd_types.h: In function 'void ObjectContext::RWState::put_excl(std::list<std::tr1::shared_ptr<OpRequest> >*)' thread 7f7aed260700 time 2015-05-19 03:38:35.445434
osd/osd_types.h: 3167: FAILED assert(state == RWEXCL)

 ceph version 0.94.1-98-g7df3eb5 (7df3eb5e548f7b95ec53d3b9d0e43a863d6fe682)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc2b8b]
 2: /usr/bin/ceph-osd() [0x8b9b05]
 3: (ReplicatedPG::remove_repop(ReplicatedPG::RepGather*)+0xec) [0x84516c]
 4: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x912) [0x857082]
 5: (ReplicatedPG::repop_all_applied(ReplicatedPG::RepGather*)+0x16d) [0x857bbd]
 6: (Context::complete(int)+0x9) [0x6caf09]
 7: (ReplicatedBackend::op_applied(ReplicatedBackend::InProgressOp*)+0x1ec) [0xa081dc]
 8: (Context::complete(int)+0x9) [0x6caf09]
 9: (ReplicatedPG::BlessedContext::finish(int)+0x94) [0x8af634]
 10: (Context::complete(int)+0x9) [0x6caf09]
 11: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0x94) [0x70b764]
 12: (C_ContextsBase<Context, Context>::complete(int)+0x9) [0x6cb759]
 13: (Finisher::finisher_thread_entry()+0x158) [0xaef528]
 14: (()+0x8182) [0x7f7afd540182]
 15: (clone()+0x6d) [0x7f7afbaabfbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 

Right before the crash, log lines like these were output (see also: http://pastebin.com/W0srF0bW ): 
2015-05-19 03:38:35.445346 7f7aed260700 15 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23683 active+clean] do_osd_op_effects client.141679501 con 0x13fcd180
2015-05-19 03:38:35.445356 7f7aed260700 10 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23684 active+clean]  removing repgather(0x52f63900 34697'23687 rep_tid=656357 committed?=1 applied?=1 lock=2 op=osd_op(client.141679501.0:19791 default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5 [call refcount.put] 81.ce36d9f9 ondisk+write+known_if_redirected e34697) v5)
2015-05-19 03:38:35.445370 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23684 active+clean]    q front is repgather(0x52f63900 34697'23687 rep_tid=656357 committed?=1 applied?=1 lock=2 op=osd_op(client.141679501.0:19791 default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5 [call refcount.put] 81.ce36d9f9 ondisk+write+known_if_redirected e34697) v5)
2015-05-19 03:38:35.445381 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23684 active+clean] remove_repop repgather(0x52f63900 34697'23687 rep_tid=656357 committed?=1 applied?=1 lock=2 op=osd_op(client.141679501.0:19791 default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5 [call refcount.put] 81.ce36d9f9 ondisk+write+known_if_redirected e34697) v5)
2015-05-19 03:38:35.445393 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23684 active+clean]  obc obc(ce36d9f9/default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5/head//81(dne) rwstate(excl n=1 w=0))
2015-05-19 03:38:35.445401 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23684 active+clean]  clone_obc obc(ce36d9f9/default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5/c8//81 rwstate(write n=1 w=0))
2015-05-19 03:38:35.445409 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 34688'23684 mlcod 34688'23684 active+clean]  snapset_obc obc(ce36d9f9/default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5/snapdir//81 rwstate(write n=1 w=0)) 
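
For anyone trying to read the rwstate output above: the assert in the backtrace fires inside ObjectContext::RWState::put_excl(), i.e. while releasing a lock that the code expects to be held in exclusive mode. A rough illustrative sketch of that invariant (paraphrased from the names in the stack trace, not the exact Hammer source) looks like this:

// Illustrative sketch only; types and names follow the stack trace,
// the real osd/osd_types.h differs in detail.
#include <cassert>
#include <list>
#include <memory>

struct OpRequest {};
using OpRequestRef = std::shared_ptr<OpRequest>;  // std::tr1::shared_ptr<OpRequest> in Hammer

struct RWState {
  enum State { RWNONE, RWREAD, RWWRITE, RWEXCL } state = RWNONE;
  int count = 0;                     // current holders of the lock
  std::list<OpRequestRef> waiters;   // ops queued behind the lock

  void put_excl(std::list<OpRequestRef> *requeue) {
    assert(state == RWEXCL);         // the assert reported at osd/osd_types.h:3167
    if (--count == 0) {
      state = RWNONE;                // fully released: hand queued ops back to the caller
      if (requeue)
        requeue->splice(requeue->end(), waiters);
    }
  }
};

Note that in the log lines above the clone_obc and snapset_obc are reported as rwstate(write n=1 w=0) rather than excl, so a put_excl() on one of them during remove_repop() would be consistent with the FAILED assert(state == RWEXCL) in the backtrace.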
We cannot pinpoint an exact trigger, but there _seems_ to be some correlation with larger uploads into the RGW. This is not yet fully validated, merely an assumption based on timing. Could it be that the RGW code causes OSDs to fail, either through a big upload alone or in conjunction with other parallel requests? We are seeing more crashes than large uploads, though, so this remains a guess at best. 

We created bug ticket http://tracker.ceph.com/issues/11677 for this, but would be extremely thankful for quicker help. 
I am in the #ceph IRC channel as dschneller, too. 

Thanks! 
Daniel 


-- 
Daniel Schneller 
Infrastructure Engineer / Developer 

CenterDevice GmbH 
Merscheider Straße 1, 42699 Solingen, Deutschland 
tel: +49 1754155711 
daniel.schneller@xxxxxxxxxxxxxxx | www.centerdevice.de 





_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com