Re: OSDs continuously crashing with v9.2.1

Ana Aviles <ana@xxxxxxxxxxxx> · Fri, 6 May 2016 18:29:01 +0200



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512


On 05/06/2016 02:45 PM, Sage Weil wrote:
> On Fri, 6 May 2016, Ana Aviles wrote:
> Hello,
> 
> We are currently seeing instability on a backup cluster, and we
> believe it is due to the latest Ceph version, 9.2.1
> (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd). OSDs keep crashing with
> segfaults, which eventually leaves some of them down or leaves the
> cluster in strange states, such as having unfound objects.
> 
> [Fri May  6 09:45:09 2016] ceph-osd[17588]: segfault at 0 ip 00007f2bbc5e692a sp 00007f2ba8905060 error 4 in libtcmalloc.so.4.1.2[7f2bbc5c3000+43000]
> [Fri May  6 09:45:09 2016] init: ceph-osd (ceph/72) main process (16509) killed by SEGV signal
> [Fri May  6 09:45:09 2016] init: ceph-osd (ceph/72) main process ended, respawning
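For anyone comparing their own crashes against this one, the kernel segfault line above can be reduced to an offset inside the library by subtracting the mapping base from the instruction pointer. The following is a minimal Python sketch, not part of the original report. The regular expression and the function name `fault_offset` are my own, and it assumes the usual x86 `segfault at ... ip ... sp ... error ... in lib[base+size]` layout shown above.

```python
import re

# Kernel segfault line copied from the report above (timestamp removed).
LINE = (
    "ceph-osd[17588]: segfault at 0 ip 00007f2bbc5e692a sp 00007f2ba8905060 "
    "error 4 in libtcmalloc.so.4.1.2[7f2bbc5c3000+43000]"
)

# Hypothetical pattern for the layout above; all numbers are hexadecimal
# except the error code.
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>\d+) in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offset(line):
    """Return (library name, offset of the faulting instruction in its mapping)."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    size = int(m.group("size"), 16)
    offset = ip - base
    # Sanity check: the instruction pointer must fall inside the mapping.
    if not 0 <= offset < size:
        raise ValueError("instruction pointer is outside the reported mapping")
    return m.group("lib"), offset


lib, offset = fault_offset(LINE)
print(lib, hex(offset))
```

With matching debug symbols installed, an offset like this can usually be handed to a tool such as `addr2line -e <library> <offset>`. Whether that resolves to a useful symbol depends on how the library was built.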
> 
> Our nodes run Ubuntu 14.04.4 LTS. Two of them run Ceph version
> 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299), while the other
> two run Ceph version 9.2.1
> (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd). Only the OSDs on v9.2.1
> keep segfaulting. On some of them we see:
> 
> ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
> 1: (()+0x7d1aca) [0x7f42100b3aca]
> 2: (()+0x10340) [0x7f420e7c6340]
> 3: (tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)+0x103) [0x7f420e9f7923]
> 4: (tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*, unsigned long)+0x1b) [0x7f420e9f79db]
> 5: (tc_free()+0x1f8) [0x7f420ea052c8]
> 6: (()+0x50451) [0x7f420e4cc451]
> 7: (PK11_FreeSlotList()+0x9) [0x7f420e4cc479]
> 8: (PK11_GetAllTokens()+0x1cc) [0x7f420e4cec5c]
> 9: (PK11_GetBestSlotMultipleWithAttributes()+0x23b) [0x7f420e4cf06b]
> 10: (PK11_GetBestSlot()+0x1f) [0x7f420e4cf0df]
> 11: (CryptoAES::get_key_handler(ceph::buffer::ptr const&, std::string&)+0x1f4) [0x7f42100d3484]
> 12: (CryptoKey::_set_secret(int, ceph::buffer::ptr const&)+0xcc) [0x7f42100d25fc]
> 13: (CryptoKey::decode(ceph::buffer::list::iterator&)+0xa2) [0x7f42100d2922]
> 14: (void decode_decrypt_enc_bl<CephXServiceTicket>(CephContext*, CephXServiceTicket&, CryptoKey, ceph::buffer::list&, std::string&)+0x4a5) [0x7f42100c0f05]
> 15: (int decode_decrypt<CephXServiceTicket>(CephContext*, CephXServiceTicket&, CryptoKey const&, ceph::buffer::list::iterator&, std::string&)+0x1cf) [0x7f42100c12df]
> 16: (CephXTicketHandler::verify_service_ticket_reply(CryptoKey&, ceph::buffer::list::iterator&)+0xdb) [0x7f42100bb5ab]
> 17: (CephXTicketManager::verify_service_ticket_reply(CryptoKey&, ceph::buffer::list::iterator&)+0x122) [0x7f42100bd442]
> 18: (CephxClientHandler::handle_response(int, ceph::buffer::list::iterator&)+0xef4) [0x7f421024a2b4]
> 19: (MonClient::handle_auth(MAuthReply*)+0xce) [0x7f421014589e]
> 20: (MonClient::ms_dispatch(Message*)+0x297) [0x7f4210147b27]
> 21: (DispatchQueue::entry()+0x63a) [0x7f421025683a]
> 22: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f4210180ecd]
> 23: (()+0x8182) [0x7f420e7be182]
> 24: (clone()+0x6d) [0x7f420cb0547d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 
> This is the same error that was reported 8 days ago:
> http://tracker.ceph.com/issues/15628
> 
> 
> Here is the log of one of the down OSDs:
> http://pastebin.com/dcHKrE8f
> 
> We would now like to downgrade all nodes to version 9.2.0, since
> OSDs keep going down and we sometimes end up with OSDs that have
> corrupted metadata. However, it looks like it is not possible to
> downgrade a Ceph version?
> 
>> Our goal is to make downgrades within a stable series possible,
>> but we have not tested them for infernalis.
> 
>> There was one fix in the auth code that may affect this.  I
>> backported it to infernalis and pushed it as the
>> wip-auth-infernalis branch. The packages should show up on
>> gitbuilder.ceph.com in an hour or so.  Can you give those a try?
> 
>> http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/wip-auth-infernalis

Thanks!

We just installed them and so far so good. We'll keep an eye on
the OSDs and report back if we see the crashes happening again.

> 
>> We haven't seen this crash at all in any of our testing.  :(
> 
> Besides that, we also see "wrong node!" messages in most of our
> OSD logs (on nodes with both v9.2.1 and v9.2.0). We don't know if
> they are related, or if we should also have a look at that.
> 
> 2016-05-05 15:30:16.994946 7f7272cc3700  0 -- 
> [2a00:c6c0:0:120::201]:6893/5870 >>
> [2a00:c6c0:0:120::202]:6807/10502 pipe(0x7f72cc272000 sd=24 :53006
> s=1 pgs=309 cs=19 l=0 c=0x7f72d23f31e0).connect claims to be
> [2a00:c6c0:0:120::202]:6807/4013 not
> [2a00:c6c0:0:120::202]:6807/10502 - wrong node!
> 
>> These are harmless--they're just there because OSDs are
>> restarting and reusing some of the same ports.
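As an illustration of the explanation above, a "wrong node!" line can be checked mechanically: if the claimed and expected addresses share the same IP and port and differ only in the number after the slash, the peer most likely restarted on the same port. The following is a minimal Python sketch, not something from the Ceph tree. The pattern, the function name `classify`, and the reading of the trailing number as a per-instance nonce are assumptions on my part.

```python
import re

# Tail of the log line quoted above.
MESSAGE = (
    "connect claims to be [2a00:c6c0:0:120::202]:6807/4013 "
    "not [2a00:c6c0:0:120::202]:6807/10502 - wrong node!"
)

# Hypothetical pattern for "[ipv6]:port/nonce"; {0} is a group-name prefix.
ADDR = r"\[(?P<{0}_ip>[0-9a-f:]+)\]:(?P<{0}_port>\d+)/(?P<{0}_nonce>\d+)"
PATTERN = re.compile(
    "claims to be " + ADDR.format("got") + " not " + ADDR.format("want")
)


def classify(message):
    """Say whether the peer differs only by nonce or is a different endpoint."""
    m = PATTERN.search(message)
    if m is None:
        raise ValueError("not a 'wrong node!' message")
    same_endpoint = (
        m.group("got_ip") == m.group("want_ip")
        and m.group("got_port") == m.group("want_port")
    )
    if same_endpoint and m.group("got_nonce") != m.group("want_nonce"):
        # Same IP and port, new instance: consistent with a restarted daemon.
        return "same ip:port, different nonce"
    return "different endpoint"


print(classify(MESSAGE))
```

For the line quoted above, both addresses are `[2a00:c6c0:0:120::202]:6807` and only the trailing number changes (4013 versus 10502), which fits the restart-and-port-reuse explanation.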
> 
>> sage
> 
> 
> 
> Thanks!
> 
> 
> 
>> 
>> -- To unsubscribe from this list: send the line "unsubscribe
>> ceph-devel" in the body of a message to
>> majordomo@xxxxxxxxxxxxxxx More majordomo info at
>> http://vger.kernel.org/majordomo-info.html
>> 

- -- 
Ana Avilés
Greenhost - sustainable hosting & digital security
E: ana@xxxxxxxxxxxx
T: +31 20 4890444
W: https://greenhost.nl
-----BEGIN PGP SIGNATURE-----

iQEcBAEBCgAGBQJXLMZFAAoJEOUdSHwFo2bgw9IH/iCforwStrJFIO3i33QXuu0b
N0HgmInlUc0DvkrurysrK+3wcK2jAnkgIoy3ESN+pj62X9QlSiHcQGhEknLoW0JS
NOzh7yB2srX6UQKKqm6RU7E7lQ9eO1OK1rQRFi4q1mVQU+y0yOk0YS6JXm8/+4gf
rRN1p7LRHEVIQF9X2zn+FmXHP9z22LCHX4/8RDwnx4uEYwhSijBDPq4pmxFgWABJ
OpWs3/HxZuQZpnDhKHfzizK1LpWR27paZjpwiVC2gYsed8V+Nat5mmsRs9cl2VIM
N+OlDHVSklPGa/QytZzFVhIOs/bY1VwigmdSQ51SSztWWbmC4ddK2kJU+PKMtUQ=
=61AE
-----END PGP SIGNATURE-----