On Fri, 6 May 2016, Ana Aviles wrote:
> Hello,
>
> We are currently experiencing an unstable cluster on a backup cluster,
> we believe it is due to the latest Ceph version 9.2.1
> (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd). OSDs keep on crashing,
> segfaulting, which eventually leads some of them to be down, or leave
> the cluster on strange scenarios like having unfound objects.
>
> [Fri May 6 09:45:09 2016] ceph-osd[17588]: segfault at 0 ip
> 00007f2bbc5e692a sp 00007f2ba8905060 error 4 in
> libtcmalloc.so.4.1.2[7f2bbc5c3000+43000]
> [Fri May 6 09:45:09 2016] init: ceph-osd (ceph/72) main process (16509)
> killed by SEGV signal
> [Fri May 6 09:45:09 2016] init: ceph-osd (ceph/72) main process ended,
> respawning
>
> Our nodes run Ubuntu 14.04.4 LTS, and two of them Ceph version 9.2.0
> (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299) while the other two run ceph
> version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd). Only on
> v9.2.1 do OSDs keep segfaulting. On some of them we see:
>
> ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
> 1: (()+0x7d1aca) [0x7f42100b3aca]
> 2: (()+0x10340) [0x7f420e7c6340]
> 3:
> (tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
> unsigned long, int)+0x103) [0x7f420e9f7923]
> 4:
> (tcmalloc::ThreadCache::ListTooLong(tcmalloc::ThreadCache::FreeList*,
> unsigned long)+0x1b) [0x7f420e9f79db]
> 5: (tc_free()+0x1f8) [0x7f420ea052c8]
> 6: (()+0x50451) [0x7f420e4cc451]
> 7: (PK11_FreeSlotList()+0x9) [0x7f420e4cc479]
> 8: (PK11_GetAllTokens()+0x1cc) [0x7f420e4cec5c]
> 9: (PK11_GetBestSlotMultipleWithAttributes()+0x23b) [0x7f420e4cf06b]
> 10: (PK11_GetBestSlot()+0x1f) [0x7f420e4cf0df]
> 11: (CryptoAES::get_key_handler(ceph::buffer::ptr const&,
> std::string&)+0x1f4) [0x7f42100d3484]
> 12: (CryptoKey::_set_secret(int, ceph::buffer::ptr const&)+0xcc)
> [0x7f42100d25fc]
> 13: (CryptoKey::decode(ceph::buffer::list::iterator&)+0xa2)
> [0x7f42100d2922]
> 14: (void decode_decrypt_enc_bl<CephXServiceTicket>(CephContext*,
> CephXServiceTicket&, CryptoKey, ceph::buffer::list&,
> std::string&)+0x4a5) [0x7f42100c0f05]
> 15: (int decode_decrypt<CephXServiceTicket>(CephContext*,
> CephXServiceTicket&, CryptoKey const&, ceph::buffer::list::iterator&,
> std::string&)+0x1cf) [0x7f42100c12df]
> 16: (CephXTicketHandler::verify_service_ticket_reply(CryptoKey&,
> ceph::buffer::list::iterator&)+0xdb) [0x7f42100bb5ab]
> 17: (CephXTicketManager::verify_service_ticket_reply(CryptoKey&,
> ceph::buffer::list::iterator&)+0x122) [0x7f42100bd442]
> 18: (CephxClientHandler::handle_response(int,
> ceph::buffer::list::iterator&)+0xef4) [0x7f421024a2b4]
> 19: (MonClient::handle_auth(MAuthReply*)+0xce) [0x7f421014589e]
> 20: (MonClient::ms_dispatch(Message*)+0x297) [0x7f4210147b27]
> 21: (DispatchQueue::entry()+0x63a) [0x7f421025683a]
> 22: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f4210180ecd]
> 23: (()+0x8182) [0x7f420e7be182]
> 24: (clone()+0x6d) [0x7f420cb0547d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
> Which is the same error reported 8 days ago
> http://tracker.ceph.com/issues/15628
>
>
> Here is the log of one of the down OSDs: http://pastebin.com/dcHKrE8f
>
> Now we would like to downgrade to version 9.2.0 all nodes, since we keep
> on having osds down and sometimes OSDs with corrupted metadata. However,
> it looks like it is not possible to downgrade a Ceph version?

Our goal is to make downgrades within a stable series possible, but we
have not tested them for infernalis.

There was one fix in the auth code that may affect this. I backported it
to infernalis and pushed it as the wip-auth-infernalis branch. The
packages should show up on gitbuilder.ceph.com in an hour or so. Can you
give those a try?

http://gitbuilder.ceph.com/ceph-deb-trusty-x86_64-basic/ref/wip-auth-infernalis

We haven't seen this crash at all in any of our testing. :(

> Besides that, we also have "wrong node!" messages on most of our osd
> logs (on both nodes with v9.2.1 and v9.2.0). We don't know if it is
> related, or if we should also have a look at that.
>
> 2016-05-05 15:30:16.994946 7f7272cc3700 0 --
> [2a00:c6c0:0:120::201]:6893/5870 >> [2a00:c6c0:0:120::202]:6807/10502
> pipe(0x7f72cc272000 sd=24 :53006 s=1 pgs=309 cs=19 l=0
> c=0x7f72d23f31e0).connect claims to be [2a00:c6c0:0:120::202]:6807/4013
> not [2a00:c6c0:0:120::202]:6807/10502 - wrong node!

These are harmless--they're just there because OSDs are restarting and
reusing some of the same ports.

sage

>
> Thanks!
>
> --
> Ana Avilés
> Greenhost - sustainable hosting & digital security
> E: ana@xxxxxxxxxxxx
> T: +31 20 4890444
> W: https://greenhost.nl
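
For anyone triaging the "wrong node!" lines quoted above, here is a minimal sketch (not part of Ceph) that checks whether the claimed and expected peer addresses differ only in the number after the slash. It assumes the log wording shown in the quoted line, with the wrapped line joined back together, and it assumes that the trailing number is a per-instance nonce that changes when a daemon restarts. The names `WRONG_NODE`, `split_addr`, and `classify` are hypothetical helpers made up for this illustration.

```python
import re

# Assumption: the messenger logs the mismatch in exactly this wording,
# as in the quoted log line (joined back onto a single line).
WRONG_NODE = re.compile(
    r"connect claims to be (?P<claimed>\S+) not (?P<expected>\S+) - wrong node!"
)

# The quoted log line from the message, unwrapped.
SAMPLE = (
    "2016-05-05 15:30:16.994946 7f7272cc3700 0 -- "
    "[2a00:c6c0:0:120::201]:6893/5870 >> [2a00:c6c0:0:120::202]:6807/10502 "
    "pipe(0x7f72cc272000 sd=24 :53006 s=1 pgs=309 cs=19 l=0 "
    "c=0x7f72d23f31e0).connect claims to be [2a00:c6c0:0:120::202]:6807/4013 "
    "not [2a00:c6c0:0:120::202]:6807/10502 - wrong node!"
)


def split_addr(addr):
    """Split '[ip]:port/nonce' into ('[ip]:port', nonce)."""
    endpoint, _, nonce = addr.rpartition("/")
    return endpoint, int(nonce)


def classify(line):
    """Return a short label for a 'wrong node!' line, or None if no match."""
    match = WRONG_NODE.search(line)
    if match is None:
        return None
    claimed_ep, claimed_nonce = split_addr(match.group("claimed"))
    expected_ep, expected_nonce = split_addr(match.group("expected"))
    if claimed_ep == expected_ep and claimed_nonce != expected_nonce:
        # Same IP and port, different instance number: consistent with a
        # daemon that restarted and came back on the same port.
        return "same endpoint, new nonce"
    return "endpoint differs"


if __name__ == "__main__":
    print(classify(SAMPLE))
```

In the quoted line both addresses share the endpoint [2a00:c6c0:0:120::202]:6807 and differ only after the slash (4013 versus 10502), which matches the restart-and-reuse explanation given in the reply.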