OSD Crashes

Hi,

I had a fully functional Ceph cluster with 3 x86 nodes and 3 ARM64 nodes, each with 12 HDD drives and 2 SSD drives. All of these were initially running Hammer and were then successfully upgraded to Infernalis (9.2.0).

I recently deleted all my OSDs. On the x86 systems I swapped the drives with new ones, and the ARM servers were swapped for different machines (keeping the same drives).

I then re-provisioned the OSDs, keeping the same cluster and Ceph version as before. But now, every time I run RADOS bench, my OSDs start crashing (on both the ARM and x86 servers).
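For reference, the benchmark was a plain RADOS bench against a test pool, something along these lines (the pool name, duration and thread count here are placeholders, not necessarily the exact values I used):

    rados bench -p testpool 60 write -t 16 --no-cleanup
    rados bench -p testpool 60 seq -t 16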

I’m not sure why this is happening on all 6 systems. On the x86 nodes it’s the same Ceph bits as before; the only thing that has changed is the new drives.

It’s the same stack trace (pasted below) on all of the crashing OSDs, too.

Can anyone provide any clues?

 

Thanks

Pankaj

  -14> 2016-04-28 08:09:45.423950 7f1ef05b1700  1 -- 192.168.240.117:6820/14377 <== osd.93 192.168.240.116:6811/47080 1236 ==== osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26) v1 ==== 981+0+4759 (3923326827 0 3705383247) 0x5634cbabc400 con 0x5634c5168420

   -13> 2016-04-28 08:09:45.423981 7f1ef05b1700  5 -- op tracker -- seq: 29404, time: 2016-04-28 08:09:45.423882, event: header_read, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)

   -12> 2016-04-28 08:09:45.423991 7f1ef05b1700  5 -- op tracker -- seq: 29404, time: 2016-04-28 08:09:45.423884, event: throttled, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)

   -11> 2016-04-28 08:09:45.423996 7f1ef05b1700  5 -- op tracker -- seq: 29404, time: 2016-04-28 08:09:45.423942, event: all_read, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)

   -10> 2016-04-28 08:09:45.424001 7f1ef05b1700  5 -- op tracker -- seq: 29404, time: 0.000000, event: dispatched, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)

    -9> 2016-04-28 08:09:45.424014 7f1ef05b1700  5 -- op tracker -- seq: 29404, time: 2016-04-28 08:09:45.424014, event: queued_for_pg, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)

    -8> 2016-04-28 08:09:45.561827 7f1f15799700  5 osd.102 12284 tick_without_osd_lock

    -7> 2016-04-28 08:09:45.973944 7f1f0801a700  1 -- 192.168.240.117:6821/14377 <== osd.73 192.168.240.115:0/26572 1306 ==== osd_ping(ping e12284 stamp 2016-04-28 08:09:45.971751) v2 ==== 47+0+0 (846632602 0 0) 0x5634c8305c00 con 0x5634c58dd760

    -6> 2016-04-28 08:09:45.973995 7f1f0801a700  1 -- 192.168.240.117:6821/14377 --> 192.168.240.115:0/26572 -- osd_ping(ping_reply e12284 stamp 2016-04-28 08:09:45.971751) v2 -- ?+0 0x5634c7ba8000 con 0x5634c58dd760

    -5> 2016-04-28 08:09:45.974300 7f1f0981d700  1 -- 10.18.240.117:6821/14377 <== osd.73 192.168.240.115:0/26572 1306 ==== osd_ping(ping e12284 stamp 2016-04-28 08:09:45.971751) v2 ==== 47+0+0 (846632602 0 0) 0x5634c8129400 con 0x5634c58dcf20

    -4> 2016-04-28 08:09:45.974337 7f1f0981d700  1 -- 10.18.240.117:6821/14377 --> 192.168.240.115:0/26572 -- osd_ping(ping_reply e12284 stamp 2016-04-28 08:09:45.971751) v2 -- ?+0 0x5634c617d600 con 0x5634c58dcf20

    -3> 2016-04-28 08:09:46.174079 7f1f11f92700  0 filestore(/var/lib/ceph/osd/ceph-102) write couldn't open 287.6f9_head/287/ae33fef9/benchmark_data_ceph7_17591_object39895/head: (117) Structure needs cleaning

    -2> 2016-04-28 08:09:46.174103 7f1f11f92700  0 filestore(/var/lib/ceph/osd/ceph-102)  error (117) Structure needs cleaning not handled on operation 0x5634c885df9e (16590.1.0, or op 0, counting from 0)

    -1> 2016-04-28 08:09:46.174109 7f1f11f92700  0 filestore(/var/lib/ceph/osd/ceph-102) unexpected error code

     0> 2016-04-28 08:09:46.178707 7f1f11791700 -1 os/FileStore.cc: In function 'int FileStore::lfn_open(coll_t, const ghobject_t&, bool, FDRef*, Index*)' thread 7f1f11791700 time 2016-04-28 08:09:46.173250

os/FileStore.cc: 335: FAILED assert(!m_filestore_fail_eio || r != -5)

 

ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)

1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5634c02ec7eb]

2: (FileStore::lfn_open(coll_t, ghobject_t const&, bool, std::shared_ptr<FDCache::FD>*, Index*)+0x1191) [0x5634bffb2d01]

3: (FileStore::_write(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list const&, unsigned int)+0xf0) [0x5634bffbb7b0]

4: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x2901) [0x5634bffc6f51]

5: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x5634bffcc404]

6: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x1a9) [0x5634bffcc5c9]

7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0x5634c02de10e]

8: (ThreadPool::WorkThread::entry()+0x10) [0x5634c02defd0]

9: (()+0x8182) [0x7f1f1f91a182]

10: (clone()+0x6d) [0x7f1f1dc6147d]

NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
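In case it helps anyone reading along: error (117) "Structure needs cleaning" is EUCLEAN, which the kernel returns when the filesystem detects on-disk corruption (typically XFS, which is what FileStore uses by default), and FileStore then falls into the EIO assert above. A minimal check on one of the affected OSDs would be something like the following (the device name and mount point are placeholders for the actual OSD data partition, not taken from this cluster):

    # kernel messages about filesystem corruption
    dmesg | grep -i xfs

    # drive health on one of the new disks (placeholder device)
    smartctl -a /dev/sdX

    # read-only filesystem check; stop the OSD and unmount first
    umount /var/lib/ceph/osd/ceph-102
    xfs_repair -n /dev/sdX1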

 

