I think the issue may be coming from my journal drives after the upgrade to Infernalis. I have 2 SSDs with 6 partitions each, for a total of 12 journals per server. When I create OSDs, I pass the partition names as journals, e.g.:

    ceph-deploy osd prepare x86Ceph7:/dev/sdd:/dev/sdb1

This works, but since the ownership on the journal partitions is not ceph:ceph, everything fails until I run chown ceph:ceph /dev/sda4, and that change does not persist across reboots. Any idea how to fix this? (A possible approach is sketched at the end of this thread.)

Thanks
Pankaj

-----Original Message-----
From: Somnath Roy [mailto:Somnath.Roy@xxxxxxxxxxx]
Sent: Friday, April 29, 2016 9:03 AM
To: Garg, Pankaj; Samuel Just
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: RE: OSD Crashes

Check the system log and search for the corresponding drive. It should have the information on what is failing.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Garg, Pankaj
Sent: Friday, April 29, 2016 8:59 AM
To: Samuel Just
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: OSD Crashes

I can see that. But what would that be symptomatic of? How is it happening on 6 different systems and on multiple OSDs?

-----Original Message-----
From: Samuel Just [mailto:sjust@xxxxxxxxxx]
Sent: Friday, April 29, 2016 8:57 AM
To: Garg, Pankaj
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: OSD Crashes

Your fs is throwing an EIO on open.
-Sam

On Fri, Apr 29, 2016 at 8:54 AM, Garg, Pankaj <Pankaj.Garg@xxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
> I had a fully functional Ceph cluster with 3 x86 nodes and 3 ARM64 nodes, each with 12 HDD drives and 2 SSD drives. All of these were initially running Hammer, and then were successfully upgraded to Infernalis (9.2.0).
>
> I recently deleted all my OSDs and swapped my drives with new ones on the x86 systems, and the ARM servers were swapped with different ones (keeping the drives the same).
>
> I again provisioned the OSDs, keeping the same cluster and Ceph versions as before. But now, every time I try to run RADOS bench, my OSDs start crashing (on both ARM and x86 servers).
>
> I'm not sure why this is happening on all 6 systems. On the x86, it's the same Ceph bits as before, and the only thing different is the new drives.
>
> It's the same stack (pasted below) on all the OSDs too.
>
> Can anyone provide any clues?
>
> Thanks
> Pankaj
>
>    -14> 2016-04-28 08:09:45.423950 7f1ef05b1700 1 -- 192.168.240.117:6820/14377 <== osd.93 192.168.240.116:6811/47080 1236 ==== osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26) v1 ==== 981+0+4759 (3923326827 0 3705383247) 0x5634cbabc400 con 0x5634c5168420
>    -13> 2016-04-28 08:09:45.423981 7f1ef05b1700 5 -- op tracker -- seq: 29404, time: 2016-04-28 08:09:45.423882, event: header_read, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>    -12> 2016-04-28 08:09:45.423991 7f1ef05b1700 5 -- op tracker -- seq: 29404, time: 2016-04-28 08:09:45.423884, event: throttled, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>    -11> 2016-04-28 08:09:45.423996 7f1ef05b1700 5 -- op tracker -- seq: 29404, time: 2016-04-28 08:09:45.423942, event: all_read, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>    -10> 2016-04-28 08:09:45.424001 7f1ef05b1700 5 -- op tracker -- seq: 29404, time: 0.000000, event: dispatched, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>     -9> 2016-04-28 08:09:45.424014 7f1ef05b1700 5 -- op tracker -- seq: 29404, time: 2016-04-28 08:09:45.424014, event: queued_for_pg, op: osd_repop(client.2794263.0:37721 284.6d4 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v 12284'26)
>     -8> 2016-04-28 08:09:45.561827 7f1f15799700 5 osd.102 12284 tick_without_osd_lock
>     -7> 2016-04-28 08:09:45.973944 7f1f0801a700 1 -- 192.168.240.117:6821/14377 <== osd.73 192.168.240.115:0/26572 1306 ==== osd_ping(ping e12284 stamp 2016-04-28 08:09:45.971751) v2 ==== 47+0+0 (846632602 0 0) 0x5634c8305c00 con 0x5634c58dd760
>     -6> 2016-04-28 08:09:45.973995 7f1f0801a700 1 -- 192.168.240.117:6821/14377 --> 192.168.240.115:0/26572 -- osd_ping(ping_reply e12284 stamp 2016-04-28 08:09:45.971751) v2 -- ?+0 0x5634c7ba8000 con 0x5634c58dd760
>     -5> 2016-04-28 08:09:45.974300 7f1f0981d700 1 -- 10.18.240.117:6821/14377 <== osd.73 192.168.240.115:0/26572 1306 ==== osd_ping(ping e12284 stamp 2016-04-28 08:09:45.971751) v2 ==== 47+0+0 (846632602 0 0) 0x5634c8129400 con 0x5634c58dcf20
>     -4> 2016-04-28 08:09:45.974337 7f1f0981d700 1 -- 10.18.240.117:6821/14377 --> 192.168.240.115:0/26572 -- osd_ping(ping_reply e12284 stamp 2016-04-28 08:09:45.971751) v2 -- ?+0 0x5634c617d600 con 0x5634c58dcf20
>     -3> 2016-04-28 08:09:46.174079 7f1f11f92700 0 filestore(/var/lib/ceph/osd/ceph-102) write couldn't open 287.6f9_head/287/ae33fef9/benchmark_data_ceph7_17591_object39895/head: (117) Structure needs cleaning
>     -2> 2016-04-28 08:09:46.174103 7f1f11f92700 0 filestore(/var/lib/ceph/osd/ceph-102) error (117) Structure needs cleaning not handled on operation 0x5634c885df9e (16590.1.0, or op 0, counting from 0)
>     -1> 2016-04-28 08:09:46.174109 7f1f11f92700 0 filestore(/var/lib/ceph/osd/ceph-102) unexpected error code
>      0> 2016-04-28 08:09:46.178707 7f1f11791700 -1 os/FileStore.cc: In function 'int FileStore::lfn_open(coll_t, const ghobject_t&, bool, FDRef*, Index*)' thread 7f1f11791700 time 2016-04-28 08:09:46.173250
> os/FileStore.cc: 335: FAILED assert(!m_filestore_fail_eio || r != -5)
>
> ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5634c02ec7eb]
> 2: (FileStore::lfn_open(coll_t, ghobject_t const&, bool, std::shared_ptr<FDCache::FD>*, Index*)+0x1191) [0x5634bffb2d01]
> 3: (FileStore::_write(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list const&, unsigned int)+0xf0) [0x5634bffbb7b0]
> 4: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x2901) [0x5634bffc6f51]
> 5: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x5634bffcc404]
> 6: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x1a9) [0x5634bffcc5c9]
> 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0x5634c02de10e]
> 8: (ThreadPool::WorkThread::entry()+0x10) [0x5634c02defd0]
> 9: (()+0x8182) [0x7f1f1f91a182]
> 10: (clone()+0x6d) [0x7f1f1dc6147d]
>
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
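
On the journal ownership question above: from Infernalis onward the OSD daemon runs as the ceph user, and the packaged udev rules only chown journal partitions whose GPT type code is the Ceph journal GUID, which is why a manual chown is lost at the next reboot. Below is a minimal sketch of two common workarounds, assuming the journal SSDs are /dev/sdb and /dev/sdc with partitions 1-6 (the device names and the udev rule file name are assumptions, substitute your own):

    # Option 1 (sketch): tag each journal partition with the Ceph journal GPT
    # type GUID so ceph's own udev rules chown it to ceph:ceph at every boot.
    # /dev/sdb and /dev/sdc are assumed journal SSDs - adjust as needed.
    for part in 1 2 3 4 5 6; do
        sudo sgdisk --typecode=${part}:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdb
        sudo sgdisk --typecode=${part}:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdc
    done
    sudo partprobe /dev/sdb /dev/sdc

    # Option 2 (sketch): install a small custom udev rule (hypothetical file
    # name) that sets ownership of the journal partitions at every boot.
    printf '%s\n' \
      'KERNEL=="sdb[1-6]", SUBSYSTEM=="block", OWNER="ceph", GROUP="ceph", MODE="0660"' \
      'KERNEL=="sdc[1-6]", SUBSYSTEM=="block", OWNER="ceph", GROUP="ceph", MODE="0660"' \
      | sudo tee /etc/udev/rules.d/90-ceph-journal.rules
    sudo udevadm control --reload-rules && sudo udevadm trigger

Either way, the ownership should survive reboots; verify with ls -l /dev/sdb1 after the next reboot.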
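
On the "Structure needs cleaning" errors in the trace: errno 117 (EUCLEAN) is what XFS typically returns when it detects on-disk corruption, which fits Sam's point that the fs is failing the open. A minimal check sketch along the lines of Somnath's suggestion, assuming the crashing OSD is osd.102 backed by /dev/sdd1 (substitute the real device, e.g. from ceph-disk list or mount):

    # Look for XFS/controller errors in the kernel log and check the drive itself.
    dmesg -T | grep -iE 'xfs|i/o error|sdd'
    sudo smartctl -a /dev/sdd

    # With the OSD stopped and its filesystem unmounted, run a read-only
    # xfs_repair pass; -n only reports problems and does not modify the disk.
    sudo systemctl stop ceph-osd@102
    sudo umount /var/lib/ceph/osd/ceph-102
    sudo xfs_repair -n /dev/sdd1

If the read-only pass reports damage on several hosts at once, that points at something common to them (controller, cabling, or the newly swapped drives) rather than at Ceph itself.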