On Tue, 7 Sep 2010, Leander Yu wrote:
> Hi Sage,
> We had 3 OSDs fail first. Two of them have the same log as osd3;
> the other one is osd7.
> We just use it as a file system; I wrote a script that repeatedly writes
> a 10G file and then deletes it.
> The only unusual thing I saw in the log was that the journal file was
> full before the assertion trace.
>
> The other thing is that when I tried to restart the failed osds, more
> osds crashed on other nodes. However, we didn't get the core dump :(
> Thanks.

Are you able to reproduce the osd3 crash (FAILED
assert(caller_ops.count(e.reqid) == 0)) by restarting osds?  If so, can
you do so after adding

	debug ms = 1
	debug osd = 20
	debug filestore = 20

to the [osd] section of your ceph.conf, and sending the osd log somewhere
(via URL, private email, whatever)?

Thanks!
sage

>
> Regards,
> Leander Yu.
>
>
> On Tue, Sep 7, 2010 at 2:42 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > Hi,
> >
> > This is one we've seen before, issue #326:
> >
> >   http://tracker.newdream.net/issues/326
> >
> > Was that the first (and only?) osd to fail?
> >
> > What kind of workload were you subjecting the cluster to?  Just the
> > file system?  RBD?  Anything unusual?
> >
> > Also, can you confirm which version of the code you were running?  The
> > osd log at /var/log/ceph/osd.*.log should have a version number and
> > sha1 id, something like
> >
> >   ceph version 0.22~rc (3cd9d853cd58c79dc12427be8488e57970abda04)
> >
> > Thanks!
> > sage
> >
> >
> > On Mon, 6 Sep 2010, Leander Yu wrote:
> >
> >> Hi all,
> >> I have set up a Ceph cluster with 10 OSDs, 2 MDSes, and 3 monitors.
> >> It ran fine at first, but after a day some of the OSDs crashed with
> >> the following failed assertions.
> >> I am using the unstable trunk; ceph.conf is attached.
> >>
> >> -------------- osd 3 -----------------
> >> osd/PG.h: In function 'void PG::IndexedLog::index(PG::Log::Entry&)':
> >> osd/PG.h:429: FAILED assert(caller_ops.count(e.reqid) == 0)
> >>  1: (OSD::_process_pg_info(unsigned int, int, PG::Info&, PG::Log&,
> >> PG::Missing&, std::map<int, MOSDPGInfo*, std::less<int>,
> >> std::allocator<std::pair<int const, MOSDPGInfo*> > >*, int&)+0xb06)
> >> [0x4cf426]
> >>  2: (OSD::handle_pg_log(MOSDPGLog*)+0xa9) [0x4cf999]
> >>  3: (OSD::_dispatch(Message*)+0x3ed) [0x4e7dfd]
> >>  4: (OSD::ms_dispatch(Message*)+0x39) [0x4e86c9]
> >>  5: (SimpleMessenger::dispatch_entry()+0x789) [0x46b5f9]
> >>  6: (SimpleMessenger::DispatchThread::entry()+0x1c) [0x45849c]
> >>  7: (Thread::_entry_func(void*)+0xa) [0x46c0ca]
> >>  8: (()+0x6a3a) [0x7f69fd39ea3a]
> >>  9: (clone()+0x6d) [0x7f69fc5bc77d]
> >>
> >> -------------- osd 7 --------------------
> >> osd/ReplicatedPG.cc: In function 'void ReplicatedPG::sub_op_pull(MOSDSubOp*)':
> >> osd/ReplicatedPG.cc:3021: FAILED assert(r == 0)
> >>  1: (OSD::dequeue_op(PG*)+0x344) [0x4e6fd4]
> >>  2: (ThreadPool::worker()+0x28f) [0x5b5a9f]
> >>  3: (ThreadPool::WorkThread::entry()+0xd) [0x4f0acd]
> >>  4: (Thread::_entry_func(void*)+0xa) [0x46c0ca]
> >>  5: (()+0x6a3a) [0x7efff4f12a3a]
> >>  6: (clone()+0x6d) [0x7efff413077d]
> >>
> >> Please let me know if you need more information. I have kept the
> >> environment around so I can collect more data for debugging.
> >>
> >> Thanks.
> >>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
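
For reference, the debug settings suggested above would sit in ceph.conf
roughly as in the sketch below. Only the three debug lines come from the
thread; the comment and any surrounding layout are illustrative, and the
rest of the [osd] section (paths, hosts, etc.) from the poster's attached
ceph.conf would remain unchanged.

```ini
[osd]
        ; verbose logging for reproducing the crash, per Sage's suggestion
        debug ms = 1
        debug osd = 20
        debug filestore = 20
```

With these levels set, restarting the failed osds should produce a much
more detailed /var/log/ceph/osd.*.log to attach to the bug report.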