Re: OSD crash during scrub, 0.56.4

I'm afraid I don't.  I don't think I looked when it happened, and
searching for one just now came up empty.  :/  If it happens again,
I'll be sure to keep an eye out for one.
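
(In case it helps anyone else hitting this: a minimal sketch, assuming
Linux/glibc, of what enabling core dumps looks like at the syscall
level.  In practice you'd just put `ulimit -c unlimited` in the OSD's
init script rather than patch anything; this is only to show the
mechanism.)

    #include <sys/resource.h>
    #include <cstdio>

    int main() {
        // Lift the core-file size limit so that abort() (as in the
        // SIGABRT trace quoted below) actually leaves a core to inspect.
        rlimit rl;
        rl.rlim_cur = RLIM_INFINITY;  // soft limit
        rl.rlim_max = RLIM_INFINITY;  // hard limit (raising it needs privilege)
        if (setrlimit(RLIMIT_CORE, &rl) != 0)
            std::perror("setrlimit(RLIMIT_CORE)");
        return 0;
    }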

FWIW, this particular server (1 out of 5) has 8GB *less* RAM than the
others (one bad stick, it seems), and both crashes happened on it.  But
it still has 40GB for 12 OSDs (over 3GB per OSD), so I think that
should be plenty.  Thanks for responding.

 - Travis

On Mon, May 13, 2013 at 4:49 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> On Tue, May 7, 2013 at 9:44 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
>> Hey folks,
>>
>> Saw this crash the other day:
>>
>>  ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca)
>>  1: /usr/bin/ceph-osd() [0x788fba]
>>  2: (()+0xfcb0) [0x7f19d1889cb0]
>>  3: (gsignal()+0x35) [0x7f19d0248425]
>>  4: (abort()+0x17b) [0x7f19d024bb8b]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f19d0b9a69d]
>>  6: (()+0xb5846) [0x7f19d0b98846]
>>  7: (()+0xb5873) [0x7f19d0b98873]
>>  8: (()+0xb596e) [0x7f19d0b9896e]
>>  9: (operator new[](unsigned long)+0x47e) [0x7f19d102db1e]
>>  10: (ceph::buffer::create(unsigned int)+0x67) [0x834727]
>>  11: (ceph::buffer::ptr::ptr(unsigned int)+0x15) [0x834a95]
>>  12: (FileStore::read(coll_t, hobject_t const&, unsigned long,
>> unsigned long, ceph::buffer::list&)+0x1ae) [0x6fbdde]
>>  13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t,
>> bool)+0x347) [0x69ac57]
>>  14: (PG::chunky_scrub()+0x375) [0x69faf5]
>>  15: (PG::scrub()+0x145) [0x6a0e95]
>>  16: (OSD::ScrubWQ::_process(PG*)+0xc) [0x6384ec]
>>  17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x8297e6]
>>  18: (ThreadPool::WorkThread::entry()+0x10) [0x82b610]
>>  19: (()+0x7e9a) [0x7f19d1881e9a]
>>  20: (clone()+0x6d) [0x7f19d0305cbd]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.
>>
>> Appears to have gone down during a scrub?
>>
>> I don't see anything interesting in /var/log/syslog or anywhere else
>> at the same time.  It's actually the second time I've seen this exact
>> stack trace.  The first time was reported here...  (I was going to
>> insert a GMane link, but search.gmane.org appears to be down for me.)
>> For those inclined, the thread was titled "question about mon memory
>> usage", and was also started by me.
>>
>> Any thoughts?  I do plan to upgrade to 0.56.6 when I can.  I'm a
>> little leery of doing it on a production system without a maintenance
>> window, though.  When I went from 0.56.3 --> 0.56.4 on a live
>> cluster, a system using the RBD kernel module kpanic'd.  =)
>
> Do you have a core from when this happened? It was indeed during a
> scrub, but it didn't fail an assert or anything — looks like maybe it
> tried to allocate too much memory or something... :/
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
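
(For reference, a minimal sketch of the failure mode Greg describes,
under the assumption that the allocation in frame 10 was simply too
large for the box -- whether from genuine memory pressure or an
oversized length isn't established by the trace.  The helper name and
length below are made up for illustration; this is not Ceph code.)

    #include <new>
    #include <cstdio>

    // Stand-in for ceph::buffer::create(unsigned int): a raw allocation
    // that throws std::bad_alloc when the request cannot be satisfied.
    static char* create_buffer(unsigned int len) {
        return new char[len];
    }

    int main() {
        // Hypothetical oversized request; on a memory-constrained host
        // this throws.  With no try/catch on the calling thread (as in
        // the scrub worker, frames 13-18), std::terminate runs
        // __gnu_cxx::__verbose_terminate_handler(), which calls abort()
        // and raises SIGABRT -- frames 3-8 of the trace, with no assert
        // involved.
        char* p = create_buffer(0xFFFFFFFFu);
        std::printf("allocated %p\n", static_cast<void*>(p));
        delete[] p;
        return 0;
    }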
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com