Re: OSD crash during script, 0.56.4

Gregory Farnum <greg@xxxxxxxxxxx> · Mon, 13 May 2013 13:49:22 -0700



On Tue, May 7, 2013 at 9:44 AM, Travis Rhoden <trhoden@xxxxxxxxx> wrote:
> Hey folks,
>
> Saw this crash the other day:
>
>  ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca)
>  1: /usr/bin/ceph-osd() [0x788fba]
>  2: (()+0xfcb0) [0x7f19d1889cb0]
>  3: (gsignal()+0x35) [0x7f19d0248425]
>  4: (abort()+0x17b) [0x7f19d024bb8b]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f19d0b9a69d]
>  6: (()+0xb5846) [0x7f19d0b98846]
>  7: (()+0xb5873) [0x7f19d0b98873]
>  8: (()+0xb596e) [0x7f19d0b9896e]
>  9: (operator new[](unsigned long)+0x47e) [0x7f19d102db1e]
>  10: (ceph::buffer::create(unsigned int)+0x67) [0x834727]
>  11: (ceph::buffer::ptr::ptr(unsigned int)+0x15) [0x834a95]
>  12: (FileStore::read(coll_t, hobject_t const&, unsigned long, unsigned long, ceph::buffer::list&)+0x1ae) [0x6fbdde]
>  13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool)+0x347) [0x69ac57]
>  14: (PG::chunky_scrub()+0x375) [0x69faf5]
>  15: (PG::scrub()+0x145) [0x6a0e95]
>  16: (OSD::ScrubWQ::_process(PG*)+0xc) [0x6384ec]
>  17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x8297e6]
>  18: (ThreadPool::WorkThread::entry()+0x10) [0x82b610]
>  19: (()+0x7e9a) [0x7f19d1881e9a]
>  20: (clone()+0x6d) [0x7f19d0305cbd]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> Appears to have gone down during a scrub?
>
> I don't see anything interesting in /var/log/syslog or anywhere else
> at the same time.  It's actually the second time I've seen this exact
> stack trace.  First time was reported here...  (was going to insert
> GMane link, but search.gmane.org appears to be down for me).  Well,
> for those inclined, the thread was titled "question about mon memory
> usage", and was also started by me.
>
> Any thoughts?  I do plan to upgrade to 0.56.6 when I can.  I'm a
> little leery of doing it on a production system without a maintenance
> window, though.  When I went from 0.56.3 --> 0.56.4 on a live system,
> a system using the RBD kernel module kpanic'd.  =)

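Do you have a core from when this happened? It was indeed during a
scrub, but it didn't fail an assert or anything. The trace goes from
operator new[] (frame 9) straight into the terminate handler and
abort(), so it looks like maybe it tried to allocate too much memory
inside FileStore::read... :/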
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
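
For readers following the trace: frames 4-9 are consistent with an allocation failure that nobody caught. When operator new[] cannot satisfy a request it throws std::bad_alloc, and an uncaught exception ends in std::terminate(), which in libstdc++ reaches __verbose_terminate_handler and then abort(). The sketch below is not Ceph code. The helper name try_alloc is hypothetical and only illustrates the same failure being caught rather than aborting the process.

```cpp
#include <cstddef>
#include <new>

// Hypothetical helper, NOT part of Ceph: request `len` bytes through a plain
// operator new[] call (the same function seen in frame 9 of the trace), but
// catch the failure instead of letting it propagate to std::terminate().
bool try_alloc(std::size_t len) {
    try {
        void* p = ::operator new[](len);  // throws std::bad_alloc on failure
        ::operator delete[](p);           // release immediately; we only probe
        return true;
    } catch (const std::bad_alloc&) {
        // Without this handler the exception would be uncaught, and the
        // process would abort as in frames 4-8 of the trace.
        return false;
    }
}
```

A small request such as try_alloc(64) should succeed on any healthy system, while a request near the top of the address space, for example try_alloc(static_cast<std::size_t>(-1) / 2), is expected to fail. Whether the OSD saw an absurdly large read length or was simply out of memory cannot be told from the trace alone, which is why a core file would help.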