Re: radosgw crash - Infernalis

Brad Hubbard <bhubbard@xxxxxxxxxx> · Wed, 27 Apr 2016 23:40:40 -0400 (EDT)

----- Original Message -----
> From: "Karol Mroz" <kmroz@xxxxxxxx>
> To: "Ben Hines" <bhines@xxxxxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> Sent: Wednesday, 27 April, 2016 7:06:56 PM
> Subject: Re:  radosgw crash - Infernalis
> 
> On Tue, Apr 26, 2016 at 10:17:31PM -0700, Ben Hines wrote:
> [...]
> > --> 10.30.1.6:6800/10350 -- osd_op(client.44852756.0:79
> > default.42048218.<redacted> [getxattrs,stat,read 0~524288] 12.aa730416
> > ack+read+known_if_redirected e100207) v6 -- ?+0 0x7f49c41880b0 con
> > 0x7f49c4145eb0
> >      0> 2016-04-26 22:07:59.685615 7f49a07f0700 -1 *** Caught signal
> > (Segmentation fault) **
> >  in thread 7f49a07f0700
> > 
> >  ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
> >  1: (()+0x30b0a2) [0x7f4c4907f0a2]
> >  2: (()+0xf100) [0x7f4c44f7a100]
> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> > to interpret this.
> 
> Hi Ben,
> 
> I sense a pretty badly corrupted stack. From the radosgw-9.2.1 (obtained from
> a downloaded rpm):
> 
> 000000000030a810 <_Z13pidfile_writePK11md_config_t@@Base>:
> ...
>   30b09d:       e8 0e 40 e4 ff          callq  14f0b0 <backtrace@plt>
>   30b0a2:       4c 89 ef                mov    %r13,%rdi
>   -------
> ...
> 
> So either we tripped backtrace() code from pidfile_write() _or_ we can't
> trust the stack. From the log snippet, it looks like we're far past the point
> at which we would write a pidfile to disk (i.e. at process start during
> global_init()).
> Rather, we're actually handling a request and outputting some bit of debug
> message via MOSDOp::print() and beyond...

It would help to know what binary this is and what OS.

We know the offset is 0x30b0a2 but, AFAICT, we don't yet know which function
it falls in. Karol, how did you arrive at pidfile_write? Purely from the
offset? I'm not sure that would be reliable...

This is a segfault, so the address of the frame where we crashed should be the
exact instruction that faulted. I don't believe a mov from one register to
another, with no dereference involved ((%r13) as opposed to %r13), can cause a
segfault, so I don't think we are looking at the right instruction. But then,
as you say, the stack may be corrupt.
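
As a rough illustration of why the offset on its own is ambiguous, here is a
minimal Python sketch that takes the two frame lines from the log and
subtracts each offset from its absolute address. The helper names are mine and
are not part of ceph. It assumes that, when no symbol name is printed, the
offset is relative to the start of whatever object is mapped at that address.

```python
import re

# Frame lines copied from the log quoted above.
FRAMES = """\
 1: (()+0x30b0a2) [0x7f4c4907f0a2]
 2: (()+0xf100) [0x7f4c44f7a100]
"""

# Matches "N: (symbol+0xOFFSET) [0xADDRESS]"; the symbol part may be empty "()".
FRAME_RE = re.compile(r"^\s*(\d+): \((.*)\+0x([0-9a-f]+)\) \[0x([0-9a-f]+)\]\s*$")


def parse_frames(text):
    """Return (frame number, symbol, offset, absolute address) tuples."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append(
                (int(m.group(1)), m.group(2), int(m.group(3), 16), int(m.group(4), 16))
            )
    return frames


def load_base(offset, address):
    # Assumption: with no symbol name, the offset is counted from the start
    # of the mapped object, so address - offset is that object's load base.
    return address - offset


for num, sym, off, addr in parse_frames(FRAMES):
    print("frame %d: symbol=%r offset=%#x base=%#x" % (num, sym, off, load_base(off, addr)))
```

Under that assumption the two frames come out with different bases
(0x7f4c48d74000 and 0x7f4c44f6b000), so they sit in different mapped objects,
and the offset only identifies a function once we know which object it
belongs to.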

> 
> Is this something you're able to easily reproduce? More logs with higher log
> levels would be helpful... a coredump with radosgw compiled with -g would be
> excellent :)

Agreed, although if this is an rpm-based system it should be sufficient to
run the following:

# debuginfo-install ceph glibc

That may give us the name of the function, depending on where we are (if we
are in a library, it may require that the debuginfo for that library be
loaded).
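
Once symbols are available, one way to look up an offset by hand is
addr2line from binutils. The Python sketch below only builds (and optionally
runs) that command. The helper names are hypothetical, and the path
/usr/bin/radosgw is an assumption: we have not yet confirmed that frame 1
belongs to the radosgw binary rather than to a library.

```python
import shutil
import subprocess


def addr2line_cmd(binary, offset):
    # -f prints the function name, -C demangles C++ names, -e selects the file.
    return ["addr2line", "-f", "-C", "-e", binary, hex(offset)]


def resolve(binary, offset):
    """Run addr2line if it is installed; return its output, or None."""
    if shutil.which("addr2line") is None:
        return None
    result = subprocess.run(
        addr2line_cmd(binary, offset), capture_output=True, text=True
    )
    return result.stdout.strip() or None


# Assumed path; substitute whichever object the frame really belongs to.
print(" ".join(addr2line_cmd("/usr/bin/radosgw", 0x30b0a2)))
```

If the frame turns out to be in a shared library, the same call should work
with that library's path in place of the binary.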

Karol is right that a coredump would be a good idea in this case and will give
us maximum information about the issue you are seeing.

Cheers,
Brad

> 
> --
> Regards,
> Karol
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 