Re: radosgw crash - Infernalis

Ben Hines <bhines@xxxxxxxxx> · Wed, 27 Apr 2016 21:39:15 -0700

Yes, CentOS 7.2. It happened twice in a row, both times shortly after a restart, so I expected to be able to reproduce it. However, I've now tried a bunch of times and it's not happening again.
In any case, I have glibc + ceph-debuginfo installed so we can get more info if it does happen.

thanks!

On Wed, Apr 27, 2016 at 8:40 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
----- Original Message -----
> From: "Karol Mroz" <kmroz@xxxxxxxx>
> To: "Ben Hines" <bhines@xxxxxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> Sent: Wednesday, 27 April, 2016 7:06:56 PM
> Subject: Re: radosgw crash - Infernalis
>
> On Tue, Apr 26, 2016 at 10:17:31PM -0700, Ben Hines wrote:
> [...]
> > --> 10.30.1.6:6800/10350 -- osd_op(client.44852756.0:79
> > default.42048218.<redacted> [getxattrs,stat,read 0~524288] 12.aa730416
> > ack+read+known_if_redirected e100207) v6 -- ?+0 0x7f49c41880b0 con
> > 0x7f49c4145eb0
> >      0> 2016-04-26 22:07:59.685615 7f49a07f0700 -1 *** Caught signal
> > (Segmentation fault) **
> >  in thread 7f49a07f0700
> >
> >  ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
> >  1: (()+0x30b0a2) [0x7f4c4907f0a2]
> >  2: (()+0xf100) [0x7f4c44f7a100]
> >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> > to interpret this.
>
> Hi Ben,
>
> I sense a pretty badly corrupted stack. From the radosgw-9.2.1 binary
> (obtained from a downloaded rpm):
>
> 000000000030a810 <_Z13pidfile_writePK11md_config_t@@Base>:
> ...
>   30b09d:       e8 0e 40 e4 ff          callq  14f0b0 <backtrace@plt>
>   30b0a2:       4c 89 ef                mov    %r13,%rdi
>   -------
> ...
>
> So either we tripped backtrace() code from pidfile_write() _or_ we can't
> trust the stack. From the log snippet, it looks like we're far past the point
> at which we would write a pidfile to disk (i.e. at process start during
> global_init()).
> Rather, we're actually handling a request and outputting some bit of debug
> message via MOSDOp::print() and beyond...

It would help to know what binary this is and what OS.

We know the offset is 0x30b0a2 but we don't know which function it belongs
to yet AFAICT. Karol, how did you arrive at pidfile_write? Purely from
the offset? I'm not sure that would be reliable...
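
For reference, here is a minimal sketch of what the two numbers in frame 1 of the log give us. It assumes the value after the "+" is relative to the load address of the containing object, which is how glibc's backtrace_symbols() reports a frame when it cannot resolve a symbol name. The helper name parse_frame and the constant PIDFILE_WRITE_START are made up for illustration; the constant is copied from Karol's objdump listing.

```python
import re

# Matches frames as printed in the log, e.g. "1: (()+0x30b0a2) [0x7f4c4907f0a2]".
FRAME_RE = re.compile(r"\((.*?)\+0x([0-9a-fA-F]+)\)\s+\[0x([0-9a-fA-F]+)\]")


def parse_frame(line):
    """Return (symbol, offset, address) for one backtrace line (hypothetical helper)."""
    m = FRAME_RE.search(line)
    if m is None:
        raise ValueError("not a backtrace frame: %r" % line)
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16)


symbol, offset, address = parse_frame("1: (()+0x30b0a2) [0x7f4c4907f0a2]")

# Assumption: the offset is relative to the object's load address, so
# subtracting it from the absolute address recovers that load address.
load_base = address - offset

# Start address of the symbol shown in the objdump listing (pidfile_write).
PIDFILE_WRITE_START = 0x30A810
distance = offset - PIDFILE_WRITE_START

print("symbol=%r load_base=%#x distance=%#x" % (symbol, load_base, distance))
```

Under that assumption the load address comes out page-aligned, and the frame sits 0x892 bytes past the start of pidfile_write. That would be a surprisingly long way into a function that only writes a pidfile, so the name may simply be the nearest symbol objdump could find rather than the function we were really in.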

This is a segfault so the address of the frame where we crashed should be the
exact instruction where we crashed. I don't believe a mov from one register to
another that does not involve a dereference ((%r13) as opposed to %r13) can
cause a segfault so I don't think we are on the right instruction but then, as
you say, the stack may be corrupt.
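
One way to sanity-check which instruction that frame address refers to is to decode the call just before it, using only the bytes in the objdump listing. This is a sketch, not a disassembler: decode_call_rel32 is a hypothetical helper that handles only the 5-byte "e8 rel32" near call.

```python
import struct


def decode_call_rel32(addr, raw):
    """Decode an x86-64 near call (opcode e8 + signed 32-bit displacement).

    Returns (return_address, call_target). Hypothetical helper for illustration.
    """
    if len(raw) != 5 or raw[0] != 0xE8:
        raise ValueError("not an e8 rel32 call")
    (rel32,) = struct.unpack("<i", raw[1:])  # little-endian signed displacement
    return_address = addr + len(raw)         # address of the next instruction
    call_target = return_address + rel32     # displacement is relative to it
    return return_address, call_target


# Address and bytes copied from the objdump line at 30b09d.
ret, target = decode_call_rel32(0x30B09D, bytes.fromhex("e80e40e4ff"))
print("return address %#x, call target %#x" % (ret, target))
```

The decoded return address is 0x30b0a2 and the target is 0x14f0b0, matching the listing. So 0x30b0a2 is where execution resumes after the call to backtrace@plt returns, which would be consistent with frame 1 being the code that collected the backtrace rather than the instruction that actually faulted.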

>
> Is this something you're able to easily reproduce? More logs with higher log
> levels would be helpful... a coredump with radosgw compiled with -g would be
> excellent :)

Agreed, although if this is an rpm-based system it should be sufficient to
run the following.

# debuginfo-install ceph glibc

That may give us the name of the function depending on where we are (if we are
in a library it may require the debuginfo for that library to be loaded).

Karol is right that a coredump would be a good idea in this case and will give
us maximum information about the issue you are seeing.

Cheers,
Brad

>
> --
> Regards,
> Karol
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
