Got it again. However, the stack is exactly the same, with no symbols; the debuginfo didn't resolve. Do I need to do something to enable that?
The server was running with 'debug ms = 10' this time, so there is a bit more spew:
-14> 2016-04-27 21:59:58.811919 7f9e817fa700 1 -- 10.30.1.8:0/3291985349 --> 10.30.2.13:6805/27519 -- osd_op(client.44936150.0:223 obj_delete_at_hint.0000000055 [call timeindex.list] 10.2c88dbcf ack+read+known_if_redirected e100564) v6 -- ?+0 0x7f9f140dc5f0 con 0x7f9f1410ed10
-13> 2016-04-27 21:59:58.812039 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
-12> 2016-04-27 21:59:58.812096 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
-11> 2016-04-27 21:59:58.814343 7f9e3f96a700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).reader wants 211 from dispatch throttler 0/104857600
-10> 2016-04-27 21:59:58.814375 7f9e3f96a700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).aborted = 0
-9> 2016-04-27 21:59:58.814405 7f9e3f96a700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).reader got message 2 0x7f9ec0009250 osd_op_reply(223 obj_delete_at_hint.0000000055 [call] v0'0 uv1448004 _ondisk_ = 0) v6
-8> 2016-04-27 21:59:58.814428 7f9e3f96a700 1 -- 10.30.1.8:0/3291985349 <== osd.6 10.30.2.13:6805/27519 2 ==== osd_op_reply(223 obj_delete_at_hint.0000000055 [call] v0'0 uv1448004 _ondisk_ = 0) v6 ==== 196+0+15 (3849172018 0 2149983739) 0x7f9ec0009250 con 0x7f9f1410ed10
-7> 2016-04-27 21:59:58.814472 7f9e3f96a700 10 -- 10.30.1.8:0/3291985349 dispatch_throttle_release 211 to dispatch throttler 211/104857600
-6> 2016-04-27 21:59:58.814470 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
-5> 2016-04-27 21:59:58.814511 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).write_ack 2
-4> 2016-04-27 21:59:58.814528 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
-3> 2016-04-27 21:59:58.814607 7f9e817fa700 1 -- 10.30.1.8:0/3291985349 --> 10.30.2.13:6805/27519 -- osd_op(client.44936150.0:224 obj_delete_at_hint.0000000055 [call lock.unlock] 10.2c88dbcf ondisk+write+known_if_redirected e100564) v6 -- ?+0 0x7f9f140dc5f0 con 0x7f9f1410ed10
-2> 2016-04-27 21:59:58.814718 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
-1> 2016-04-27 21:59:58.814778 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
0> 2016-04-27 21:59:58.826494 7f9e7e7f4700 -1 *** Caught signal (Segmentation fault) **
in thread 7f9e7e7f4700
ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
1: (()+0x30b0a2) [0x7fa11c5030a2]
2: (()+0xf100) [0x7fa1183fe100]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
<snip>
On Wed, Apr 27, 2016 at 9:39 PM, Ben Hines <bhines@xxxxxxxxx> wrote:
Yes, CentOS 7.2. Happened twice in a row, both times shortly after a restart, so I expect I'll be able to reproduce it. However, I've now tried a bunch of times and it's not happening again.

In any case, I have glibc + ceph-debuginfo installed, so we can get more info if it does happen.

thanks!

On Wed, Apr 27, 2016 at 8:40 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:

It would help to know what binary this is and what OS.

----- Original Message -----
> From: "Karol Mroz" <kmroz@xxxxxxxx>
> To: "Ben Hines" <bhines@xxxxxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> Sent: Wednesday, 27 April, 2016 7:06:56 PM
> Subject: Re: radosgw crash - Infernalis
>
> On Tue, Apr 26, 2016 at 10:17:31PM -0700, Ben Hines wrote:
> [...]
> > --> 10.30.1.6:6800/10350 -- osd_op(client.44852756.0:79
> > default.42048218.<redacted> [getxattrs,stat,read 0~524288] 12.aa730416
> > ack+read+known_if_redirected e100207) v6 -- ?+0 0x7f49c41880b0 con
> > 0x7f49c4145eb0
> > 0> 2016-04-26 22:07:59.685615 7f49a07f0700 -1 *** Caught signal
> > (Segmentation fault) **
> > in thread 7f49a07f0700
> >
> > ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
> > 1: (()+0x30b0a2) [0x7f4c4907f0a2]
> > 2: (()+0xf100) [0x7f4c44f7a100]
> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> > to interpret this.
>
> Hi Ben,
>
> I sense a pretty badly corrupted stack. From the radosgw-9.2.1 (obtained from
> a downloaded rpm):
>
> 000000000030a810 <_Z13pidfile_writePK11md_config_t@@Base>:
> ...
> 30b09d: e8 0e 40 e4 ff callq 14f0b0 <backtrace@plt>
> 30b0a2: 4c 89 ef mov %r13,%rdi
> -------
> ...
>
> So either we tripped backtrace() code from pidfile_write() _or_ we can't
> trust the stack. From the log snippet, it looks that we're far past the point
> at which we would write a pidfile to disk (ie. at process start during
> global_init()).
> Rather, we're actually handling a request and outputting some bit of debug
> message via MOSDOp::print() and beyond...
We know the offset into the binary is 0x30b0a2, but we don't know which
function yet AFAICT. Karol, how did you arrive at pidfile_write? Purely from
the offset? I'm not sure that would be reliable...
This is a segfault, so the address of the frame where we crashed should be the
exact instruction where we crashed. I don't believe a mov from one register to
another that does not involve a dereference ((%r13) as opposed to %r13) can
cause a segfault, so I don't think we are on the right instruction. But then,
as you say, the stack may be corrupt.
>
> Is this something you're able to easily reproduce? More logs with higher log
> levels
> would be helpful... a coredump with radosgw compiled with -g would be
> excellent :)
Agreed, although if this is an rpm based system it should be sufficient to
run the following.
# debuginfo-install ceph glibc
That may give us the name of the function depending on where we are (if we are
in a library, it may require the debuginfo for that library to be loaded).
Karol is right that a coredump would be a good idea in this case and will give
us maximum information about the issue you are seeing.
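For the coredump, something along these lines should work on CentOS 7. The core_pattern path is just one possible choice, and the commands assume root; adjust to your setup:

```shell
# Assumed setup for capturing a core from radosgw (run as root).
# Remove the core size limit for this shell / the radosgw service:
ulimit -c unlimited

# Write cores to a known location, named by executable and pid
# (the path here is an example, not a requirement):
echo '/var/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

# After the next crash, open the core with the debuginfo installed:
#   gdb /usr/bin/radosgw /var/tmp/core.radosgw.<pid>
#   (gdb) bt full
```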
Cheers,
Brad
>
> --
> Regards,
> Karol
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>