----- Original Message -----
> From: "Ben Hines" <bhines@xxxxxxxxx>
> To: "Brad Hubbard" <bhubbard@xxxxxxxxxx>
> Cc: "Karol Mroz" <kmroz@xxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> Sent: Thursday, 28 April, 2016 3:09:16 PM
> Subject: Re: radosgw crash - Infernalis
>
> Got it again - however, the stack is exactly the same, no symbols -
> debuginfo didn't resolve. Do I need to do something to enable that?

It's possible we are in a library for which you don't have debuginfo loaded.
Given the list of libraries that radosgw links to, getting all of the
debuginfo loaded may be a daunting prospect.

The other possibility is that the stack is badly corrupted, as Karol
suggested.

Any chance you can capture a core? You could try setting "ulimit -c
unlimited" and starting radosgw from the command line.

HTH,
Brad

> The server is in 'debug ms=10' this time, so there is a bit more spew:
>
> -14> 2016-04-27 21:59:58.811919 7f9e817fa700 1 -- 10.30.1.8:0/3291985349 --> 10.30.2.13:6805/27519 -- osd_op(client.44936150.0:223 obj_delete_at_hint.0000000055 [call timeindex.list] 10.2c88dbcf ack+read+known_if_redirected e100564) v6 -- ?+0 0x7f9f140dc5f0 con 0x7f9f1410ed10
> -13> 2016-04-27 21:59:58.812039 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
> -12> 2016-04-27 21:59:58.812096 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
> -11> 2016-04-27 21:59:58.814343 7f9e3f96a700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).reader wants 211 from dispatch throttler 0/104857600
> -10> 2016-04-27 21:59:58.814375 7f9e3f96a700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1
> c=0x7f9f1410ed10).aborted = 0
> -9> 2016-04-27 21:59:58.814405 7f9e3f96a700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).reader got message 2 0x7f9ec0009250 osd_op_reply(223 obj_delete_at_hint.0000000055 [call] v0'0 uv1448004 ondisk = 0) v6
> -8> 2016-04-27 21:59:58.814428 7f9e3f96a700 1 -- 10.30.1.8:0/3291985349 <== osd.6 10.30.2.13:6805/27519 2 ==== osd_op_reply(223 obj_delete_at_hint.0000000055 [call] v0'0 uv1448004 ondisk = 0) v6 ==== 196+0+15 (3849172018 0 2149983739) 0x7f9ec0009250 con 0x7f9f1410ed10
> -7> 2016-04-27 21:59:58.814472 7f9e3f96a700 10 -- 10.30.1.8:0/3291985349 dispatch_throttle_release 211 to dispatch throttler 211/104857600
> -6> 2016-04-27 21:59:58.814470 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
> -5> 2016-04-27 21:59:58.814511 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).write_ack 2
> -4> 2016-04-27 21:59:58.814528 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
> -3> 2016-04-27 21:59:58.814607 7f9e817fa700 1 -- 10.30.1.8:0/3291985349 --> 10.30.2.13:6805/27519 -- osd_op(client.44936150.0:224 obj_delete_at_hint.0000000055 [call lock.unlock] 10.2c88dbcf ondisk+write+known_if_redirected e100564) v6 -- ?+0 0x7f9f140dc5f0 con 0x7f9f1410ed10
> -2> 2016-04-27 21:59:58.814718 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519 pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
> -1> 2016-04-27 21:59:58.814778 7f9e3fa6b700 10 -- 10.30.1.8:0/3291985349 >> 10.30.2.13:6805/27519
> pipe(0x7f9f14110010 sd=153 :10861 s=2 pgs=725914 cs=1 l=1 c=0x7f9f1410ed10).writer: state = open policy.server=0
> 0> 2016-04-27 21:59:58.826494 7f9e7e7f4700 -1 *** Caught signal (Segmentation fault) **
> in thread 7f9e7e7f4700
>
> ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
> 1: (()+0x30b0a2) [0x7fa11c5030a2]
> 2: (()+0xf100) [0x7fa1183fe100]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
> --- logging levels ---
> <snip>
>
> On Wed, Apr 27, 2016 at 9:39 PM, Ben Hines <bhines@xxxxxxxxx> wrote:
> > Yes, CentOS 7.2. Happened twice in a row, both times shortly after a
> > restart, so I expect I'll be able to reproduce it. However, I've now
> > tried a bunch of times and it's not happening again.
> >
> > In any case I have glibc + ceph-debuginfo installed so we can get more
> > info if it does happen.
> >
> > thanks!
> >
> > On Wed, Apr 27, 2016 at 8:40 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
> > > ----- Original Message -----
> > > > From: "Karol Mroz" <kmroz@xxxxxxxx>
> > > > To: "Ben Hines" <bhines@xxxxxxxxx>
> > > > Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> > > > Sent: Wednesday, 27 April, 2016 7:06:56 PM
> > > > Subject: Re: radosgw crash - Infernalis
> > > >
> > > > On Tue, Apr 26, 2016 at 10:17:31PM -0700, Ben Hines wrote:
> > > > [...]
> > > > > --> 10.30.1.6:6800/10350 -- osd_op(client.44852756.0:79 default.42048218.<redacted> [getxattrs,stat,read 0~524288] 12.aa730416 ack+read+known_if_redirected e100207) v6 -- ?+0 0x7f49c41880b0 con 0x7f49c4145eb0
> > > > > 0> 2016-04-26 22:07:59.685615 7f49a07f0700 -1 *** Caught signal (Segmentation fault) **
> > > > > in thread 7f49a07f0700
> > > > >
> > > > > ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
> > > > > 1: (()+0x30b0a2) [0x7f4c4907f0a2]
> > > > > 2: (()+0xf100) [0x7f4c44f7a100]
> > > > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> > > > > needed to interpret this.
> > > >
> > > > Hi Ben,
> > > >
> > > > I sense a pretty badly corrupted stack. From the radosgw-9.2.1 binary
> > > > (obtained from a downloaded rpm):
> > > >
> > > > 000000000030a810 <_Z13pidfile_writePK11md_config_t@@Base>:
> > > > ...
> > > > 30b09d: e8 0e 40 e4 ff    callq 14f0b0 <backtrace@plt>
> > > > 30b0a2: 4c 89 ef          mov %r13,%rdi
> > > > -------
> > > > ...
> > > >
> > > > So either we tripped the backtrace() code from pidfile_write() _or_ we
> > > > can't trust the stack. From the log snippet, it looks like we're far
> > > > past the point at which we would write a pidfile to disk (i.e. at
> > > > process start during global_init()). Rather, we're actually handling a
> > > > request and outputting some bit of debug message via MOSDOp::print()
> > > > and beyond...
> > >
> > > It would help to know what binary this is and what OS.
> > >
> > > We know the offset into the function is 0x30b0a2 but we don't know which
> > > function yet AFAICT. Karol, how did you arrive at pidfile_write? Purely
> > > from the offset?
> > > I'm not sure that would be reliable...
> > >
> > > This is a segfault, so the address of the frame where we crashed should
> > > be the exact instruction where we crashed. I don't believe a mov from
> > > one register to another that does not involve a dereference ((%r13) as
> > > opposed to %r13) can cause a segfault, so I don't think we are on the
> > > right instruction, but then, as you say, the stack may be corrupt.
> > >
> > > > Is this something you're able to easily reproduce? More logs with
> > > > higher log levels would be helpful... a coredump with radosgw compiled
> > > > with -g would be excellent :)
> > >
> > > Agreed, although if this is an rpm-based system it should be sufficient
> > > to run the following.
> > >
> > > # debuginfo-install ceph glibc
> > >
> > > That may give us the name of the function depending on where we are (if
> > > we are in a library, it may require the debuginfo for that library to be
> > > loaded).
> > >
> > > Karol is right that a coredump would be a good idea in this case and
> > > will give us maximum information about the issue you are seeing.
> > >
> > > Cheers,
> > > Brad
> > >
> > > > --
> > > > Regards,
> > > > Karol
> > > >
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-users@xxxxxxxxxxxxxx
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
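For anyone following the thread later: the core-capture workflow Brad suggests (raise the core limit, run the daemon in the foreground, then symbolize the core with debuginfo installed) can be sketched roughly as below. This is a sketch only - the radosgw invocation and client name are hypothetical placeholders that depend on your ceph.conf, and it assumes an RPM-based system like Ben's CentOS 7.2 box with the binary at /usr/bin/radosgw:

```shell
# Allow unlimited-size core dumps in this shell. The limit is inherited by
# any process started from it, so a foreground radosgw launched next will
# be able to dump a full core on a segfault.
ulimit -c unlimited

# Confirm the soft limit took effect - should now report "unlimited".
ulimit -c

# Hypothetical invocation: run radosgw in the foreground from this shell
# so the core lands where core_pattern says (often the working directory).
# Adjust --cluster/--name for your deployment:
#   /usr/bin/radosgw -d --cluster ceph --name client.radosgw.gateway
#
# After the crash, with ceph-debuginfo and glibc debuginfo installed,
# symbolize all threads from the core:
#   gdb /usr/bin/radosgw ./core.<pid> -batch -ex 'thread apply all bt'
```

A gdb backtrace taken from the core should resolve frames through libraries too (given their debuginfo), which avoids the guesswork of mapping a raw offset like 0x30b0a2 back to a function by hand.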