RE: debugging librbd async

James Harper <james.harper@xxxxxxxxxxxxxxxx> · Fri, 16 Aug 2013 05:38:09 +0000

> On Fri, 16 Aug 2013, James Harper wrote:
> > I'm testing out the tapdisk rbd that Sylvain wrote under Xen, and have
> > been having all sorts of problems as the tapdisk process is segfaulting. To
> > make matters worse, any attempt to use gdb on the resulting core just tells
> > me it can't find the threads ('generic error'). Google tells me that I can get
> > around this error by linking the main exe (tapdisk) with libpthread, but that
> > doesn't help.
> >
> > With strategic printf's I have confirmed that in most cases the crash
> > happens after a call to rbd_aio_read or rbd_aio_write and before the
> > callback is called. Given the async nature of tapdisk it's impossible to be sure
> > but I'm confident that the crash is not happening in any of the tapdisk code.
> > It's possible that there is an off-by-one error in a buffer somewhere with the
> > corruption showing up later but there really isn't a lot of code there and I've
> > been over it very closely and it appears quite sound.
> >
> > I have also tested for multiple completions of the same request, and
> > corrupt pointers being passed into the completion routine, and nothing
> > shows up there either.
> >
> > In most cases there is nothing pre-empting the crash, aside from a
> > tendency to seemingly crash more often when the cluster is disturbed (eg a
> > mon node is rebooted). I have one VM which will be unbootable for long
> > periods of time with the crash happening during boot, typically when
> > postgres starts. This can be reproduced for hours and is useful for debugging,
> > but then suddenly the problem goes away spontaneously and I can no longer
> > reproduce it even after hundreds of reboots.
> >
> > I'm using Debian and the problem exists with both the latest cuttlefish and
> > dumpling deb's.
> >
> > So... does librbd have any internal self-checking options I can enable? If I'm
> > going to start injecting printf's around the place, can anyone suggest what
> > code paths are most likely to be causing the above?
> 
> This is usually about the time when we try running things under
> valgrind.  Is that an option with tapdisk?

Never used it before. I guess I can find out pretty easily; I'll try that next.

> Of course, the old standby is to just crank up the logging detail and try
> to narrow down where the crash happens.  Have you tried that yet?

I haven't touched the rbd code. Is increased logging a compile-time option or a config option?

> There is a probable issue with aio_flush and caching enabled that Mike
> Dawson is trying to reproduce.  Are you running with caching on or off?

I have not enabled caching, and I believe it's disabled by default.
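
For reference, a check for multiple completions of the same request and for corrupt pointers reaching the completion routine, as mentioned in the quoted message, might look roughly like the sketch below. This is only an illustration and not the actual tapdisk code: the struct, the magic value, and the function names are all hypothetical, and it ignores locking, which would matter if completions are delivered on another thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag stored in every request so that a stale or
 * corrupt pointer handed to the completion routine can be spotted. */
#define REQ_MAGIC 0x52424452u

enum req_state { REQ_IDLE, REQ_IN_FLIGHT, REQ_COMPLETED };

/* Hypothetical per-request bookkeeping; a real driver would embed
 * this next to its own request data. */
struct tracked_req {
    uint32_t magic;
    enum req_state state;
};

/* Call just before handing the request to the async read/write. */
static void req_submit(struct tracked_req *r)
{
    r->magic = REQ_MAGIC;
    r->state = REQ_IN_FLIGHT;
}

/* Call first thing in the completion routine.
 * Returns 0 if this is the first completion of an in-flight request,
 * -1 if the pointer is NULL or does not carry the magic tag,
 * -2 if the request is not in flight (e.g. completed twice). */
static int req_complete(struct tracked_req *r)
{
    if (r == NULL || r->magic != REQ_MAGIC)
        return -1;
    if (r->state != REQ_IN_FLIGHT)
        return -2;
    r->state = REQ_COMPLETED;
    return 0;
}
```

A non-zero return from req_complete() would be the point to log and abort, so the failure is caught at the completion rather than wherever the corruption shows up later.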

Thanks

James