Hi James,

On Thu, 12 Sep 2013, James Harper wrote:
> I'm still getting crashes with tapdisk rbd. Most of the time it crashes
> gdb if I try. When I do get something, the crashing thread is always
> segfaulting in pthread_cond_wait and the stack is always corrupt:
>
> (gdb) bt
> #0  0x00007faae20c52d7 in pthread_cond_wait@@GLIBC_2.3.2 () from remote:/lib/x86_64-linux-gnu/libpthread.so.0
> #1  0x00c1c435c10e782c in ?? ()
> #2  0xe0bc294e52000010 in ?? ()
> #3  0x08481380b00400fa in ?? ()
> #4  0x3326aab400000000 in ?? ()
> #5  0x0000000008001e00 in ?? ()
> #6  0x000004043326aab4 in ?? ()
> #7  0x7aef0100040595ef in ?? ()
>
> When I examine the memory on the stack I get something like:
>
> 0x7faae3cc7c10: 0x00 0x00 0x00 0x00 0xb4 0xaa 0x26 0x32
> 0x7faae3cc7c18: 0x00 0x1e 0x00 0x08 0x00 0x00 0x00 0x00
> 0x7faae3cc7c20: 0xb4 0xaa 0x26 0x32 0x04 0x04 0x00 0x00
> 0x7faae3cc7c28: 0xef 0x95 0x05 0x04 0x00 0x01 0xef 0x79
> 0x7faae3cc7c30: 0x06 0x04 0x00 0x00 0x00 0x01 0x2b 0xf8
> 0x7faae3cc7c38: 0x2c 0x78 0x0e 0xc1 0x35 0xc4 0xc1 0x00
> 0x7faae3cc7c40: 0x10 0x00 0x00 0x52 0x4e 0x29 0xbc 0xe0
> 0x7faae3cc7c48: 0xfa 0x00 0x04 0xb0 0x80 0x13 0x48 0x08
> 0x7faae3cc7c50: 0x00 0x00 0x00 0x00 0xb4 0xaa 0x26 0x33
> 0x7faae3cc7c58: 0x00 0x1e 0x00 0x08 0x00 0x00 0x00 0x00
> 0x7faae3cc7c60: 0xb4 0xaa 0x26 0x33 0x04 0x04 0x00 0x00
> 0x7faae3cc7c68: 0xef 0x95 0x05 0x04 0x00 0x01 0xef 0x7a
> 0x7faae3cc7c70: 0x06 0x04 0x00 0x00 0x00 0x01 0x2c 0x38
> 0x7faae3cc7c78: 0x2c 0xb8 0x0e 0xc1 0x35 0xc5 0xc1 0x00
> 0x7faae3cc7c80: 0x10 0x00 0x00 0x52 0x4e 0x29 0xbc 0xe0
> 0x7faae3cc7c88: 0xfa 0x00 0x04 0xb0 0x80 0x13 0x5c 0x08
>
> And I see very similar byte patterns in a tcpdump taken at the time of
> the crash, so I'm wondering if data read from or to be written to the
> network is overflowing a buffer somewhere and corrupting the stack.
>
> Does ceph use a magic start-of-message number or something that I could
> identify?

There isn't a simple magic string I can point to beyond struct
ceph_msg_header itself, and I doubt that will help, since the reader pulls
the header and the message body into different buffers.

I forget: are you able to reproduce any of this with debugging enabled?

I would suggest adding pointers to the debug statements in msg/Pipe.cc to
narrow down what is using this memory. You might also want to look at
read_message, connect, and accept in Pipe.cc, as I think those are the
only places where data is read off the network into a buffer/struct on
the stack.

sage
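
A minimal sketch of the pointer annotation suggested above: log the address
range of each receive buffer so it can be compared against the corrupted
stack range (0x7faae3cc7c10 onward) in the gdb dump. This is a stand-in,
not Ceph code -- in msg/Pipe.cc the same information would be added to the
existing ldout()/dendl debug statements rather than printed with fprintf,
and the buffer name and size here are placeholders.

// Sketch only: stand-alone illustration of logging a buffer's extent so it
// can be matched against the corrupted addresses seen in gdb.
#include <cstdio>

static void log_extent(const char *what, const void *p, size_t len)
{
  // Prints "name [start, end) len"; with this in the reader path you can
  // check whether the corrupted range falls inside, or just past the end
  // of, any receive buffer.
  fprintf(stderr, "%s [%p, %p) len %zu\n",
          what, p, (const void *)((const char *)p + len), len);
}

int main()
{
  char header_buf[128];   // placeholder for a stack buffer in read_message
  log_extent("header", header_buf, sizeof(header_buf));
  return 0;
}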
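
And a sketch of the bug class worth auditing read_message, connect, and
accept for: a length taken off the wire used to fill a fixed-size stack
buffer. The function and variable names below are hypothetical and not
taken from Pipe.cc (which uses its own tcp_read helpers); the point is the
bounds check.

// Sketch only (not Pipe.cc code).  If the length check below were missing,
// a peer-controlled length would let recv() overrun buf and scribble
// network bytes over the stack -- which would look exactly like the
// tcpdump-matching garbage in the crash dump above.
#include <cstdint>
#include <sys/socket.h>
#include <sys/types.h>

ssize_t read_reply(int fd)              // hypothetical helper, not a Ceph API
{
  char buf[256];                        // fixed-size stack buffer
  uint32_t len;
  if (recv(fd, &len, sizeof(len), MSG_WAITALL) != (ssize_t)sizeof(len))
    return -1;
  if (len > sizeof(buf))                // the check whose absence corrupts the stack
    return -1;
  return recv(fd, buf, len, MSG_WAITALL);
}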