Hi Sage,

Thanks for the tip. That was indeed the source of the confusion: moving everything (including the compilation) to 12.2.2 solved the issue.

Would it be possible to bump the library's major (SONAME) version each time the ABI changes (or even to generate a new one on each compilation)? RPM relies on the SONAME to generate package dependencies, and an unchanged SONAME is what let this mismatch through:

# rpm -q cta-lib -R | grep rados
...
librados.so.2()(64bit)
...
# ls -l /usr/lib64/librados*
lrwxrwxrwx. 1 root root      17 Dec 18 15:36 /usr/lib64/librados.so.2 -> librados.so.2.0.0
-rwxr-xr-x. 1 root root 1522232 Nov 30 17:15 /usr/lib64/librados.so.2.0.0
lrwxrwxrwx. 1 root root      24 Dec 18 15:37 /usr/lib64/libradosstriper.so.1 -> libradosstriper.so.1.0.0
-rwxr-xr-x. 1 root root 1066440 Nov 30 17:15 /usr/lib64/libradosstriper.so.1.0.0
lrwxrwxrwx. 1 root root      20 Dec 18 15:36 /usr/lib64/librados_tp.so.2 -> librados_tp.so.2.0.0
-rwxr-xr-x. 1 root root  878464 Nov 30 17:15 /usr/lib64/librados_tp.so.2.0.0

That would be really helpful for admins.
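To illustrate the mechanism (a sketch with a made-up version number; this is not how the Ceph build is actually invoked): bumping the SONAME at link time is all that RPM's automatic find-provides/find-requires need to catch the mismatch.

$ # Hypothetical: the ABI changed, so the library is linked with a new SONAME
$ g++ -shared -fPIC -Wl,-soname,librados.so.3 -o librados.so.3.0.0 *.o
$ objdump -p librados.so.3.0.0 | grep SONAME
  SONAME               librados.so.3

The library package would then provide librados.so.3()(64bit), a freshly rebuilt application would require exactly that, and RPM would refuse to install the mismatched pair instead of silently accepting it as above.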
Thanks for the quick answer!

Eric

> -----Original Message-----
> From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
> Sent: Monday, December 18, 2017 15:45
> To: Eric Cano <Eric.Cano@xxxxxxx>
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: Segfault when connecting to cluster using Rados API (problem with pick_a_shard()?)
>
> On Mon, 18 Dec 2017, Eric Cano wrote:
> > Hi everyone,
> >
> > We experience segfaults when connecting to the Rados cluster from our
> > application. The problem was first encountered when switching from
> > 12.2.0 to 12.2.2. We downgraded to 12.2.1, which helped for some time,
> > but we now encounter the problem there as well. The crash below is
> > from 12.2.1, which we had switched to because it seemed to work better.
>
> It looks/sounds like the ABI for C++ linkage broke between the point
> releases. This is really easy to trigger, unfortunately, due to the
> design of the C++ interface. A recompile of the application
> against the updated headers should fix it.
>
> I see two problematic commits:
> 2ef222a58c3801eaac5a6d52dda2de1ffe37407b (mempool change)
> 0048e6a58c7cdf3b3d98df575bc47db8397cd5a9 (buffer::list change)
>
> Both are probably straightforward to fix... I'll push a test branch.
>
> I pushed a branch wip-abi-luminous to
> https://shaman.ceph.com/builds/ceph/
>
> You can either try that build and see if it fixes it, and/or rebuild your
> application. Please let us know if either works!
>
> Thanks-
> sage
>
> > We had a crash in a command line tool of our application, so the
> > context of the crash is rather simple. The segfault happens in a
> > Rados thread, where apparently pick_a_shard() delivered a wrong address:
> >
> > #0  operator+= (__i=1, this=0x8345dbd0a008) at /usr/include/c++/4.8.2/bits/atomic_base.h:420
> > #1  mempool::pool_t::adjust_count (this=0x8345dbd0a000, items=items@entry=1, bytes=bytes@entry=4008) at /usr/src/debug/ceph-12.2.1/src/common/mempool.cc:85
> > #2  0x00007f544efa0755 in reassign_to_mempool (this=<optimized out>, this=<optimized out>, pool=1026552624) at /usr/src/debug/ceph-12.2.1/src/common/buffer.cc:207
> > #3  ceph::buffer::list::append (this=this@entry=0x7f543d2ff330, data=0xf64d89 "", len=len@entry=272) at /usr/src/debug/ceph-12.2.1/src/common/buffer.cc:1915
> > #4  0x00007f54443c7976 in AsyncConnection::_process_connection (this=this@entry=0xf61740) at /usr/src/debug/ceph-12.2.1/src/msg/async/AsyncConnection.cc:962
> > #5  0x00007f54443ca8a8 in AsyncConnection::process (this=0xf61740) at /usr/src/debug/ceph-12.2.1/src/msg/async/AsyncConnection.cc:838
> > #6  0x00007f54443dceb9 in EventCenter::process_events (this=this@entry=0xf1dda0, timeout_microseconds=<optimized out>, timeout_microseconds@entry=30000000, working_dur=working_dur@entry=0x7f543d2ffaf0) at /usr/src/debug/ceph-12.2.1/src/msg/async/Event.cc:409
> > #7  0x00007f54443e05ee in NetworkStack::__lambda4::operator() (__closure=0xf4af60) at /usr/src/debug/ceph-12.2.1/src/msg/async/Stack.cc:51
> > #8  0x00007f544d5ed2b0 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>) at ../../../../../libstdc++-v3/src/c++11/thread.cc:84
> > #9  0x00007f544e16ee25 in start_thread (arg=0x7f543d301700) at pthread_create.c:308
> > #10 0x00007f544cd5534d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
> >
> > (gdb) frame 1
> > #1  mempool::pool_t::adjust_count (this=0x8345dbd0a000, items=items@entry=1, bytes=bytes@entry=4008) at /usr/src/debug/ceph-12.2.1/src/common/mempool.cc:85
> > 85        shard->items += items;
> > (gdb) l
> > 80      }
> > 81
> > 82      void mempool::pool_t::adjust_count(ssize_t items, ssize_t bytes)
> > 83      {
> > 84        shard_t *shard = pick_a_shard();
> > 85        shard->items += items;
> > 86        shard->bytes += bytes;
> > 87      }
> > 88
> > 89      void mempool::pool_t::get_stats(
> > (gdb) p shard
> > $1 = (mempool::shard_t *) 0x8345dbd0a000
> > (gdb) p *shard
> > Cannot access memory at address 0x8345dbd0a000
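(To check my understanding of the mechanism: if the shard picker is inlined from the headers into the application, roughly like the sketch below, then an application and a library built with different shard geometry compute &shards[i] against incompatible layouts, which would explain the wild pointer. This is my paraphrase of the idea, not the actual Ceph code.)

#include <atomic>
#include <cstddef>
#include <pthread.h>

// Paraphrased sketch, not the Ceph source: a sharded counter whose
// geometry is baked into every compilation unit that includes the header.
struct shard_t {
  std::atomic<std::size_t> items{0};
  std::atomic<std::size_t> bytes{0};
};

static const std::size_t num_shards = 32;  // suppose this changed between point releases
static shard_t shards[num_shards];

shard_t *pick_a_shard() {
  // Map the calling thread onto a shard. If the caller was compiled with
  // a different num_shards (or a different sizeof(shard_t)) than the code
  // that allocated 'shards', &shards[i] lands outside the array -- the
  // same kind of wild pointer adjust_count() dereferences above.
  std::size_t i = (std::size_t)pthread_self() % num_shards;
  return &shards[i];
}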
> > The user and main thread is as follows (the listing is for frame 4):
> >
> > (gdb) thread 7
> > [Switching to thread 7 (Thread 0x7f545004e9c0 (LWP 31308))]
> > #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
> > 238     62:     movq    %rax, %r14
> > (gdb) bt
> > #0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
> > #1  0x00007f54442a2e3c in WaitUntil (when=..., mutex=..., this=0xe74190) at /usr/src/debug/ceph-12.2.1/src/common/Cond.h:64
> > #2  MonClient::authenticate (this=this@entry=0xe73d58, timeout=300) at /usr/src/debug/ceph-12.2.1/src/mon/MonClient.cc:464
> > #3  0x00007f544ef9058c in librados::RadosClient::connect (this=0xe73d10) at /usr/src/debug/ceph-12.2.1/src/librados/RadosClient.cc:299
> > #4  0x00007f544f73a1ee in cta::objectstore::BackendRados::BackendRados (this=0xe08680, logger=..., userId="eoscta", pool="eoscta_metadata", radosNameSpace="cta-ns") at /usr/src/debug/cta-0.0-85/objectstore/BackendRados.cpp:100
> > #5  0x00007f544f75eabc in cta::objectstore::BackendFactory::createBackend (URL="rados://eoscta@eoscta_metadata:cta-ns", logger=...) at /usr/src/debug/cta-0.0-85/objectstore/BackendFactory.cpp:42
> > #6  0x000000000041024e in main (argc=2, argv=0x7ffdde466e78) at /usr/src/debug/cta-0.0-85/objectstore/cta-objectstore-dump-object.cpp:44
> >
> > (gdb) l -
> > 75      #define TIMESTAMPEDPRINT(A)
> > 76      #define NOTIFYLOCKED()
> > 77      #define NOTIFYRELEASED()
> > 78      #endif
> > 79
> > 80      namespace cta { namespace objectstore {
> > 81
> > 82      cta::threading::Mutex BackendRados::RadosTimeoutLogger::g_mutex;
> > 83
> > 84      BackendRados::BackendRados(log::Logger & logger, const std::string & userId, const std::string & pool,
> > (gdb) l
> > 85          const std::string &radosNameSpace) :
> > 86        m_user(userId), m_pool(pool), m_namespace(radosNameSpace), m_cluster(), m_radosCtxPool() {
> > 87        log::LogContext lc(logger);
> > 88        cta::exception::Errnum::throwOnReturnedErrno(-m_cluster.init(userId.c_str()),
> > 89          "In ObjectStoreRados::ObjectStoreRados, failed to m_cluster.init");
> > 90        try {
> > 91          RadosTimeoutLogger rtl;
> > 92          cta::exception::Errnum::throwOnReturnedErrno(-m_cluster.conf_read_file(NULL),
> > 93            "In ObjectStoreRados::ObjectStoreRados, failed to m_cluster.conf_read_file");
> > 94          rtl.logIfNeeded("In BackendRados::BackendRados(): m_cluster.conf_read_file()", "no object");
> > (gdb) l
> > 95          rtl.reset();
> > 96          cta::exception::Errnum::throwOnReturnedErrno(-m_cluster.conf_parse_env(NULL),
> > 97            "In ObjectStoreRados::ObjectStoreRados, failed to m_cluster.conf_parse_env");
> > 98          rtl.logIfNeeded("In BackendRados::BackendRados(): m_cluster.conf_parse_env()", "no object");
> > 99          rtl.reset();
> > 100         cta::exception::Errnum::throwOnReturnedErrno(-m_cluster.connect(),
> > 101           "In ObjectStoreRados::ObjectStoreRados, failed to m_cluster.connect");
> > 102         rtl.logIfNeeded("In BackendRados::BackendRados(): m_cluster.connect()", "no object");
> > 103         // Create the connection pool. One per CPU hardware thread.
> > 104         for (size_t i=0; i<std::thread::hardware_concurrency(); i++) {
> >
> > Is there anything we are doing wrong, or is this a bug somewhere in rados?
> >
> > Thanks for any help,
> >
> > Eric Cano
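PS: in case it helps with testing the wip-abi-luminous build, a minimal client doing the same call sequence as our constructor should exercise the same path without pulling in the whole application. A sketch (user name taken from the backtrace above; the build line is indicative):

// repro.cpp -- minimal librados connection test.
// Build (sketch): g++ -std=c++11 repro.cpp -lrados -o repro
#include <rados/librados.hpp>
#include <iostream>

int main() {
  librados::Rados cluster;
  // Same user as in the backtrace; adjust as needed.
  if (cluster.init("eoscta") < 0) { std::cerr << "init failed\n"; return 1; }
  if (cluster.conf_read_file(nullptr) < 0 ||  // default ceph.conf locations
      cluster.conf_parse_env(nullptr) < 0) {
    std::cerr << "config failed\n";
    return 1;
  }
  // The crash above happens during this call, in a messenger thread.
  if (cluster.connect() < 0) { std::cerr << "connect failed\n"; return 1; }
  std::cout << "connected OK" << std::endl;
  cluster.shutdown();
  return 0;
}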