Re: troubleshooting librados error with concurrent requests

Thank you! I will try what you suggested.

Here is the full backtrace:
#0  0x00007f816529b360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f80c716bb82 in Cond::Wait (this=this@entry=0x7f7f977dce20, mutex=...) at ./common/Cond.h:56
#2  0x00007f80c717656f in librados::IoCtxImpl::operate_read (this=this@entry=0x7f7ed40b4190, oid=..., o=o@entry=0x7f7f977dd050, pbl=pbl@entry=0x0, flags=flags@entry=0)
    at librados/IoCtxImpl.cc:725
#3  0x00007f80c7180032 in librados::IoCtxImpl::stat (this=0x7f7ed40b4190, oid=..., psize=psize@entry=0x7f7f977dd198, pmtime=pmtime@entry=0x7f7f977dd1a0) at librados/IoCtxImpl.cc:1238
#4  0x00007f80c7142606 in librados::IoCtx::stat (this=0x7f7f977dd290, oid=..., psize=0x7f7f977dd198, pmtime=0x7f7f977dd1a0) at librados/librados.cc:1260

My custom TensorFlow kernels:

#5  0x00007f80d0e8fd23 in tensorflow::CephReaderOp::CephReadObject (this=0x7f7e9c0607d0, file_key="ERR174324_1_05_20600000", name_space="ERR174324_1_05", ref_buffer=0x7f7f18000cd0, 
    io_ctx=...) at tensorflow/contrib/ceph_src/kernels/ceph_reader_op.cc:104
#6  0x00007f80d0e8f854 in tensorflow::CephReaderOp::Compute (this=0x7f7e9c0607d0, ctx=0x7f7f977dd8f0) at tensorflow/contrib/ceph_src/kernels/ceph_reader_op.cc:64

TensorFlow runtime:

#7  0x00007f80d873d5c1 in tensorflow::ThreadPoolDevice::Compute (this=0x7f812077a270, op_kernel=0x7f7e9c0607d0, context=0x7f7f977dd8f0)
    at tensorflow/core/common_runtime/threadpool_device.cc:60
#8  0x00007f80d86c80e4 in tensorflow::(anonymous namespace)::ExecutorState::Process (this=0x7f78d80014b0, tagged_node=..., scheduled_usec=0)
    at tensorflow/core/common_runtime/executor.cc:1658
#9  0x00007f80d86ca634 in tensorflow::(anonymous namespace)::ExecutorState::<lambda()>::operator()(void) const (__closure=0x7f7ed0000ee0) at tensorflow/core/common_runtime/executor.cc:2094
#10 0x00007f80d86d1882 in std::_Function_handler<void(), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(const TaggedNodeSeq&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/5/functional:1871
#11 0x00007f80d7f449f6 in std::function<void ()>::operator()() const (this=0x7f7ed0030ef0) at /usr/include/c++/5/functional:2267

TensorFlow threadpool:

#12 0x00007f80d8188c86 in tensorflow::thread::EigenEnvironment::ExecuteTask (this=0x7f81209e1778, t=...) at tensorflow/core/lib/core/threadpool.cc:83
#13 0x00007f80d818b6f4 in Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop (this=0x7f81209e1770, thread_id=46)
    at external/eigen_archive/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h:232
#14 0x00007f80d8189b46 in Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::NonBlockingThreadPoolTempl(int, bool, tensorflow::thread::EigenEnvironment)::{lambda()#1}::operator()() const () at external/eigen_archive/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h:65
#15 0x00007f80d818cd14 in std::_Function_handler<void (), Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::NonBlockingThreadPoolTempl(int, bool, tensorflow::thread::EigenEnvironment)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/5/functional:1871
#16 0x00007f80d7f449f6 in std::function<void ()>::operator()() const (this=0x7f8120aac960) at /usr/include/c++/5/functional:2267
#17 0x00007f80d81889d3 in tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}::operator()() const (__closure=0x7f8120aac960)
    at tensorflow/core/lib/core/threadpool.cc:56
#18 0x00007f80d818a970 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (
    __functor=...) at /usr/include/c++/5/functional:1871
#19 0x00007f80d7f449f6 in std::function<void ()>::operator()() const (this=0x7f8120aac9b8) at /usr/include/c++/5/functional:2267
#20 0x00007f80d81d1446 in std::_Bind_simple<std::function<void ()> ()>::_M_invoke<>(std::_Index_tuple<>) (this=0x7f8120aac9b8) at /usr/include/c++/5/functional:1531
#21 0x00007f80d81d13af in std::_Bind_simple<std::function<void ()> ()>::operator()() (this=0x7f8120aac9b8) at /usr/include/c++/5/functional:1520
#22 0x00007f80d81d134e in std::thread::_Impl<std::_Bind_simple<std::function<void ()> ()> >::_M_run() (this=0x7f8120aac9a0) at /usr/include/c++/5/thread:115
#23 0x00007f80d6c66c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#24 0x00007f81652956ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#25 0x00007f8164fcb41d in clone () from /lib/x86_64-linux-gnu/libc.so.6
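
For reference, following your suggestion below, this is roughly what I plan to check next. The socket path and OSD id here are only examples, and I still need to enable an admin socket for the librados clients first (e.g. "admin socket = /var/run/ceph/$cluster-$name.$pid.asok" in the [client] section of ceph.conf):

# on a client node: dump in-flight requests and the objecter throttle counters
ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok objecter_requests
ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok perf dump
# on an OSD node: see whether requests are stuck server-side
ceph daemon osd.0 dump_ops_in_flight
ceph daemon osd.0 dump_historic_ops
# cluster-wide: watch for slow request warnings
ceph -w

If that points at the client side rather than the OSDs, I'll also try turning up client-side logging (probably debug objecter / debug rados / debug ms) as you mentioned.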

On Tue, May 1, 2018 at 7:47 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
Can you provide the full backtrace? It kinda looks like you've left something out.

In general, though, a Wait inside of an operate call just means the thread has submitted its request and is waiting for the answer to come back. It could be blocked locally or remotely. If it's blocked remotely, the OSDs should be reporting to the mon/mgr that they have slow requests, which you can observe in "ceph -w" or the like. If it's blocked locally, hmm, I'm not sure of the easiest way to debug it without just cranking up logging.
Generically, I'd use the admin socket on your clients to look at the status of in-flight requests, and to check the values of the throttle limits in the perf counters. If the requests are being handled slowly on the OSDs, do the same there. That will probably give you some clues.
-Greg

On Tue, May 1, 2018 at 5:20 AM Sam Whitlock <phynominal@xxxxxxxxx> wrote:
I am using librados in an application to read and write many small files (<128 MB) concurrently, both within the same process and across different processes (on many nodes). The application is built on TensorFlow (the read and write operations are custom kernels I wrote).

I'm having an issue with this application where, after a few minutes, all of my processes stop reading from and writing to RADOS. When debugging, I can see that they're all waiting in various stat/read/write/write_full operations, with some variation of the following stack trace (edited for brevity):

#0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 in Cond::Wait (this=this@entry=0x7f7f977dce20, mutex=...) at ./common/Cond.h:56
#2 in librados::IoCtxImpl::operate_read (this=this@entry=0x7f7ed40b4190, oid=..., o=o@entry=0x7f7f977dd050, pbl=pbl@entry=0x0, flags=flags@entry=0)
    at librados/IoCtxImpl.cc:725
#3 in librados::IoCtxImpl::stat (this=0x7f7ed40b4190, oid=..., psize=psize@entry=0x7f7f977dd198, pmtime=pmtime@entry=0x7f7f977dd1a0) at librados/IoCtxImpl.cc:1238
#4 in librados::IoCtx::stat (this=0x7f7f977dd290, oid=..., psize=0x7f7f977dd198, pmtime=0x7f7f977dd1a0) at librados/librados.cc:1260

The application then proceeds to complete requests at a glacial pace (~3-5 per hour) indefinitely.

When I run the application with a very low level of concurrency, it works properly and this "lock-up" doesn't happen.

All reads and writes go to a single pool from the same user. No file is modified concurrently by different requests (i.e. the workload in my app is completely independent / embarrassingly parallel).

How might I go about troubleshooting this? I'm not sure which logs to look at and what I might be looking for (if it is even logged).

I'm running Ceph 12.2.2, with all machines on Ubuntu 16.04.

--
Sam Whitlock
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
