I am using librados in an application to read and write many small files (<128 MB) concurrently, both within the same process and across different processes (on many nodes). The application is built on TensorFlow (the read and write operations are custom kernels I wrote).
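For context, the access pattern in each kernel is roughly the sketch below (simplified, not my actual kernel code; the client id, pool name, object name, and payload are placeholders): each request stats the object and then either reads it or rewrites it in full, using the synchronous librados calls that show up in the stack trace below.

#include <rados/librados.hpp>
#include <iostream>

int main() {
  librados::Rados cluster;
  cluster.init("admin");                          // client id is a placeholder
  cluster.conf_read_file("/etc/ceph/ceph.conf");
  if (cluster.connect() < 0) {
    std::cerr << "connect failed" << std::endl;
    return 1;
  }

  librados::IoCtx io;
  cluster.ioctx_create("mypool", io);             // pool name is a placeholder

  // Stat the object, then read it if it exists or write it in full otherwise.
  uint64_t size = 0;
  time_t mtime = 0;
  if (io.stat("example-object", &size, &mtime) == 0) {
    librados::bufferlist bl;
    io.read("example-object", bl, size, 0);
  } else {
    librados::bufferlist bl;
    bl.append("payload");                         // real data comes from TensorFlow tensors
    io.write_full("example-object", bl);
  }

  io.close();
  cluster.shutdown();
  return 0;
}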
I'm having an issue with this application where, after a few minutes, all of my processes stop reading and writing to RADOS. In the debugger I can see that they are all waiting, with some variation of the following stack trace (edited for brevity), in various stat/read/write/write_full operations:
#0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 in Cond::Wait (this=this@entry=0x7f7f977dce20, mutex=...) at ./common/Cond.h:56
#2 in librados::IoCtxImpl::operate_read (this=this@entry=0x7f7ed40b4190, oid=..., o=o@entry=0x7f7f977dd050, pbl=pbl@entry=0x0, flags=flags@entry=0)
at librados/IoCtxImpl.cc:725
#3 in librados::IoCtxImpl::stat (this=0x7f7ed40b4190, oid=..., psize=psize@entry=0x7f7f977dd198, pmtime=pmtime@entry=0x7f7f977dd1a0) at librados/IoCtxImpl.cc:1238
#4 in librados::IoCtx::stat (this=0x7f7f977dd290, oid=..., psize=0x7f7f977dd198, pmtime=0x7f7f977dd1a0) at librados/librados.cc:1260
The application then proceeds to complete requests at a glacial pace (~3-5 per hour) indefinitely.
When I run the application at a very low level of concurrency, it works properly and this "lock up" doesn't happen.
All reads and writes go to a single pool from the same user. No object is modified concurrently by different requests (i.e. the requests are completely independent / embarrassingly parallel in my application).
How might I go about troubleshooting this? I'm not sure which logs to look at or what I'd be looking for (if it is even logged).
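The only client-side knob I'm aware of is turning up the librados debug levels on the cluster handle before connecting, along the lines of the sketch below (option names taken from the Ceph docs, the log path is a placeholder), but I don't know if that's the right place to look:

cluster.conf_set("debug_rados", "20");     // verbose librados logging
cluster.conf_set("debug_objecter", "20");  // per-op objecter activity
cluster.conf_set("debug_ms", "1");         // messenger / network layer
cluster.conf_set("log_file", "/tmp/librados-client.log");  // placeholder path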
I'm running Ceph 12.2.2; all machines are running Ubuntu 16.04.
--
Sam Whitlock