Re: troubleshooting librados error with concurrent requests

Can you provide the full backtrace? It kinda looks like you've left something out.

In general though, a Wait inside of an operate call just means the thread has submitted its request and is waiting for the answer to come back. It could be blocked locally or remotely. If it's blocked remotely, the OSDs should be reporting to the mon/mgr that they have slow requests, which you can observe in "ceph -w" or the cluster health output. If it's local, hmm, I'm not sure of the easiest way to debug it without just cranking up logging.
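For example, something like the following from any node with an admin keyring should tell you whether the cluster side is complaining (just a sketch; the exact health check names and message wording depend on your release):

    # watch the cluster log for slow/blocked request warnings
    ceph -w

    # or ask for current health detail; on Luminous slow requests
    # surface as a REQUEST_SLOW health check
    ceph health detail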
Generically, I'd use the admin socket on your clients to look at the status of in-flight requests, and to check the values of the throttle limits in the perf counters. If the requests are being handled slowly on the OSD, do the same there. That will probably give you some clues.
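A rough sketch of what I mean, assuming your librados clients have an admin socket enabled (the paths and names below are only placeholders; point them at wherever your client asok files actually live):

    # on a client node: dump the Objecter's in-flight requests
    ceph daemon /var/run/ceph/ceph-client.<name>.<pid>.asok objecter_requests

    # check the objecter throttle counters (the objecter_inflight_ops /
    # objecter_inflight_op_bytes limits) in the perf counters
    ceph daemon /var/run/ceph/ceph-client.<name>.<pid>.asok perf dump | grep -A 8 throttle-objecter

    # on an OSD that looks slow: see what it is currently working on
    ceph daemon osd.<id> dump_ops_in_flight
    ceph daemon osd.<id> dump_historic_ops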
-Greg

On Tue, May 1, 2018 at 5:20 AM Sam Whitlock <phynominal@xxxxxxxxx> wrote:
I am using librados in an application to read and write many small files (<128MB) concurrently, both in the same process and in different processes (across many nodes). The application is built on TensorFlow (the read and write operations are custom kernels I wrote).

I'm having an issue with this application where, after a few minutes, all of my processes stop reading and writing to RADOS. In debugging I can see that they're all waiting on various stat/read/write/write_full operations, with some variation of the following stack trace (edited for brevity):

#0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1 in Cond::Wait (this=this@entry=0x7f7f977dce20, mutex=...) at ./common/Cond.h:56
#2 in librados::IoCtxImpl::operate_read (this=this@entry=0x7f7ed40b4190, oid=..., o=o@entry=0x7f7f977dd050, pbl=pbl@entry=0x0, flags=flags@entry=0)
    at librados/IoCtxImpl.cc:725
#3 in librados::IoCtxImpl::stat (this=0x7f7ed40b4190, oid=..., psize=psize@entry=0x7f7f977dd198, pmtime=pmtime@entry=0x7f7f977dd1a0) at librados/IoCtxImpl.cc:1238
#4 in librados::IoCtx::stat (this=0x7f7f977dd290, oid=..., psize=0x7f7f977dd198, pmtime=0x7f7f977dd1a0) at librados/librados.cc:1260

The application then proceeds to complete requests at a glacial pace (~3-5 per hour) indefinitely.

When I run the application with a very low level of concurrency, it works properly. This "lock up" doesn't happen.

All reads and writes are to a single pool from the same user. No files are concurrently modified by different requests (i.e. completely independent / embarrassingly parallel architecture in my app).

How might I go about troubleshooting this? I'm not sure which logs to look at and what I might be looking for (if it is even logged).

I'm running Ceph 12.2.2; all machines are running Ubuntu 16.04.

--
Sam Whitlock
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
