Hi Haomai, Alexey

With the latest async/rdma code I don't see the fio errors (neither for multiple fio instances nor for big block sizes) - thanks for your work Haomai.

Alexey - do you still see any issue with fio?

Regards,
Avner

> -----Original Message-----
> From: Haomai Wang [mailto:haomai@xxxxxxxx]
> Sent: Friday, December 02, 2016 05:12
> To: Avner Ben Hanoch <avnerb@xxxxxxxxxxxx>
> Cc: Marov Aleksey <Marov.A@xxxxxxxxxx>; Sage Weil <sweil@xxxxxxxxxx>; ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: ceph issue
>
> On Wed, Nov 23, 2016 at 5:30 PM, Avner Ben Hanoch <avnerb@xxxxxxxxxxxx> wrote:
> >
> > I guess that, like the rest of ceph, the new rdma code must also support multiple applications in parallel.
> >
> > I am also reproducing your error => 2 instances of fio can't run in parallel with ceph rdma:
> >
> > * ceph -s shows HEALTH_WARN (with "9 requests are blocked > 32 sec")
> >
> > * all osds print messages like "heartbeat_check: no reply from ..."
> >
> > * the log files contain errors:
> >
> > $ grep error ceph-osd.0.log
> > 2016-11-23 09:20:46.988154 7f9b26260700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory
> > 2016-11-23 09:20:54.090388 7f9b43951700 1 -- 36.0.0.2:6802/10634 >> 36.0.0.4:0/19587 conn(0x7f9b256a8000 :6802 s=STATE_OPEN pgs=1 cs=1 l=1).read_bulk reading from fd=139 : Unknown error -104
> > 2016-11-23 09:20:58.411912 7f9b44953700 1 RDMAStack polling work request returned error for buffer(0x7f9b1fee21b0) status(12:RETRY_EXC_ERR
> > 2016-11-23 09:20:58.411934 7f9b44953700 1 RDMAStack polling work request returned error for buffer(0x7f9b553d20d0) status(12:RETRY_EXC_ERR
>
> The error is "IBV_WC_RETRY_EXC_ERR (12) - Transport Retry Counter Exceeded: The local transport timeout retry counter was exceeded while trying to send this message. This means that the remote side didn't send any Ack or Nack. If this happens when sending the first message, it usually means that the connection attributes are wrong or the remote side isn't in a state that it can respond to messages. If this happens after sending the first message, it usually means that the remote QP isn't available anymore. Relevant for RC QPs."
>
> We set the qp retry_cnt to 7 and the timeout to 14:
>
> // How long to wait before retrying if packet lost or server dead.
> // Supposedly the timeout is 4.096us*2^timeout. However, the actual
> // timeout appears to be 4.096us*2^(timeout+1), so the setting
> // below creates a 135ms timeout.
> qpa.timeout = 14;
>
> // How many times to retry after timeouts before giving up.
> qpa.retry_cnt = 7;
>
> Does this mean that the receiver side lacks memory or isn't polling work requests ASAP?
>
> > Command lines that I used:
> > ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=128K --numjobs=1 --clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g --name=1
> > ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=128K --numjobs=1 --clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g2 --name=1
> >
> > > -----Original Message-----
> > > From: Marov Aleksey
> > > Sent: Tuesday, November 22, 2016 17:59
> > >
> > > I didn't try this block size. But in my case fio crashed if I use more than one job. With one job everything works fine. Is it worth deeper investigation?
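
For reference, the qpa.timeout and qpa.retry_cnt values quoted above are the standard RC QP attributes applied when moving the queue pair to RTS. Below is a minimal sketch of how such values are programmed with libibverbs; the qp handle, PSN and rd_atomic values are placeholders, not the actual Ceph async/rdma code. With timeout = 14 the spec formula 4.096us * 2^timeout gives roughly 67ms per attempt (about 135ms if the hardware effectively uses 2^(timeout+1), as the comment in the thread suggests), and retry_cnt = 7 means the send is retransmitted up to 7 times before completing with IBV_WC_RETRY_EXC_ERR.

// Sketch only: programming the RC QP ack-timeout and retry attributes with
// libibverbs during the RTR->RTS transition.  The qp handle, PSN and
// rd_atomic values are placeholders, not the actual Ceph async/rdma code.
#include <infiniband/verbs.h>
#include <cstring>
#include <cstdio>

static int move_qp_to_rts(struct ibv_qp *qp)
{
  struct ibv_qp_attr qpa;
  memset(&qpa, 0, sizeof(qpa));

  qpa.qp_state = IBV_QPS_RTS;

  // Ack timeout exponent: wait 4.096us * 2^timeout for an Ack/Nack.
  // timeout = 14 -> ~67ms per attempt (or ~135ms if the HCA effectively
  // behaves as 2^(timeout+1), as noted in the thread).
  qpa.timeout = 14;

  // Retransmit up to 7 times on ack timeout before the work request
  // completes with IBV_WC_RETRY_EXC_ERR (status 12).
  qpa.retry_cnt = 7;

  // Retries when the peer answers "receiver not ready" (no receive buffer
  // posted); exhausting these yields IBV_WC_RNR_RETRY_EXC_ERR (13), not 12.
  qpa.rnr_retry = 7;        // 7 == retry indefinitely

  qpa.sq_psn = 0;           // placeholder initial send PSN
  qpa.max_rd_atomic = 1;    // placeholder outstanding RDMA reads/atomics

  int r = ibv_modify_qp(qp, &qpa,
                        IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                        IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                        IBV_QP_MAX_QP_RD_ATOMIC);
  if (r)
    fprintf(stderr, "ibv_modify_qp(RTS) failed: %d\n", r);
  return r;
}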
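
On Haomai's question about whether status 12 indicates the receiver lacking memory or not polling fast enough: in standard ibverbs semantics, a receiver with no receive buffer posted answers with an RNR NAK, and exhausting rnr_retry surfaces on the sender as IBV_WC_RNR_RETRY_EXC_ERR (13); IBV_WC_RETRY_EXC_ERR (12) means no Ack or Nack came back at all, which points more at the remote QP being gone or in an error state than at a slow receiver. The sketch below shows the kind of completion-poll loop that yields such an error status; the function and variable names are illustrative, not the actual RDMAStack polling code.

// Sketch only: draining a send completion queue and classifying failed work
// requests.  Names are illustrative, not the actual RDMAStack code (Ceph
// formats status 12 as "RETRY_EXC_ERR" in its own log line above).
#include <infiniband/verbs.h>
#include <cstdio>

static void drain_tx_cq(struct ibv_cq *cq)
{
  struct ibv_wc wc[16];
  int n;
  while ((n = ibv_poll_cq(cq, 16, wc)) > 0) {
    for (int i = 0; i < n; ++i) {
      if (wc[i].status == IBV_WC_SUCCESS)
        continue;

      fprintf(stderr, "work request 0x%llx returned error: status(%d:%s)\n",
              (unsigned long long)wc[i].wr_id,
              wc[i].status, ibv_wc_status_str(wc[i].status));

      if (wc[i].status == IBV_WC_RETRY_EXC_ERR) {
        // No Ack/Nack from the peer at all: remote QP gone or in error,
        // wrong connection attributes, or the link dropped.
      } else if (wc[i].status == IBV_WC_RNR_RETRY_EXC_ERR) {
        // Peer kept answering "receiver not ready": it is alive but had no
        // receive buffer posted -- the out-of-memory / slow-posting case.
      }
    }
  }
  if (n < 0)
    fprintf(stderr, "ibv_poll_cq failed: %d\n", n);
}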