Re: ceph issue

On Wed, Nov 23, 2016 at 5:30 PM, Avner Ben Hanoch <avnerb@xxxxxxxxxxxx> wrote:
>
> I guess that like the rest of ceph, the new rdma code must also support multiple applications in parallel.
>
> I am also reproducing your error => 2 instances of fio can't run in parallel with ceph rdma.
>
> * ceph -s shows HEALTH_WARN (with "9 requests are blocked > 32 sec")
>
> * all osds print messages like "heartbeat_check: no reply from ..."
>
> * and the log files contain errors:
>   $ grep error ceph-osd.0.log
>   2016-11-23 09:20:46.988154 7f9b26260700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory
>   2016-11-23 09:20:54.090388 7f9b43951700  1 -- 36.0.0.2:6802/10634 >> 36.0.0.4:0/19587 conn(0x7f9b256a8000 :6802 s=STATE_OPEN pgs=1 cs=1 l=1).read_bulk reading from fd=139 : Unknown error -104
>   2016-11-23 09:20:58.411912 7f9b44953700  1 RDMAStack polling work request returned error for buffer(0x7f9b1fee21b0) status(12:RETRY_EXC_ERR
>   2016-11-23 09:20:58.411934 7f9b44953700  1 RDMAStack polling work request returned error for buffer(0x7f9b553d20d0) status(12:RETRY_EXC_ERR

error is "IBV_WC_RETRY_EXC_ERR (12) - Transport Retry Counter
Exceeded: The local transport timeout retry counter was exceeded while
trying to send this message. This means that the remote side didn't
send any Ack or Nack. If this happens when sending the first message,
usually this means that the connection attributes are wrong or the
remote side isn't in a state that it can respond to messages. If this
happens after sending the first message, usually it means that the
remote QP isn't available anymore. Relevant for RC QPs."
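
That status is what we see when reaping completions from the completion queue. For context, here is a minimal sketch of how such a failed work request shows up when polling with the plain verbs API (the cq handling and function name are illustrative only, not the actual RDMAStack code):

  #include <infiniband/verbs.h>
  #include <cstdio>

  // Illustrative only: drain the CQ and report failed work requests, the
  // kind of check behind the "polling work request returned error ...
  // status(12:RETRY_EXC_ERR" messages above.  'cq' is an already-created
  // ibv_cq.
  void poll_and_report(struct ibv_cq *cq)
  {
      struct ibv_wc wc[16];
      int n = ibv_poll_cq(cq, 16, wc);          // non-blocking poll
      for (int i = 0; i < n; i++) {
          if (wc[i].status != IBV_WC_SUCCESS) {
              // status 12 == IBV_WC_RETRY_EXC_ERR: the remote side never
              // ACKed, so the local transport retry counter ran out.
              fprintf(stderr, "wr_id=%llu failed: %s (%d)\n",
                      (unsigned long long)wc[i].wr_id,
                      ibv_wc_status_str(wc[i].status), wc[i].status);
          }
      }
  }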

We set the QP retry_cnt to 7 and the timeout to 14:

  // How long to wait before retrying if packet lost or server dead.
  // Supposedly the timeout is 4.096us*2^timeout.  However, the actual
  // timeout appears to be 4.096us*2^(timeout+1), so the setting
  // below creates a 135ms timeout.
  qpa.timeout = 14;

  // How many times to retry after timeouts before giving up.
  qpa.retry_cnt = 7;
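
For reference, a minimal sketch of how these attributes would be applied to an RC QP during the transition to RTS with ibv_modify_qp (the other fields such as sq_psn and max_rd_atomic are placeholders, not what Ceph actually sets, and 'qp' is the connected QP):

  // Illustrative sketch only.  With timeout = 14 the local ACK timeout is
  // 4.096us * 2^(14+1) ~= 134ms, and with retry_cnt = 7 the HCA retries
  // 7 times before the send completes with RETRY_EXC_ERR.
  struct ibv_qp_attr qpa = {};
  qpa.qp_state      = IBV_QPS_RTS;
  qpa.timeout       = 14;   // local ACK timeout exponent
  qpa.retry_cnt     = 7;    // transport retries on missing ACK/NAK
  qpa.rnr_retry     = 7;    // retries when the receiver has no posted buffer
  qpa.sq_psn        = 0;    // placeholder starting PSN
  qpa.max_rd_atomic = 1;    // placeholder

  int rc = ibv_modify_qp(qp, &qpa,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
  if (rc)
      fprintf(stderr, "ibv_modify_qp to RTS failed: %d\n", rc);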

Does this mean the receiver side is short of memory, or that it isn't polling work requests quickly enough?
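
On the receive-buffer side of that question: if the receiver really were failing to keep its receive queue stocked, the usual remedy is to repost a receive for every completion that is reaped. A minimal sketch with the plain verbs API (buf/buf_len/mr/qp come from an assumed, already-registered buffer pool; this is not the RDMAStack implementation):

  // Illustrative only: repost a receive buffer for each reaped completion
  // so the remote sender never finds the receive queue empty.
  struct ibv_sge sge = {};
  sge.addr   = (uintptr_t)buf;      // registered buffer from the assumed pool
  sge.length = buf_len;
  sge.lkey   = mr->lkey;

  struct ibv_recv_wr wr = {}, *bad_wr = nullptr;
  wr.wr_id   = (uintptr_t)buf;      // so the buffer can be found again on completion
  wr.sg_list = &sge;
  wr.num_sge = 1;

  if (ibv_post_recv(qp, &wr, &bad_wr))
      fprintf(stderr, "ibv_post_recv failed\n");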

>
>
>
> Command lines that I used:
>   ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=128K --numjobs=1 --clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g --name=1
>   ./fio --ioengine=rbd --invalidate=0 --rw=write --bs=128K --numjobs=1 --clientname=admin --pool=rbd --iodepth=128 --rbdname=img2g2 --name=1
>
> > -----Original Message-----
> > From: Marov Aleksey
> > Sent: Tuesday, November 22, 2016 17:59
> >
> > I didn't try this block size. But in my case fio crashed if I used more than one
> > job. With one job everything works fine. Is it worth deeper investigation?
--


