Hi Cephers,

I'm investigating the performance of the librbd aio_write datapath using distributed tracing across the entire Ceph cluster. The basic idea is to compute internal request throughputs from tracepoints in the Ceph source code; a bottleneck can then be identified by finding the code span where the throughput drops most significantly. (A rough sketch of this analysis is in the P.S. below.)

Here is one such drop, related to ImageRequestWQ. In my test cluster (latest dev code), I issued 1000 random 4K aio_write requests into librbd (roughly the loop in the second sketch below). The throughput at ImageRequestWQ::queue() reaches ~30000 IOPS, but the throughput at ImageRequestWQ::_void_dequeue() and the subsequent process() drops to only ~11000 IOPS [1]. This means the maximum internal consumption rate of the rbd worker is ~11000 IOPS in this scenario with the default setting rbd_op_threads = 1, which suggested a "not enough workers" problem.

So I then increased the number of workers to 8 (third sketch below). However, the throughput of _void_dequeue() did not increase; instead, it dropped to only ~3200 IOPS [2]. This implies heavy resource contention between the rbd op worker threads.

I'm trying to figure out the root causes of this problem, but first I want to ask: is there any existing related work in the community? Or any other information that could help narrow down the root causes?

[1][2] https://docs.google.com/document/d/1r8VJiTbs68X42Hncur48pPlZbL_yw8BTSdSKxBkrSPk/edit?usp=sharing

Thanks!
Yingxin
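
P.S. Three sketches for context. These are not from the actual tracing setup; all names, timestamps, and parameters in them are placeholders.

1) The bottleneck detection boils down to comparing per-tracepoint throughputs and flagging the largest drop between adjacent tracepoints. A minimal post-processing sketch, assuming each tracepoint yields a sorted list of event timestamps in seconds (the sample data below is made up):

    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    // Throughput (IOPS) over the span from first to last event.
    // Assumes timestamps are sorted ascending.
    static double iops(const std::vector<double>& ts) {
      if (ts.size() < 2)
        return 0.0;
      return (ts.size() - 1) / (ts.back() - ts.front());
    }

    int main() {
      // Placeholder data: (tracepoint name, event timestamps).
      std::vector<std::pair<std::string, std::vector<double>>> points = {
        {"ImageRequestWQ::queue",         {0.0, 0.1, 0.2, 0.3}},
        {"ImageRequestWQ::_void_dequeue", {0.0, 0.3, 0.6, 0.9}},
        {"ImageRequestWQ::process",       {0.0, 0.3, 0.6, 0.9}},
      };

      size_t worst = 0;
      double worst_ratio = 1.0;
      for (size_t i = 1; i < points.size(); ++i) {
        double prev = iops(points[i - 1].second);
        double cur  = iops(points[i].second);
        double ratio = (prev > 0.0) ? cur / prev : 1.0;
        printf("%-32s %10.1f IOPS (x%.2f vs previous)\n",
               points[i].first.c_str(), cur, ratio);
        // Track the tracepoint with the sharpest relative drop.
        if (ratio < worst_ratio) {
          worst_ratio = ratio;
          worst = i;
        }
      }
      printf("largest drop entering: %s\n", points[worst].first.c_str());
      return 0;
    }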
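
2) The workload itself is nothing exotic; something along these lines via the public librbd C++ API reproduces it (pool name, image name, and client id are placeholders; build with -lrbd -lrados):

    #include <cstdlib>
    #include <string>
    #include <vector>
    #include <rados/librados.hpp>
    #include <rbd/librbd.hpp>

    int main() {
      librados::Rados rados;
      rados.init("admin");            // client id (placeholder)
      rados.conf_read_file(nullptr);  // default ceph.conf search path
      rados.connect();

      librados::IoCtx ioctx;
      rados.ioctx_create("rbd", ioctx);   // pool name (placeholder)

      librbd::RBD rbd;
      librbd::Image image;
      rbd.open(ioctx, image, "testimg"); // image name (placeholder)

      uint64_t size = 0;
      image.size(&size);

      const size_t kBlock = 4096;
      librados::bufferlist bl;
      bl.append(std::string(kBlock, 'x'));

      // Issue 1000 random 4K-aligned aio_writes, then wait for all.
      std::vector<librbd::RBD::AioCompletion*> comps;
      for (int i = 0; i < 1000; ++i) {
        uint64_t off = (rand() % (size / kBlock)) * kBlock;
        auto *c = new librbd::RBD::AioCompletion(nullptr, nullptr);
        comps.push_back(c);
        image.aio_write(off, kBlock, bl, c);
      }
      for (auto *c : comps) {
        c->wait_for_complete();
        c->release();
      }

      image.close();
      rados.shutdown();
      return 0;
    }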
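
3) The worker-count change, assuming I have the option name right (rbd_op_threads, default 1, which sizes the ImageRequestWQ thread pool), is just a client-side config tweak:

    # client-side ceph.conf (placeholder section; 8 was my test value)
    [client]
        rbd op threads = 8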