<sorry, forgot to add ceph-devel and Mahati>

2018-06-01 20:21 GMT+08:00 Jason Dillaman <jdillama@xxxxxxxxxx>:
> On Fri, Jun 1, 2018 at 4:20 AM, Yingxin Cheng <yingxincheng@xxxxxxxxx> wrote:
>
> Have you disabled the librbd in-memory cache during your tests? The
> cache has a giant global lock that causes plenty of thread contention.

I actually examined 3 giant global locks when the writethrough cache is
enabled:

a) librbd::io::ObjectDispatcher::mlock
   wait avg: 1.15us (1 worker) -> 8.58us (8 workers)
b) librbd::io::ImageCtx::snap_lock
   wait avg: 1.14us (1 worker) -> 7.09us (8 workers)
c) librbd::cache::ObjectCacherObjectDispatch::m_cache_lock
   wait avg: 0.75us (1 worker) -> 1385us (8 workers)

I think this is because the critical section of m_cache_lock is too
large: it prevents the expensive "writex" operations from executing
concurrently.

===

Surprisingly, even after the cache is disabled, the worker contention is
still there (see the updated graphs in
https://docs.google.com/document/d/1r8VJiTbs68X42Hncur48pPlZbL_yw8BTSdSKxBkrSPk/edit?usp=sharing).

So I turned to another global lock, "ThreadPool::_lock". According to
the implementation of `ThreadPool::worker(WorkThread *wt)`, each librbd
worker thread does 4 types of things related to this lock:

a) execute inside the critical section of _lock, including
   `_void_dequeue` of the item;
b) execute outside the critical section to `process()` the dequeued item;
c) wait to re-enter _lock after `process()` completes;
d) sleep and try to re-enter _lock when the workqueue is empty.

It is guaranteed that a) + b) + c) + d) account for 100% of the worker's
lifecycle. Here are the results when the cache is disabled:

      1 worker   2 workers   4 workers   10 workers
a)    11.14%     13.99%      15.38%      06.68%
b)    83.83%     79.11%      66.60%      21.44%
c)    05.03%     06.89%      15.40%      17.59%
d)    00.00%     00.00%      02.63%      54.29%

Further, the absolute locked time of "ThreadPool::_lock" takes 11.14%,
27.98%, 61.52%, 66.84% of the total recorded period with 1/2/4/10
workers. I think this implies that this implementation also needs to be
improved.

> The next known spot for thread contention is in librados since
> each per-OSD session has a lock, so the fewer OSDs you have, the
> higher the probability for IO contention.

I have 3 OSDs in this environment.

> Finally, within librados,
> all AIO completions are fired from a single thread -- so even if you
> are pumping data to the OSDs using 8 threads, you are only getting
> serialized completions.
>
> Just prior to Cephalocon I had created a test branch which switched
> the librados AIO completions to the fast-dispatcher path, which had a
> noticeable improvement in latency. Mahati (CCed) is also investigating
> librbd/librados performance.
>
> --
> Jason

I also have a question from when I tried to enable multiple RBD workers:
what's the status of http://tracker.ceph.com/issues/17379? Is it still
ongoing?

--Yingxin
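
P.S. To make the worker lifecycle above concrete, here is a minimal,
simplified sketch of the dequeue/process loop pattern I described (my own
illustration, not the actual ThreadPool code in Ceph; the class and member
names are made up). The comments mark the four phases a) through d):

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

// Simplified stand-in for the real ThreadPool; illustrative only.
class SimpleThreadPool {
  std::mutex _lock;                           // analogue of ThreadPool::_lock
  std::condition_variable _cond;
  std::deque<std::function<void()>> _queue;   // analogue of the workqueues
  bool _stop = false;

public:
  void worker() {
    std::unique_lock<std::mutex> l(_lock);
    while (!_stop) {
      if (!_queue.empty()) {
        // a) dequeue the item inside the critical section of _lock
        auto item = std::move(_queue.front());
        _queue.pop_front();

        l.unlock();
        item();      // b) process() the item outside the critical section
        l.lock();    // c) wait (contend) to re-enter _lock after processing
      } else {
        _cond.wait(l);  // d) sleep until new work arrives, then retake _lock
      }
    }
  }

  void queue(std::function<void()> fn) {
    std::lock_guard<std::mutex> g(_lock);
    _queue.push_back(std::move(fn));
    _cond.notify_one();
  }

  void stop() {
    std::lock_guard<std::mutex> g(_lock);
    _stop = true;
    _cond.notify_all();
  }
};

With N workers, phases a), c) and d) all serialize on the single _lock,
which is consistent with the numbers above: as workers are added, the
share of time spent in b) (useful work) shrinks while c) and d) grow.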