On Mon, Jun 4, 2018 at 2:08 AM, Yingxin Cheng <yingxincheng@xxxxxxxxx> wrote:
> <sorry, forgot to add ceph-devel and Mahati>
>
> 2018-06-01 20:21 GMT+08:00 Jason Dillaman <jdillama@xxxxxxxxxx>:
>> On Fri, Jun 1, 2018 at 4:20 AM, Yingxin Cheng <yingxincheng@xxxxxxxxx> wrote:
>>
>> Have you disabled the librbd in-memory cache during your tests? The
>> cache has a giant global lock that causes plenty of thread contention.
>
> I actually examined 3 giant global locks when the writethrough cache is
> enabled:
> a) librbd::io::ObjectDispatcher::mlock wait avg:
>    1.15us (1 worker) -> 8.58us (8 workers)
> b) librbd::io::ImageCtx::snap_lock wait avg:
>    1.14us (1 worker) -> 7.09us (8 workers)
> c) librbd::cache::ObjectCacherObjectDispatch::m_cache_lock wait avg:
>    0.75us (1 worker) -> 1385us (8 workers)
> I think it's because the critical section of m_cache_lock is too large:
> it prevents the expensive "writex" operations from executing concurrently.

Yup, that's the bottleneck I was talking about (since the current
in-memory cache isn't thread safe w/o a global lock).

> ===
> Surprisingly, even after the cache is disabled, the worker contention is
> still there (see the updated graphs in
> https://docs.google.com/document/d/1r8VJiTbs68X42Hncur48pPlZbL_yw8BTSdSKxBkrSPk/edit?usp=sharing).
> So I turned to another global lock, "ThreadPool::_lock".
>
> According to the implementation of `ThreadPool::worker(WorkThread *wt)`,
> each of the librbd worker threads has to do 4 types of things related to
> this lock:
> a) execute inside the critical section of _lock, including calling
>    `_void_dequeue` to get the item;
> b) execute outside the critical section to `process()` the dequeued item;
> c) wait to re-enter _lock after `process()` completes;
> d) sleep, then try to re-enter _lock, when the workqueue is empty.
>
> It is guaranteed that a) + b) + c) + d) account for 100% of the worker's
> lifecycle. Here are the results when the cache is disabled:
>       1 worker   2 workers   4 workers   10 workers
> a)    11.14%     13.99%      15.38%      06.68%
> b)    83.83%     79.11%      66.60%      21.44%
> c)    05.03%     06.89%      15.40%      17.59%
> d)    00.00%     00.00%      02.63%      54.29%
> Further, the absolute locked time of "ThreadPool::_lock" takes 11.14%,
> 27.98%, 61.52%, and 66.84% of the total recorded period with 1/2/4/10
> workers.
> I think this implies that this implementation also needs to be improved.

Definitely a good area to investigate/improve.

>> The next known spot for thread contention is in librados since each
>> per-OSD session has a lock, so the fewer OSDs you have, the higher the
>> probability for IO contention.
>
> I have 3 OSDs in this environment.
>
>> Finally, within librados, all AIO completions are fired from a single
>> thread -- so even if you are pumping data to the OSDs using 8 threads,
>> you are only getting serialized completions.
>>
>> Just prior to Cephalocon I had created a test branch which switched
>> the librados AIO completions to the fast dispatch path, which had a
>> noticeable improvement in latency. Mahati (CCed) is also investigating
>> librbd/librados performance.
>>
>> --
>> Jason
>
> I also have a question from when I tried to enable multiple RBD workers.
> What's the status of http://tracker.ceph.com/issues/17379? Is it still
> ongoing?

There is a PR w/ some fixes for random state machine race conditions that
can occur, but I think in practice additional race conditions still need
to be discovered and fixed [1].
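
For reference, a minimal C++ sketch of the single-lock worker loop described
above (not the actual `ThreadPool::worker()` code; `SimplePool` and its member
names are invented for illustration), with the four phases a)-d) marked in
comments:

// Minimal sketch of a single-lock thread pool. This is NOT the actual
// Ceph ThreadPool implementation; it only illustrates the four phases:
//   (a) dequeue inside the critical section of the shared lock,
//   (b) process the item outside the critical section,
//   (c) contend to re-enter the lock once processing completes,
//   (d) sleep on the condition variable while the queue is empty.
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct SimplePool {
  std::mutex lock;                       // analogous to the single ThreadPool::_lock
  std::condition_variable cond;
  std::queue<std::function<void()>> q;
  bool stopping = false;

  void worker() {
    std::unique_lock<std::mutex> l(lock);          // each iteration starts holding the lock
    while (!stopping) {
      if (!q.empty()) {
        auto item = std::move(q.front());          // (a) dequeue inside the critical section
        q.pop();
        l.unlock();
        item();                                    // (b) process outside the critical section
        l.lock();                                  // (c) wait to re-enter the lock
      } else {
        cond.wait(l);                              // (d) sleep until work is queued (or shutdown)
      }
    }
  }

  void queue_item(std::function<void()> fn) {
    std::lock_guard<std::mutex> g(lock);
    q.push(std::move(fn));
    cond.notify_one();
  }

  void stop() {                                    // note: does not drain remaining items
    std::lock_guard<std::mutex> g(lock);
    stopping = true;
    cond.notify_all();
  }
};

int main() {
  SimplePool pool;
  std::vector<std::thread> workers;
  for (int i = 0; i < 4; ++i)
    workers.emplace_back([&pool] { pool.worker(); });

  // Queue some dummy "I/O" work; with more workers, more of each worker's
  // time shifts out of (b) and into (c)/(d) as they contend for the lock,
  // which matches the percentages reported above.
  for (int i = 0; i < 1000; ++i)
    pool.queue_item([] {
      std::this_thread::sleep_for(std::chrono::microseconds(50));
    });

  std::this_thread::sleep_for(std::chrono::milliseconds(200));
  pool.stop();
  for (auto &t : workers)
    t.join();
  std::printf("done\n");
  return 0;
}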
> --Yingxin

[1] https://github.com/ceph/ceph/pull/20482

--
Jason