On Mon, Jun 4, 2018 at 2:08 AM, Yingxin Cheng <yingxincheng@xxxxxxxxx> wrote:
> <sorry, forgot to add ceph-devel and Mahati>
>
> 2018-06-01 20:21 GMT+08:00 Jason Dillaman <jdillama@xxxxxxxxxx>:
>> On Fri, Jun 1, 2018 at 4:20 AM, Yingxin Cheng <yingxincheng@xxxxxxxxx> wrote:
>>
>> Have you disabled the librbd in-memory cache during your tests? The
>> cache has a giant global lock that causes plenty of thread contention.
>
> I actually examined 3 giant global locks when the writethrough cache is
> enabled:
> a) librbd::io::ObjectDispatcher::mlock wait avg:
>    1.15us (1 worker) -> 8.58us (8 workers)
> b) librbd::io::ImageCtx::snap_lock wait avg:
>    1.14us (1 worker) -> 7.09us (8 workers)
> c) librbd::cache::ObjectCacherObjectDispatch::m_cache_lock wait avg:
>    0.75us (1 worker) -> 1385us (8 workers)
> I think it's because the critical section of m_cache_lock is too large:
> it prevents the expensive "writex" operations from executing concurrently.

Yup, that's the bottleneck I was talking about (since the current
in-memory cache isn't thread safe w/o a global lock).

> ===
> Surprisingly, even after the cache is disabled, the worker contention is
> still there (see the updated graphs in
> https://docs.google.com/document/d/1r8VJiTbs68X42Hncur48pPlZbL_yw8BTSdSKxBkrSPk/edit?usp=sharing).
> So I turned to another global lock, "ThreadPool::_lock".
>
> According to the implementation of `ThreadPool::worker(WorkThread *wt)`,
> each of the librbd worker threads has to do 4 types of things related to
> this lock:
> a) execute inside the critical section of _lock, including calling
>    `_void_dequeue` to get the item;
> b) execute outside the critical section to `process()` the dequeued item;
> c) wait to re-enter _lock after `process()` completes;
> d) sleep, then try to re-enter _lock, when the workqueue is empty.
>
> It is guaranteed that a) + b) + c) + d) account for 100% of the worker's
> lifecycle. Here are the results when the cache is disabled:
>       1 worker   2 workers   4 workers   10 workers
> a)    11.14%     13.99%      15.38%      06.68%
> b)    83.83%     79.11%      66.60%      21.44%
> c)    05.03%     06.89%      15.40%      17.59%
> d)    00.00%     00.00%      02.63%      54.29%
> Further, the absolute locked time of "ThreadPool::_lock" takes 11.14%,
> 27.98%, 61.52%, and 66.84% of the total recorded period with 1/2/4/10
> workers.
> I think this implies that this implementation also needs to be improved.

Definitely a good area to investigate/improve.

>> The next known spot for thread contention is in librados since each
>> per-OSD session has a lock, so the fewer OSDs you have, the higher the
>> probability for IO contention.
>
> I have 3 OSDs in this environment.
>
>> Finally, within librados, all AIO completions are fired from a single
>> thread -- so even if you are pumping data to the OSDs using 8 threads,
>> you are only getting serialized completions.
>>
>> Just prior to Cephalocon I had created a test branch which switched
>> the librados AIO completions to the fast dispatch path, which had a
>> noticeable improvement in latency. Mahati (CCed) is also investigating
>> librbd/librados performance.
>>
>> --
>> Jason
>
> I also have a question from when I tried to enable multiple RBD workers.
> What's the status of http://tracker.ceph.com/issues/17379? Is it still
> ongoing?

There is a PR w/ some fixes for random state machine race conditions that
can occur, but I think in practice additional race conditions still need
to be discovered and fixed [1].
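
For reference, a minimal C++ sketch of the single-lock worker loop described
above (not the actual `ThreadPool::worker()` code; `SimplePool` and its member
names are invented for illustration), with the four phases a)-d) marked in
comments:

// Minimal sketch of a single-lock thread pool. This is NOT the actual
// Ceph ThreadPool implementation; it only illustrates the four phases:
//   (a) dequeue inside the critical section of the shared lock,
//   (b) process the item outside the critical section,
//   (c) contend to re-enter the lock once processing completes,
//   (d) sleep on the condition variable while the queue is empty.
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct SimplePool {
  std::mutex lock;                       // analogous to the single ThreadPool::_lock
  std::condition_variable cond;
  std::queue<std::function<void()>> q;
  bool stopping = false;

  void worker() {
    std::unique_lock<std::mutex> l(lock);          // each iteration starts holding the lock
    while (!stopping) {
      if (!q.empty()) {
        auto item = std::move(q.front());          // (a) dequeue inside the critical section
        q.pop();
        l.unlock();
        item();                                    // (b) process outside the critical section
        l.lock();                                  // (c) wait to re-enter the lock
      } else {
        cond.wait(l);                              // (d) sleep until work is queued (or shutdown)
      }
    }
  }

  void queue_item(std::function<void()> fn) {
    std::lock_guard<std::mutex> g(lock);
    q.push(std::move(fn));
    cond.notify_one();
  }

  void stop() {                                    // note: does not drain remaining items
    std::lock_guard<std::mutex> g(lock);
    stopping = true;
    cond.notify_all();
  }
};

int main() {
  SimplePool pool;
  std::vector<std::thread> workers;
  for (int i = 0; i < 4; ++i)
    workers.emplace_back([&pool] { pool.worker(); });

  // Queue some dummy "I/O" work; with more workers, more of each worker's
  // time shifts out of (b) and into (c)/(d) as they contend for the lock,
  // which matches the percentages reported above.
  for (int i = 0; i < 1000; ++i)
    pool.queue_item([] {
      std::this_thread::sleep_for(std::chrono::microseconds(50));
    });

  std::this_thread::sleep_for(std::chrono::milliseconds(200));
  pool.stop();
  for (auto &t : workers)
    t.join();
  std::printf("done\n");
  return 0;
}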
> --Yingxin

[1] https://github.com/ceph/ceph/pull/20482

--
Jason