After looking into “ThreadPool::_lock” and the related implementations in ImageRequestWQ, it turns out that the “lockdep” check in my vstart environment is the main reason the entire io-workqueue is slowed down. So I disabled “lockdep” and ran another round of experiments:

[1-2] With the cache disabled, internal io-worker performance improves from 16000 to 31000 IOPS when the number of op workers is increased from 1 to 8.

[3-4] With the cache enabled, internal worker performance is worse (IOPS down 37.5% with 1 worker), and adding workers decreases it further (internal IOPS down 60.6% with 4 workers).

ImageRequestWQ itself is no longer a bottleneck: the waiting time on “ThreadPool::_lock” is not significant, and I could not get better numbers by removing the blockers inside the lock.

I think the above results point to two major improvement directions:
a) A better cache design that can be driven by multiple threads.
b) Allowing multiple io-workers in librbd, which could potentially bring up to a 200% IOPS improvement in triggering RADOS writes to the OSDs.

[1-4] https://docs.google.com/document/d/1r8VJiTbs68X42Hncur48pPlZbL_yw8BTSdSKxBkrSPk/edit?usp=sharing

---------

I’m still looking forward to b) allowing multiple io-workers in librbd. https://github.com/ceph/ceph/pull/20482 seems to implement a sane destruction order when there are multiple workers. If it doesn’t fix all of the race conditions, what are the other scenarios? Are there any existing error logs, or any unit/integration tests, that I can refer to? I didn’t see any explicit failures during my experiments with the multiple-worker configuration.

--Yingxin
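
P.S. For anyone who wants to try the same setup, below is a minimal sketch of the knobs involved. The exact values and the fio job are my assumptions about a typical reproduction, not necessarily the parameters behind the numbers in [1-4]:

    # ceph.conf of the vstart cluster ([global] or [client] section)
    lockdep = false        ; disable the lockdep checks discussed above
    rbd cache = false      ; toggle to compare cache-enabled vs. cache-disabled runs
    rbd op threads = 8     ; number of librbd op/io worker threads (default is 1)

    # example fio job against an existing RBD image (assumed reproduction tool)
    [rbd-4k-randwrite]
    ioengine=rbd
    clientname=admin
    pool=rbd
    rbdname=testimg
    rw=randwrite
    bs=4k
    iodepth=32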