Hi Xuehan,

Have a look at DBObjectMap::_lookup_map_header. It does the following
(a rough sketch of the flow is further below):

step 1. look up the object header in the in-memory cache; on a miss go
to step 2, otherwise return
step 2. look up the object header in leveldb; on a hit go to step 3,
otherwise return. This is the code
"db->get(HOBJECT_TO_SEQ, map_header_key(oid), &out);"
step 3. add the header to the in-memory cache

Steps 1 and 3 are cpu intensive, and step 2 is I/O intensive. I think
the bottleneck is in step 2; you can watch the cpu usage of the
filestore transaction apply threads to confirm it. So there are two
solutions to improve it:

1. raise the cache hit ratio: avoid injecting too many objects that
have omap k/v pairs; maybe you can use xattrs instead.
2. speed up step 2: separate the omap directory from the osd directory
and move leveldb onto an ssd via filestore_omap_backend_path (example
config below).
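Schematically, the lookup flow is something like the sketch below. This
is not the actual Ceph code: Header, HeaderCache and KeyValueDB are
illustrative stand-ins for DBObjectMap's header type, its in-memory
cache and the leveldb backend.

// Simplified sketch of the three-step _lookup_map_header flow above.
// NOT the actual Ceph code; all types here are stand-ins.
#include <cstdint>
#include <iostream>
#include <map>
#include <mutex>
#include <optional>
#include <string>

struct Header { uint64_t seq = 0; };

// stand-in for the in-memory header cache (DBObjectMap::caches)
struct HeaderCache {
  std::map<std::string, Header> entries;
  std::optional<Header> lookup(const std::string& oid) const {
    auto it = entries.find(oid);
    if (it == entries.end()) return std::nullopt;
    return it->second;
  }
  void add(const std::string& oid, const Header& h) { entries[oid] = h; }
};

// stand-in for the leveldb store behind db->get(HOBJECT_TO_SEQ, ...)
struct KeyValueDB {
  std::map<std::string, Header> kv;
  bool get(const std::string& key, Header* out) const {
    auto it = kv.find(key);
    if (it == kv.end()) return false;
    *out = it->second;
    return true;
  }
};

std::mutex header_lock;  // serializes the whole lookup, as in DBObjectMap
HeaderCache caches;
KeyValueDB db;

std::optional<Header> lookup_map_header(const std::string& oid) {
  std::lock_guard<std::mutex> l(header_lock);
  // step 1: in-memory cache, cpu only
  if (auto h = caches.lookup(oid))
    return h;
  // step 2: leveldb lookup, this is where the I/O happens
  Header h;
  if (!db.get(oid, &h))
    return std::nullopt;
  // step 3: populate the cache for the next lookup, cpu only
  caches.add(oid, h);
  return h;
}

int main() {
  db.kv["obj1"] = Header{42};
  lookup_map_header("obj1");           // cache miss, goes to "leveldb"
  auto h = lookup_map_header("obj1");  // now served from the cache
  std::cout << (h ? h->seq : 0) << "\n";
}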
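And for solution 2, the ceph.conf change would look roughly like this
(the ssd path is only an example; the existing omap data has to be
moved to the new location while the osd is stopped):

[osd]
    # keep the omap leveldb on a separate ssd, away from the osd data
    # dir; $id expands to the osd id in ceph.conf
    filestore_omap_backend_path = /ssd/ceph/osd.$id/omap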
thanks
ivan from eisoo

On Wed, Nov 29, 2017 at 11:11 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
> Thanks, Greg and yuxiang :-)
>
> We used mdtest on 70 nodes issuing "file creations" to cephfs, to test
> the maximum number of file creations that a cephfs instance with only
> one active mds can do. We found that, even after the test had
> finished, there were still a lot of I/Os on the data pool, and the osd
> was under heavy pressure, as shown in the attachments "apply_latency"
> and "op_queue_ops". This should be caused by storing the files'
> backtraces.
>
> We also used gdbprof to probe the OSD; the result is in the attachment
> "gdb.create.rocksdb.xfs.log". It shows that the filestore threads of
> the OSD spend most of their time waiting on DBObjectMap::header_lock,
> and that the only substantial work they actually get done is adding
> object headers to DBObjectMap::caches. Adding an object header can
> cause DBObjectMap::caches to trim, which requires computing the
> cache's size, an O(N) operation in the GNU STL list::size
> implementation. On the other hand, adding object headers is protected
> by DBObjectMap::header_lock, and our
> "filestore_omap_header_cache_size" is set to 204800, which is very
> large and makes the cache size computation take considerable time. So
> we think it may be appropriate to move the add-header operation out of
> the critical section guarded by "DBObjectMap::header_lock", or maybe
> some other mechanism for DBObjectMap::caches should be considered.
>
> Thanks :-)
>
> On 29 November 2017 at 06:59, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>> On Tue, Nov 28, 2017 at 1:51 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>> Hi, everyone.
>>>
>>> Recently, we did some stress tests on the mds. We found that, during
>>> the file creation test, the mdlog trim operations were very slow.
>>> After some debugging, we found that this could be due to the
>>> execution of the OSD's filestore threads being forced to be nearly
>>> sequential. This can be seen in the result of our gdbprof probing,
>>> which is attached to this email: the most time-consuming work of the
>>> filestore threads is the sizing of DBObjectMap::caches, and the
>>> reason for the sequential execution of the filestore threads is the
>>> locking of DBObjectMap::header_lock.
>>>
>>> After reading the corresponding source code, we found that
>>> MapHeaderLock already provides mutual exclusion for access to the
>>> omap object header. It seems that the locking of
>>> DBObjectMap::header_lock is not really necessary, or at least that
>>> it is not needed when adding the header to DBObjectMap::caches,
>>> which is what triggers the sizing of the cache.
>>>
>>> Is this right?
>>
>> I'm a bit confused; can you explain exactly what you're testing and
>> exactly what you're measuring that leads you to think the mutexes are
>> overly expensive?
>>
>> Note that there's both a header_lock and a cache_lock; in a quick skim
>> they don't seem to be in gratuitous use (unless there are some disk
>> accesses hiding underneath the header_lock?) and the idea of them
>> being a performance bottleneck has not come up before.
>> -Greg
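For reference, the cache trim Xuehan describes boils down to the
pattern sketched below. This is a minimal illustration under the stated
assumption of a pre-C++11 libstdc++ (where std::list::size() walks the
whole list), not the actual DBObjectMap cache code; SlowCache and
FastCache are made-up names.

// Minimal illustration of the trim cost. On a pre-C++11 libstdc++,
// std::list::size() is O(N), so a trim loop that consults size() on
// every insert makes each insert O(N); keeping an explicit element
// count keeps it O(1).
#include <cstddef>
#include <iostream>
#include <list>

struct SlowCache {
  std::list<int> lru;
  std::size_t max_size = 204800;  // mirrors filestore_omap_header_cache_size
  void add(int v) {
    lru.push_front(v);
    while (lru.size() > max_size)  // size() may walk the list: O(N)
      lru.pop_back();
  }
};

struct FastCache {
  std::list<int> lru;
  std::size_t count = 0;           // tracked explicitly: O(1) to check
  std::size_t max_size = 204800;
  void add(int v) {
    lru.push_front(v);
    ++count;
    while (count > max_size) {
      lru.pop_back();
      --count;
    }
  }
};

int main() {
  SlowCache slow;
  FastCache fast;
  for (int i = 0; i < 250000; ++i) {
    slow.add(i);  // O(N) per insert once the cache is full (old libstdc++)
    fast.add(i);  // O(1) per insert regardless of the library version
  }
  std::cout << fast.count << "\n";  // prints 204800
}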