Hi Xuehan,

Have a look at DBObjectMap::_lookup_map_header. It does the following
(a rough sketch of the flow is further below):

step 1. look up the object header in the in-memory cache; on a miss go
to step 2, otherwise return
step 2. look up the object header in leveldb; on a hit go to step 3,
otherwise return. This is the code
"db->get(HOBJECT_TO_SEQ, map_header_key(oid), &out);"
step 3. add the header to the in-memory cache

Steps 1 and 3 are cpu intensive, and step 2 is I/O intensive. I think
the bottleneck is in step 2; you can watch the cpu usage of the
filestore transaction apply threads to confirm it. So there are two
solutions to improve it:

1. raise the cache hit ratio: avoid injecting too many objects that
have omap k/v pairs; maybe you can use xattrs instead.
2. speed up step 2: separate the omap directory from the osd directory
and move leveldb onto an ssd via filestore_omap_backend_path (example
config below).
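Schematically, the lookup flow is something like the sketch below. This
is not the actual Ceph code: Header, HeaderCache and KeyValueDB are
illustrative stand-ins for DBObjectMap's header type, its in-memory
cache and the leveldb backend.

// Simplified sketch of the three-step _lookup_map_header flow above.
// NOT the actual Ceph code; all types here are stand-ins.
#include <cstdint>
#include <iostream>
#include <map>
#include <mutex>
#include <optional>
#include <string>

struct Header { uint64_t seq = 0; };

// stand-in for the in-memory header cache (DBObjectMap::caches)
struct HeaderCache {
  std::map<std::string, Header> entries;
  std::optional<Header> lookup(const std::string& oid) const {
    auto it = entries.find(oid);
    if (it == entries.end()) return std::nullopt;
    return it->second;
  }
  void add(const std::string& oid, const Header& h) { entries[oid] = h; }
};

// stand-in for the leveldb store behind db->get(HOBJECT_TO_SEQ, ...)
struct KeyValueDB {
  std::map<std::string, Header> kv;
  bool get(const std::string& key, Header* out) const {
    auto it = kv.find(key);
    if (it == kv.end()) return false;
    *out = it->second;
    return true;
  }
};

std::mutex header_lock;  // serializes the whole lookup, as in DBObjectMap
HeaderCache caches;
KeyValueDB db;

std::optional<Header> lookup_map_header(const std::string& oid) {
  std::lock_guard<std::mutex> l(header_lock);
  // step 1: in-memory cache, cpu only
  if (auto h = caches.lookup(oid))
    return h;
  // step 2: leveldb lookup, this is where the I/O happens
  Header h;
  if (!db.get(oid, &h))
    return std::nullopt;
  // step 3: populate the cache for the next lookup, cpu only
  caches.add(oid, h);
  return h;
}

int main() {
  db.kv["obj1"] = Header{42};
  lookup_map_header("obj1");           // cache miss, goes to "leveldb"
  auto h = lookup_map_header("obj1");  // now served from the cache
  std::cout << (h ? h->seq : 0) << "\n";
}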
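And for solution 2, the ceph.conf change would look roughly like this
(the ssd path is only an example; the existing omap data has to be
moved to the new location while the osd is stopped):

[osd]
    # keep the omap leveldb on a separate ssd, away from the osd data
    # dir; $id expands to the osd id in ceph.conf
    filestore_omap_backend_path = /ssd/ceph/osd.$id/omap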
thanks
ivan from eisoo

On Wed, Nov 29, 2017 at 11:11 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
> Thanks, Greg and yuxiang :-)
>
> We used mdtest on 70 nodes issuing "file creations" to cephfs, to test
> the maximum number of file creations that a cephfs instance with only
> one active mds can do. We found that, even after the test had
> finished, there were still a lot of I/Os on the data pool, and the osd
> was under heavy pressure, as shown in the attachments "apply_latency"
> and "op_queue_ops". This should be caused by storing the files'
> backtraces.
>
> We also used gdbprof to probe the OSD; the result is in the attachment
> "gdb.create.rocksdb.xfs.log". It shows that the filestore threads of
> the OSD spend most of their time waiting on DBObjectMap::header_lock,
> and that the only substantial work they actually get done is adding
> object headers to DBObjectMap::caches. Adding an object header can
> cause DBObjectMap::caches to trim, which requires computing the
> cache's size, an O(N) operation in the GNU STL list::size
> implementation. On the other hand, adding object headers is protected
> by DBObjectMap::header_lock, and our
> "filestore_omap_header_cache_size" is set to 204800, which is very
> large and makes the cache size computation take considerable time. So
> we think it may be appropriate to move the add-header operation out of
> the critical section guarded by "DBObjectMap::header_lock", or maybe
> some other mechanism for DBObjectMap::caches should be considered.
>
> Thanks :-)
>
> On 29 November 2017 at 06:59, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>> On Tue, Nov 28, 2017 at 1:51 AM, Xuehan Xu <xxhdx1985126@xxxxxxxxx> wrote:
>>> Hi, everyone.
>>>
>>> Recently, we did some stress tests on the mds. We found that, during
>>> the file creation test, the mdlog trim operations were very slow.
>>> After some debugging, we found that this could be due to the
>>> execution of the OSD's filestore threads being forced to be nearly
>>> sequential. This can be seen in the result of our gdbprof probing,
>>> which is attached to this email: the most time-consuming work of the
>>> filestore threads is the sizing of DBObjectMap::caches, and the
>>> reason for the sequential execution of the filestore threads is the
>>> locking of DBObjectMap::header_lock.
>>>
>>> After reading the corresponding source code, we found that
>>> MapHeaderLock already provides mutual exclusion for access to the
>>> omap object header. It seems that the locking of
>>> DBObjectMap::header_lock is not really necessary, or at least that
>>> it is not needed when adding the header to DBObjectMap::caches,
>>> which is what triggers the sizing of the cache.
>>>
>>> Is this right?
>>
>> I'm a bit confused; can you explain exactly what you're testing and
>> exactly what you're measuring that leads you to think the mutexes are
>> overly expensive?
>>
>> Note that there's both a header_lock and a cache_lock; in a quick skim
>> they don't seem to be in gratuitous use (unless there are some disk
>> accesses hiding underneath the header_lock?) and the idea of them
>> being a performance bottleneck has not come up before.
>> -Greg
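For reference, the cache trim Xuehan describes boils down to the
pattern sketched below. This is a minimal illustration under the stated
assumption of a pre-C++11 libstdc++ (where std::list::size() walks the
whole list), not the actual DBObjectMap cache code; SlowCache and
FastCache are made-up names.

// Minimal illustration of the trim cost. On a pre-C++11 libstdc++,
// std::list::size() is O(N), so a trim loop that consults size() on
// every insert makes each insert O(N); keeping an explicit element
// count keeps it O(1).
#include <cstddef>
#include <iostream>
#include <list>

struct SlowCache {
  std::list<int> lru;
  std::size_t max_size = 204800;  // mirrors filestore_omap_header_cache_size
  void add(int v) {
    lru.push_front(v);
    while (lru.size() > max_size)  // size() may walk the list: O(N)
      lru.pop_back();
  }
};

struct FastCache {
  std::list<int> lru;
  std::size_t count = 0;           // tracked explicitly: O(1) to check
  std::size_t max_size = 204800;
  void add(int v) {
    lru.push_front(v);
    ++count;
    while (count > max_size) {
      lru.pop_back();
      --count;
    }
  }
};

int main() {
  SlowCache slow;
  FastCache fast;
  for (int i = 0; i < 250000; ++i) {
    slow.add(i);  // O(N) per insert once the cache is full (old libstdc++)
    fast.add(i);  // O(1) per insert regardless of the library version
  }
  std::cout << fast.count << "\n";  // prints 204800
}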