Re: RBD client wallclock profile during 4k random writes

2017-05-11 8:21 GMT+08:00 Mark Nelson <mnelson@xxxxxxxxxx>:
>
>
> On 05/10/2017 06:24 PM, Mark Nelson wrote:
>>
>>
>>
>> On 05/10/2017 05:31 PM, Jason Dillaman wrote:
>>>
>>> On Wed, May 10, 2017 at 6:10 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>>>>
>>>> 1) 7 - tp_librbd, line 82
>>>>
>>>> Lots of stuff going on here, but the big thing is all the time spent in
>>>> librbd::ImageCtx::write_to_cache.  70.2% of the total time in this
>>>> thread is
>>>> spent in ObjectCacher::writex with lots of nested stuff, but if you
>>>> look all
>>>> the way down on line 1293, another 11.8% of the time is spent in
>>>> Locker() and 1.5% is spent in ~Locker().
>>>
>>>
>>> Yes -- the ObjectCacher is long overdue for a re-write since it's
>>> single threaded. It looks like you were essentially performing
>>> writethrough as well. I'd imagine you would just be better off
>>> disabling the rbd cache when doing high-performance random write
>>> workloads since you are going to get zero benefit from the cache with
>>> that workload -- at least that's what I usually recommend.
>>>
>>
>> Often I do turn rbd cache off for bluestore testing.  This was an older
>> conf file where I inadvertently hadn't disabled it.  Still, it's an
>> unfortunate choice that has to be made, potentially by someone other
>> than the user running the workload. :/
>
>
> Yep, disabling rbd cache bumped 4K write IOPS from ~14K to ~31K, and closer
> to ~36-37K with a higher IO depth, at around 410% CPU usage.
>
> trace without rbd cache here:
>
> https://pastebin.com/t8FFsWNb
>
> Looking a lot better, though thread 5 (tp_librbd) is still pegging.  Still a
> little bit of locking in various places, and a bit more time has shifted
> into _calc_target.
>
> The async msgr is, as expected, a lot busier than it used to be.
>
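
Side note for anyone reproducing the numbers above: turning the cache off is
a one-line client-side change. A minimal ceph.conf sketch, assuming the
option is not being overridden elsewhere (e.g. by qemu's cache= setting):

    [client]
    rbd cache = false

The option is read when an image is opened, so existing clients need to
re-open the image (or restart) to pick it up.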

So it seems that calculating the pg mapping is an expensive operation,
especially when we have to retry the loop because of collisions and
rejections.

What about using a mapping cache table to look up the pg mapping directly
once it has been calculated? I think we only need to re-calculate the pg
mapping when the osdmap epoch changes; within a single osdmap epoch the
mapping should stay consistent.
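
Roughly the shape I have in mind, as a sketch only (this is not actual Ceph
code: PgId, OsdSet, crush_calc_mapping and PgMappingCache are placeholder
names standing in for pg_t, the acting set, and the CRUSH work behind
_calc_target):

#include <cstdint>
#include <unordered_map>
#include <vector>

using PgId = uint64_t;            // placeholder for pg_t
using OsdSet = std::vector<int>;  // acting set of OSD ids

// Stand-in for the expensive CRUSH calculation behind _calc_target();
// it only returns a fixed pattern here so the sketch is self-contained.
static OsdSet crush_calc_mapping(PgId pgid, uint32_t epoch) {
  (void)epoch;
  return { int(pgid % 10), int((pgid + 1) % 10), int((pgid + 2) % 10) };
}

class PgMappingCache {
public:
  const OsdSet& lookup(PgId pgid, uint32_t osdmap_epoch) {
    if (osdmap_epoch != epoch_) {
      // New osdmap: any cached mapping may be stale, so drop the table.
      cache_.clear();
      epoch_ = osdmap_epoch;
    }
    auto it = cache_.find(pgid);
    if (it == cache_.end()) {
      // First lookup for this pg in the current epoch: pay the CRUSH
      // cost once and remember the result.
      it = cache_.emplace(pgid, crush_calc_mapping(pgid, epoch_)).first;
    }
    return it->second;
  }

private:
  uint32_t epoch_ = 0;
  std::unordered_map<PgId, OsdSet> cache_;
};

The real thing would need its own locking (or a per-thread table) so the
contention does not simply move from _calc_target into the cache, but since
pg_num changes and CRUSH edits always come with a new osdmap epoch,
invalidating on epoch change should cover correctness.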

Regards
Ning Yao