Hi Sage and Mark,

A question here: OMAP pg log keys are added by "set"; are they only
deleted by rm_range_keys in BlueStore?
https://github.com/ceph/ceph/pull/18279/files
If yes, then during dedup we may not need to compare the keys against all
memtables; we could just compare the keys in the current memtable with the
rm_range_keys in later memtables. (A rough sketch of the check I have in
mind is in the P.S. at the bottom of this mail.)

On Tue, Oct 17, 2017 at 10:18 AM, xiaoyan li <wisher2003@xxxxxxxxx> wrote:
> Hi Sage and Mark,
> The following test results are based on KV sequences captured from
> librbd+fio 4k or 16k random writes over 30 mins.
> In my opinion, we may use the dedup flush style for onodes and deferred
> data, but use the default merge flush style for other data.
>
> On Mon, Oct 16, 2017 at 9:50 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>>
>> On 10/16/2017 08:28 AM, Sage Weil wrote:
>>>
>>> [adding ceph-devel]
>>>
>>> On Mon, 16 Oct 2017, Mark Nelson wrote:
>>>>
>>>> Hi Lisa,
>>>>
>>>> Excellent testing! This is exactly what we were trying to understand.
>>>>
>>>> On 10/16/2017 12:55 AM, Li, Xiaoyan wrote:
>>>>>
>>>>> Hi Mark,
>>>>>
>>>>> Based on my testing, when setting min_write_buffer_number_to_merge
>>>>> to 2, the onodes and deferred data written into L0 SSTs can be
>>>>> decreased a lot with my rocksdb dedup package.
>>>>>
>>>>> But omap data needs to span more memtables. I tested omap data in a
>>>>> separate column family. From the data, you can see that when
>>>>> min_write_buffer_number_to_merge is set to 4, the data written into
>>>>> L0 SSTs is good. That means the current memtable being flushed has to
>>>>> be compared with the later 3 memtables recursively.
>>>>> kFlushStyleDedup is the new flush style in my rocksdb dedup package.
>>>>> kFlushStyleMerge is the current flush style in the master branch.
>>>>>
>>>>> But this only considers the data written into L0. With more memtables
>>>>> to compare, it costs more CPU and computing time.
>>>>>
>>>>> Memtable size: 256MB
>>>>> max_write_buffer_number  min_write_buffer_number_to_merge  flush_style       Omap data written into L0 SST (unit: MB)
>>>>> 16                       8                                 kFlushStyleMerge   7665
>>>>> 16                       8                                 kFlushStyleDedup   3770
>>>>> 8                        4                                 kFlushStyleMerge  11470
>>>>> 8                        4                                 kFlushStyleDedup   3922
>>>>> 6                        3                                 kFlushStyleMerge  14059
>>>>> 6                        3                                 kFlushStyleDedup   5001
>>>>> 4                        2                                 kFlushStyleMerge  18683
>>>>> 4                        2                                 kFlushStyleDedup  15394
>>>>
>>>> Is this only omap data or all data? It looks like 6/3 or 8/4 is still
>>>> probably the optimal point (and the improvements are quite noticeable!).
> This is only omap data. Dedup can decrease the data written into L0 SSTs,
> but it needs to compare too many memtables.
>
>>>> Sadly we were hoping we might be able to get away with smaller
>>>> memtables (say 64MB) with kFlushStyleDedup. It looks like that might
>>>> not be the case unless we increase the number very high.
>>>>
>>>> Sage, is this going to be even worse if we try to keep more pglog
>>>> entries around on flash OSD backends?
>>>
>>> I think there are three or more factors at play here:
>>>
>>> 1- If we reduce the memtable size, the CPU cost of insertion (baseline)
>>> and the dedup cost will go down.
>>>
>>> 2- If we switch to a small min pg log entry count, then most pg log keys
>>> *will* fall into the smaller window (of small memtables * small
>>> min_write_buffer_number_to_merge). The dup op keys probably won't,
>>> though... except maybe they will, because the values are small and more
>>> of them will fit into the memtables. But then
>>>
>>> 3- If we have more keys and smaller values, then the CPU overhead will
>>> be higher again.
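A quick note on the knobs above, so we are comparing the same things: these
are the options I vary in my tests. write_buffer_size,
max_write_buffer_number and min_write_buffer_number_to_merge are stock
rocksdb ColumnFamilyOptions; the flush style switch only exists in my dedup
branch, and I am writing it as flush_style here just for illustration (the
real name in the patch may differ):

  #include <rocksdb/options.h>

  rocksdb::ColumnFamilyOptions cf_opts;
  cf_opts.write_buffer_size = 256 << 20;         // memtable size, 256MB in the table above
  cf_opts.max_write_buffer_number = 8;           // memtables allowed to pile up before writes stall
  cf_opts.min_write_buffer_number_to_merge = 4;  // memtables merged/deduped together at flush
  // only in my dedup branch, not in upstream rocksdb (name illustrative):
  // cf_opts.flush_style = rocksdb::kFlushStyleDedup;

In BlueStore the stock values can also be injected through the
bluestore_rocksdb_options string instead of code.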
>>>
>>> For PG logs, I didn't really expect that the dedup style would help; I
>>> was only thinking about the deferred keys. I wonder if it would make
>>> sense to specify a handful of key prefixes to attempt dedup on, and not
>>> bother on the others?
>>
>> Deferred keys seem to be a much smaller part of the problem right now
>> than pglog, at least based on what I'm seeing at the moment with NVMe
>> testing. Regarding dedup, it's probably worth testing at the very least.
> I did the following tests with all data in the default column family: set
> min_write_buffer_number_to_merge to 2 and check the size of each kind of
> data written into L0 SST files.
> From the data, onode and deferred data can be reduced a lot with the
> dedup style.
>
> Data written into L0 SST files:
>
> 4k random writes (unit: MB)
> FlushStyle  Omap      onodes    deferred  others
> merge       22431.56  23224.54  1530.105  0.906106
> dedup       22188.28  14161.18  12.68681  0.90906
>
> 16k random writes (unit: MB)
> FlushStyle  Omap      onodes    deferred  others
> merge       19260.20  8230.02   0         1914.50
> dedup       19154.92  2603.90   0         2517.15
>
> Note: for the "others" type, which uses the "merge" operation, the dedup
> style can't make it more efficient. Later, we can put it in a separate CF
> and use the default merge flush style.
>
>>>
>>> Also, there is the question of where the CPU time is spent.
>>
>> Indeed, but if we can reduce the memtable size it means we save CPU in
>> other areas. Like you say below, it's complicated.
>>>
>>> 1- Big memtables mean we spend more time in submit_transaction, called
>>> by the kv_sync_thread, which is a bottleneck.
>>
>> At least on NVMe we see it pretty regularly in the wallclock traces. I
>> need to retest with Radoslav and Adam's hugepages PR to get a feel for
>> how bad it is after that.
>>
>>> 2- Higher dedup-style flush CPU usage is spent in the compaction
>>> thread(s) (I think?), which are asynchronous.
>>
>> L0 compaction is single threaded though, so we must be careful...
>>
>>> At the end of the day I think we need to use less CPU in total, so the
>>> optimization of the above factors is a bit complicated. OTOH if the
>>> goal is IOPS at whatever cost, it'll probably mean a slightly different
>>> choice.
>>
>> I guess we should consider the trends: lots of cores, lots of flash
>> cells. How do we balance high throughput and low latency?
>>
>>> I would *expect* that if we go from, say, 256MB tables to 64MB tables
>>> and dedup of <= 4 of them, then we'll see a modest net reduction of
>>> total CPU *and* a shift to the compaction threads.
>>
>> It seems like, based on Lisa's test results, that's too short-lived?
>> Maybe I'm not understanding what you mean?
>>
>>> And changing the pg log min entries will counterintuitively increase
>>> the costs of insertion and dedup flush because more keys will fit in
>>> the same amount of memtable... but if we reduce the memtable size at
>>> the same time we might get a win there too? Maybe?
>>
>> There's too much variability here to theorycraft it, and your "maybe"
>> statement confirms that for me. ;) We need to get a better handle on
>> what's going on.
>>
>>> Lisa, do you think limiting the dedup check during flush to specific
>>> prefixes would make sense as a general capability? If so, we could
>>> target this *just* at the high-value keys (e.g., deferred writes) and
>>> avoid incurring much additional overhead for the key ranges that aren't
>>> sure bets.
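Sage, if we did the prefix-based variant inside the flush itself (rather
than the per-CF split I suggest below), I imagine the check looking roughly
like this. This is purely a sketch; none of these names exist in rocksdb or
in my package:

  #include <set>
  #include <string>

  // Hypothetical whitelist of key prefixes that are worth the extra
  // memtable comparisons (e.g. the deferred-write prefix); keys with any
  // other prefix would be flushed as-is, with no dedup attempt.
  static bool should_attempt_dedup(const std::string& key,
                                   const std::set<std::string>& prefixes) {
    for (const auto& p : prefixes) {
      if (key.compare(0, p.size(), p) == 0)
        return true;
    }
    return false;
  }

That would keep the extra comparison cost proportional to the "sure bet"
key ranges only.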
> The easiest way to do that is to put the data in different CFs and use a
> different flush style (dedup or merge) in each CF.
>
>> At least in my testing, deferred writes during rbd 4k random writes are
>> almost negligible:
>>
>> http://pad.ceph.com/p/performance_weekly
>>
>> I suspect it's all going to be about OMAP. We need a really big WAL that
>> can keep OMAP around for a long time while quickly flushing object data
>> into small memtables. On disk it's a big deal that this gets laid out
>> sequentially, but on flash I'm wondering if we'd be better off with a
>> separate WAL for OMAP (a different rocksdb shard or a different data
>> store entirely).
> Yes, OMAP is the main data written into L0 SSTs.
>
> Data written into every memtable (unit: MB):
> IO load  omap   onodes  deferred  others
> 4k RW    37584  85253   323887    250
> 16k RW   33687  73458   0         3500
>
> With the merge flush style and min_write_buffer_number_to_merge=2,
> data written into L0 SSTs (unit: MB):
> IO load  Omap      onodes    deferred  others
> 4k RW    22188.28  14161.18  12.68681  0.90906
> 16k RW   19260.20  8230.02   0         1914.50
>
>> Mark
>>
>>> sage
>>>
>>>>> The above KV operation sequences come from 4k random writes over
>>>>> 30 mins. Overall, the rocksdb dedup package can decrease the data
>>>>> written into L0 SSTs, but it needs more comparisons. In my opinion,
>>>>> whether to use dedup depends on the configuration of the OSD host:
>>>>> whether the disk or the CPU is the busier resource.
>>>>
>>>> Do you have any insight into how much CPU overhead it adds?
>>>>
>>>>> Best wishes
>>>>> Lisa
>
> --
> Best wishes
> Lisa

--
Best wishes
Lisa
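P.S. To make the question at the top of this mail more concrete, this is
roughly the check I have in mind for the pg log keys during a dedup-style
flush: instead of comparing against every key in the newer memtables, only
look at the deleted ranges recorded by rm_range_keys (e.g. if it is
implemented with rocksdb's DeleteRange, these are the range tombstones of
the newer memtables). All names here are made up for illustration:

  #include <string>
  #include <vector>

  // One range deletion [begin, end), as a DeleteRange would record it.
  struct RangeTombstone {
    std::string begin;
    std::string end;
  };

  // A key in the memtable being flushed can be dropped if a *newer*
  // memtable holds a range tombstone that covers it, since the later
  // delete would erase it during compaction anyway.
  static bool covered_by_later_rm_range_keys(
      const std::string& key,
      const std::vector<RangeTombstone>& newer_tombstones) {
    for (const auto& t : newer_tombstones) {
      if (key >= t.begin && key < t.end)
        return true;
    }
    return false;
  }

If pg log keys really are only removed by rm_range_keys, scanning these
tombstones should be much cheaper than the full per-key comparison the
current dedup flush does.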