2017-06-06 23:29 GMT+08:00 Haomai Wang <haomai@xxxxxxxx>:
> On Tue, Jun 6, 2017 at 11:23 PM, Xiaoxi Chen <superdebuger@xxxxxxxxx> wrote:
>> Hi Mark,
>>      Cool data. Wondering: if we take the compaction wall-clock time
>> into account, does the 32MB memtable still show an advantage in CPU
>> consumption? I am expecting 256MB to be better, as the compaction
>> workload is reduced significantly.
>
> I don't have real data on the difference, but from the implementation,
> a larger table size will cause worse internal fragmentation. To achieve
> the same live utilization as a smaller table size, it needs to generate
> more compaction traffic.

I totally hear your point. But compaction is not only from L0 to L1; it
happens at all levels. Merging short-lived KV creations/deletions, and
several updates against the same KV, at the upper level (L0) should be
more efficient than doing it later at a lower level, shouldn't it?

> Hmm, I'm not sure my description answers the question....
>
>>
>> Xiaoxi
>>
>> 2017-06-02 21:48 GMT+08:00 Mark Nelson <mnelson@xxxxxxxxxx>:
>>> Hi all,
>>>
>>> Last fall we ran through some tests to try to determine how many and
>>> what size write buffers (i.e. the memtable size) should be used in
>>> rocksdb for bluestore:
>>>
>>> https://drive.google.com/file/d/0B2gTBZrkrnpZRFdiYjFRNmxLblU/view?usp=sharing
>>>
>>> As a result of that analysis, we chose to use larger than expected
>>> buffers. The advantage is that the number of compaction events and the
>>> total amount of compacted data are greatly reduced in the tests we ran.
>>> The downside is that working on larger memtables is potentially slower.
>>>
>>> Since we did that analysis last fall, we've made a number of changes
>>> that could potentially affect the results. Specifically, we discovered
>>> that the compaction thread was under extremely heavy load, even with
>>> large buffers, doing small sequential reads due to a lack of compaction
>>> readahead.
>>> The compaction thread is much less busy after that fix, so we decided
>>> to run a couple of new, smaller-scale tests to verify our original
>>> findings. As opposed to the previous tests, these tests were only run
>>> against a single OSD and used a larger 512GB RBD volume where not all
>>> of the onodes could fit in the bluestore cache. Measurements were taken
>>> after the volume was pre-filled with 4MB writes, followed by a 5 minute
>>> 4k random write workload.
>>>
>>> https://drive.google.com/file/d/0B2gTBZrkrnpZVXpzR2JNRmR0WFE/view?usp=sharing
>>>
>>> In these results, the effect of compaction on client IO is dramatically
>>> lower, since the OSD is spending a significant amount of time doing
>>> onode reads from the DB. Having said that, DB and compaction statistics
>>> still show a dramatic reduction in reads, writes, and write-amp when
>>> larger buffers/memtables are used.
>>>
>>> The question of whether or not the large memtables might be hurting in
>>> other ways remains. To examine this, additional tests were run, this
>>> time with a smaller 16GB RBD volume so that all onodes stay in cache.
>>> 4k random write tests with 4x 256MB and 32MB buffers were compared.
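As an aside on reproducing this: the buffer count and size being compared correspond to rocksdb's write_buffer_size and max_write_buffer_number options, which bluestore passes through via its rocksdb options string. A hypothetical ceph.conf override for the 4x 256MB case could look like the following; the option names are rocksdb's, but treat the exact values as illustrative rather than a recommendation:

```ini
[osd]
# 4 memtables of 256MB each (the "large buffer" configuration above).
# write_buffer_size is per-memtable, in bytes: 256 * 1024 * 1024.
bluestore_rocksdb_options = write_buffer_size=268435456,max_write_buffer_number=4,min_write_buffer_number_to_merge=1
```

For the 32MB case, write_buffer_size would drop to 33554432 with the other options unchanged.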
>>> Using 256MB buffers provided around a 10% performance advantage vs 32MB
>>> buffers; however, tests with 32MB buffers showed less time spent doing
>>> key comparisons when adding data to the memtables in kv_sync_thread:
>>>
>>> 32MB buffers:
>>>
>>> 34.45% rocksdb::MemTable::Add
>>> + 20.40% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::Insert<false>
>>> | + 18.65% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::RecomputeSpliceLevels
>>> | | + 18.45% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::FindSpliceForLevel
>>> | | + 16.00% KeyIsAfterNode
>>> | | | + 15.60% rocksdb::MemTable::KeyComparator::operator()
>>> | | | + 10.25% rocksdb::InternalKeyComparator::Compare
>>> | | | | + 6.95% rocksdb::(anonymous namespace)::BytewiseComparatorImpl::Compare
>>> | | | | | + 6.30% compare
>>> | | | | | + 5.55% __memcmp_sse4_1
>>> | | | | | + 0.10% memcmp@plt
>>> | | | | + 0.10% ExtractUserKey
>>> | | | + 4.00% GetLengthPrefixedSlice
>>> | | | + 0.45% GetVarint32Ptr
>>>
>>> 256MB buffers:
>>>
>>> 43.20% rocksdb::MemTable::Add
>>> + 30.85% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::Insert<false>
>>> | + 29.15% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::RecomputeSpliceLevels
>>> | | + 28.70% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::FindSpliceForLevel
>>> | | + 25.50% KeyIsAfterNode
>>> | | | + 24.90% rocksdb::MemTable::KeyComparator::operator()
>>> | | | + 13.05% rocksdb::InternalKeyComparator::Compare
>>> | | | | + 9.35% rocksdb::(anonymous namespace)::BytewiseComparatorImpl::Compare
>>> | | | | | + 8.50% compare
>>> | | | | | + 7.55% __memcmp_sse4_1
>>> | | | | | + 0.30% memcmp@plt
>>> | | | | + 0.55% ExtractUserKey
>>> | | | + 10.05% GetLengthPrefixedSlice
>>> | | | + 2.05% GetVarint32Ptr
>>>
>>> So the takeaway here is that larger buffers still provide a performance
>>> advantage, likely due to a
>>> much better compaction workload, but cause the kv_sync_thread to spend
>>> more wall-clock time and burn more CPU processing the memtable in
>>> KeyIsAfterNode. Using smaller buffers reduces the time spent there from
>>> ~25.5% to ~16% during the actual writes, but results in higher
>>> write-amp, more writes, and more reads during compaction. Ultimately,
>>> all of this only really matters if the onodes are kept in cache, since
>>> cache misses quickly become a major bottleneck when the cache isn't
>>> large enough to hold all onodes in random write scenarios.
>>>
>>> Mark
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html