Hi all,
Last fall we ran a set of tests to determine how many write buffers
(i.e. memtables) should be used in RocksDB for BlueStore, and how
large they should be:
https://drive.google.com/file/d/0B2gTBZrkrnpZRFdiYjFRNmxLblU/view?usp=sharing
As a result of that analysis, we chose to use larger-than-expected
buffers. The advantage is that the number of compaction events and
the total amount of compacted data are greatly reduced in the tests
we ran. The downside is that working on larger memtables is
potentially slower.
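For anyone who wants to experiment with this, the knobs involved are
RocksDB's write_buffer_size and max_write_buffer_number (in Ceph they
are passed through via bluestore_rocksdb_options). A minimal sketch
of a "large buffer" setup at the RocksDB level; the 4x256MB values
are illustrative, not necessarily what we settled on:

// Sketch only: how the memtable sizing knobs are set on a RocksDB
// instance. The 4x256MB values are illustrative.
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.write_buffer_size = 256 * 1024 * 1024;  // size of each memtable
  opts.max_write_buffer_number = 4;            // memtables allowed in memory
  opts.min_write_buffer_number_to_merge = 1;   // flush as soon as one fills

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/wb-test", &db);
  if (!s.ok()) return 1;
  delete db;
  return 0;
}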
Since we did that analysis last fall, we've made a number of changes
that could potentially affect the results. Specifically, we
discovered that the compaction thread was under extremely heavy load,
even with large buffers, because it was doing small sequential reads
due to a lack of compaction readahead. The compaction thread is much
less busy after that fix, so we decided to run a couple of new,
smaller-scale tests to verify our original findings. Unlike the
previous tests, these were run against only a single OSD and used a
larger 512GB RBD volume, so that not all of the onodes could fit in
the BlueStore cache. Measurements were taken after the volume was
pre-filled with 4MB writes, followed by a 5-minute 4k random write
workload.
https://drive.google.com/file/d/0B2gTBZrkrnpZVXpzR2JNRmR0WFE/view?usp=sharing
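(As a side note on the readahead fix mentioned above: the relevant
RocksDB knob is compaction_readahead_size. A rough sketch of what
enabling it looks like; the 2MB value is just an example, not a
recommendation:)

// Sketch only: enabling readahead for compaction input reads so the
// compaction thread issues large sequential reads instead of many
// small ones. The 2MB value is just an example.
#include <rocksdb/options.h>

rocksdb::Options make_compaction_opts() {
  rocksdb::Options opts;
  opts.compaction_readahead_size = 2 * 1024 * 1024;
  return opts;
}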
In these results, the effect of compaction on client IO is
dramatically lower than in the earlier tests, since the OSD is now
spending a significant amount of time doing onode reads from the DB.
Having said that, the DB and compaction statistics still show a
dramatic reduction in reads, writes, and write-amp when larger
buffers/memtables are used.
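For anyone who wants to reproduce those numbers, the per-level
compaction statistics (read/write volume, write-amp, etc.) can be
pulled straight out of RocksDB. A minimal sketch, assuming a handle
on the open DB; this is the generic way to get the numbers, not
necessarily how they were collected for the spreadsheet above:

// Sketch: dump RocksDB's internal stats, which include the per-level
// compaction table (Read(GB), Write(GB), W-Amp, ...).
#include <iostream>
#include <string>
#include <rocksdb/db.h>

void dump_compaction_stats(rocksdb::DB* db) {
  std::string stats;
  if (db->GetProperty("rocksdb.stats", &stats)) {
    std::cout << stats << std::endl;
  }
}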
The question of whether or not the large memtables might be hurting
in other ways remains. To examine this, additional tests were run,
this time with a smaller 16GB RBD volume so that all onodes stay in
cache. 4k random write tests were run with 4 write buffers, comparing
256MB and 32MB buffer sizes. Using 256MB buffers provided around a
10% performance advantage vs 32MB buffers; however, tests with 32MB
buffers showed less time spent doing key comparisons when adding data
to the memtables in kv_sync_thread:
32MB buffers:
34.45% rocksdb::MemTable::Add
+ 20.40% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::Insert<false>
| + 18.65% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::RecomputeSpliceLevels
| | + 18.45% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::FindSpliceForLevel
| | + 16.00% KeyIsAfterNode
| | | + 15.60% rocksdb::MemTable::KeyComparator::operator()
| | | + 10.25% rocksdb::InternalKeyComparator::Compare
| | | | + 6.95% rocksdb::(anonymous namespace)::BytewiseComparatorImpl::Compare
| | | | | + 6.30% compare
| | | | | + 5.55% __memcmp_sse4_1
| | | | | + 0.10% memcmp@plt
| | | | + 0.10% ExtractUserKey
| | | + 4.00% GetLengthPrefixedSlice
| | | + 0.45% GetVarint32Ptr
256MB buffers:
43.20% rocksdb::MemTable::Add
+ 30.85% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::Insert<false>
| + 29.15% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::RecomputeSpliceLevels
| | + 28.70% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::FindSpliceForLevel
| | + 25.50% KeyIsAfterNode
| | | + 24.90% rocksdb::MemTable::KeyComparator::operator()
| | | + 13.05% rocksdb::InternalKeyComparator::Compare
| | | | + 9.35% rocksdb::(anonymous namespace)::BytewiseComparatorImpl::Compare
| | | | | + 8.50% compare
| | | | | + 7.55% __memcmp_sse4_1
| | | | | + 0.30% memcmp@plt
| | | | + 0.55% ExtractUserKey
| | | + 10.05% GetLengthPrefixedSlice
| | | + 2.05% GetVarint32Ptr
So the takeaway here is that larger buffers still provide a
performance advantage, likely due to a much better compaction
workload, but they cause the kv_sync_thread to spend more wall-clock
time and burn more CPU processing the memtable in KeyIsAfterNode.
Using smaller buffers reduces the time spent there from ~25.5% to
~16% during the actual writes, but results in higher write-amp, more
writes, and more reads during compaction. Ultimately, all of this
only really matters if the onodes are kept in cache, since cache
misses quickly become a major bottleneck when the cache isn't large
enough to hold all onodes in random write scenarios.
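As a rough sanity check on the comparison overhead (the average entry
size here is a pure assumption, only for illustration), skip list
inserts cost on the order of log2(n) key comparisons:

  32MB  / ~100 bytes per entry ~ 335K entries -> log2 ~ 18
  256MB / ~100 bytes per entry ~ 2.7M entries -> log2 ~ 21

So under those assumptions the per-insert comparison count only grows
by a few levels even though the memtable holds 8x as much data.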
Mark