2017-06-06 23:29 GMT+08:00 Haomai Wang <haomai@xxxxxxxx>:
> On Tue, Jun 6, 2017 at 11:23 PM, Xiaoxi Chen <superdebuger@xxxxxxxxx> wrote:
>> Hi Mark,
>>      Cool data. Wondering: if we take the compaction wall-clock time
>> into account, does the 32MB memtable still show an advantage in CPU
>> consumption? I am expecting 256MB to be better, as the compaction
>> workload is reduced significantly.
>
> I don't have real data on the difference, but from the implementation,
> a larger table size will cause worse internal fragmentation. To achieve
> the same live utilization as a smaller table size, it needs to generate
> more compaction traffic.

I totally hear your point. But compaction is not only from L0 to L1; it
happens at all levels. Merging short-lived KV creations/deletions, and
several updates against the same KV, at the upper level (L0) should be
more efficient than doing it later at a lower level, shouldn't it?

> Hmm, I'm not sure my description answers the question....
>
>>
>> Xiaoxi
>>
>> 2017-06-02 21:48 GMT+08:00 Mark Nelson <mnelson@xxxxxxxxxx>:
>>> Hi all,
>>>
>>> Last fall we ran through some tests to try to determine how many and
>>> what size write buffers (i.e. the memtable size) should be used in
>>> rocksdb for bluestore:
>>>
>>> https://drive.google.com/file/d/0B2gTBZrkrnpZRFdiYjFRNmxLblU/view?usp=sharing
>>>
>>> As a result of that analysis, we chose to use larger than expected
>>> buffers. The advantage is that the number of compaction events and the
>>> total amount of compacted data are greatly reduced in the tests we ran.
>>> The downside is that working on larger memtables is potentially slower.
>>>
>>> Since we did that analysis last fall, we've made a number of changes
>>> that could potentially affect the results. Specifically, we discovered
>>> that the compaction thread was under extremely heavy load, even with
>>> large buffers, doing small sequential reads due to a lack of compaction
>>> readahead.
>>> The compaction thread is much less busy after that fix, so we decided
>>> to run a couple of new, smaller-scale tests to verify our original
>>> findings. As opposed to the previous tests, these tests were only run
>>> against a single OSD and used a larger 512GB RBD volume where not all
>>> of the onodes could fit in the bluestore cache. Measurements were taken
>>> after the volume was pre-filled with 4MB writes, followed by a 5 minute
>>> 4k random write workload.
>>>
>>> https://drive.google.com/file/d/0B2gTBZrkrnpZVXpzR2JNRmR0WFE/view?usp=sharing
>>>
>>> In these results, the effect of compaction on client IO is dramatically
>>> lower, since the OSD is spending a significant amount of time doing
>>> onode reads from the DB. Having said that, DB and compaction statistics
>>> still show a dramatic reduction in reads, writes, and write-amp when
>>> larger buffers/memtables are used.
>>>
>>> The question of whether or not the large memtables might be hurting in
>>> other ways remains. To examine this, additional tests were run, this
>>> time with a smaller 16GB RBD volume so that all onodes stay in cache.
>>> 4k random write tests with 4x 256MB and 32MB buffers were compared.
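As an aside on reproducing this: the buffer count and size being compared correspond to rocksdb's write_buffer_size and max_write_buffer_number options, which bluestore passes through via its rocksdb options string. A hypothetical ceph.conf override for the 4x 256MB case could look like the following; the option names are rocksdb's, but treat the exact values as illustrative rather than a recommendation:

```ini
[osd]
# 4 memtables of 256MB each (the "large buffer" configuration above).
# write_buffer_size is per-memtable, in bytes: 256 * 1024 * 1024.
bluestore_rocksdb_options = write_buffer_size=268435456,max_write_buffer_number=4,min_write_buffer_number_to_merge=1
```

For the 32MB case, write_buffer_size would drop to 33554432 with the other options unchanged.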
>>> Using 256MB buffers provided around a 10% performance advantage vs 32MB
>>> buffers; however, tests with 32MB buffers showed less time spent doing
>>> key comparisons when adding data to the memtables in kv_sync_thread:
>>>
>>> 32MB buffers:
>>>
>>> 34.45% rocksdb::MemTable::Add
>>> + 20.40% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::Insert<false>
>>> | + 18.65% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::RecomputeSpliceLevels
>>> | | + 18.45% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::FindSpliceForLevel
>>> | | + 16.00% KeyIsAfterNode
>>> | | | + 15.60% rocksdb::MemTable::KeyComparator::operator()
>>> | | | + 10.25% rocksdb::InternalKeyComparator::Compare
>>> | | | | + 6.95% rocksdb::(anonymous namespace)::BytewiseComparatorImpl::Compare
>>> | | | | | + 6.30% compare
>>> | | | | | + 5.55% __memcmp_sse4_1
>>> | | | | | + 0.10% memcmp@plt
>>> | | | | + 0.10% ExtractUserKey
>>> | | | + 4.00% GetLengthPrefixedSlice
>>> | | | + 0.45% GetVarint32Ptr
>>>
>>> 256MB buffers:
>>>
>>> 43.20% rocksdb::MemTable::Add
>>> + 30.85% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::Insert<false>
>>> | + 29.15% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::RecomputeSpliceLevels
>>> | | + 28.70% rocksdb::InlineSkipList<rocksdb::MemTableRep::KeyComparator const&>::FindSpliceForLevel
>>> | | + 25.50% KeyIsAfterNode
>>> | | | + 24.90% rocksdb::MemTable::KeyComparator::operator()
>>> | | | + 13.05% rocksdb::InternalKeyComparator::Compare
>>> | | | | + 9.35% rocksdb::(anonymous namespace)::BytewiseComparatorImpl::Compare
>>> | | | | | + 8.50% compare
>>> | | | | | + 7.55% __memcmp_sse4_1
>>> | | | | | + 0.30% memcmp@plt
>>> | | | | + 0.55% ExtractUserKey
>>> | | | + 10.05% GetLengthPrefixedSlice
>>> | | | + 2.05% GetVarint32Ptr
>>>
>>> So the takeaway here is that larger buffers still provide a performance
>>> advantage, likely due to a
>>> much better compaction workload, but cause the kv_sync_thread to spend
>>> more wall-clock time and burn more CPU processing the memtable in
>>> KeyIsAfterNode. Using smaller buffers reduces the time spent there from
>>> ~25.5% to ~16% during the actual writes, but results in higher
>>> write-amp, more writes, and more reads during compaction. Ultimately,
>>> all of this only really matters if the onodes are kept in cache, since
>>> cache misses quickly become a major bottleneck when the cache isn't
>>> large enough to hold all onodes in random write scenarios.
>>>
>>> Mark
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html