Hello,

There was an alternative path calling omap_rmkeys(). I have just pushed the
amended patch [1]. Big thanks for pointing out the problem! :-)

Regards,
Radek

[1] https://github.com/ceph/ceph/commit/db2ce11e351d0e8ae1edff625e15a2f8ec1151d8

On Fri, Oct 20, 2017 at 10:44 AM, xiaoyan li <wisher2003@xxxxxxxxx> wrote:
> Hi Radek,
>
> I ran the commit on the latest master branch and set
> rocksdb_enable_rmrange = true, but the logs indicate that rm_range_keys
> is never called. Do you know why?
>
> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L10936
> https://github.com/ceph/ceph/blob/master/src/os/bluestore/BlueStore.cc#L11081
>
> On Wed, Oct 18, 2017 at 5:59 PM, Radoslaw Zarzynski <rzarzyns@xxxxxxxxxx> wrote:
>> Hello Sage,
>>
>> The patch is on my GitHub [1].
>>
>> Please be aware that we also need to set "rocksdb_enable_rmrange = true"
>> to use DeleteRange from the rocksdb::WriteBatch interface. Otherwise our
>> KV abstraction layer translates rm_range_keys() into a sequence of calls
>> to Delete().
>>
>> Regards,
>> Radek
>>
>> P.S. My apologies for duplicating the message.
>>
>> [1] https://github.com/ceph/ceph/commit/92a28f033a5272b7dc2c5d726e67b6d09f6166ba
>>
>> On Tue, Oct 17, 2017 at 5:21 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>> On Tue, 17 Oct 2017, xiaoyan li wrote:
>>>> On Tue, Oct 17, 2017 at 10:49 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>> > On Tue, 17 Oct 2017, xiaoyan li wrote:
>>>> >> Hi Sage and Mark,
>>>> >> A question here: OMAP pg logs are added by "set"; are they only
>>>> >> deleted by rm_range_keys in BlueStore?
>>>> >> https://github.com/ceph/ceph/pull/18279/files
>>>> >
>>>> > Ooh, I didn't realize we weren't doing this already--we should
>>>> > definitely merge this patch. But:
>>>> >
>>>> >> If yes, maybe during dedup we don't need to compare the keys in all
>>>> >> memtables; we could just compare keys in the current memtable with
>>>> >> rm_range_keys in later memtables?
>>>> >
>>>> > They are currently deleted explicitly by key name by the OSD code; it
>>>> > doesn't call the range-based delete method. Radoslaw had a test branch
>>>> > last week that tried using rm_range_keys instead, but he didn't see any
>>>> > real difference... presumably because we didn't realize the bluestore
>>>> > omap code wasn't passing a range delete down to KeyValueDB! We should
>>>> > retest on top of your change.
>>>> I will also have a check.
>>>> A memtable includes two parts: key/value operations (set, delete,
>>>> single delete, merge) and range_del (range deletes). I am wondering:
>>>> if all the pg logs are deleted by range delete, we can just check
>>>> whether a key/value is deleted in the range_del part of later
>>>> memtables during the dedup flush; this would save a lot of comparison
>>>> effort.
>>>
>>> That sounds very promising! Radoslaw, can you share your patch changing
>>> the PG log trimming behavior?
>>>
>>> Thanks!
>>> sage
>>>
>>>> > Thanks!
>>>> > sage
>>>> >
>>>> >>
>>>> >> On Tue, Oct 17, 2017 at 10:18 AM, xiaoyan li <wisher2003@xxxxxxxxx> wrote:
>>>> >> > Hi Sage and Mark,
>>>> >> > The following test results are based on KV sequences captured from
>>>> >> > librbd+fio 4k or 16k random writes over 30 minutes.
>>>> >> > In my opinion, we may use the dedup flush style for onodes and
>>>> >> > deferred data, but keep the default merge flush style for other data.
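For reference, a rough sketch of the two code paths mentioned above for
rm_range_keys(). This is a hypothetical helper, not the actual RocksDBStore
code: with rocksdb_enable_rmrange = true a single range tombstone goes into
the batch, otherwise the range is enumerated and deleted key by key.

  // Hypothetical illustration only; not the Ceph KeyValueDB implementation.
  #include <memory>
  #include <rocksdb/db.h>
  #include <rocksdb/write_batch.h>

  void rm_range_keys_sketch(rocksdb::DB* db, rocksdb::WriteBatch& batch,
                            const rocksdb::Slice& begin,
                            const rocksdb::Slice& end,
                            bool enable_rmrange) {
    if (enable_rmrange) {
      // One range tombstone: what rocksdb_enable_rmrange = true gives us.
      batch.DeleteRange(begin, end);
    } else {
      // Fallback: enumerate [begin, end) and issue one Delete() per key.
      std::unique_ptr<rocksdb::Iterator> it(
          db->NewIterator(rocksdb::ReadOptions()));
      for (it->Seek(begin); it->Valid() && it->key().compare(end) < 0;
           it->Next()) {
        batch.Delete(it->key());
      }
    }
  }

Either way the batch is applied the same way afterwards; the difference is
how many tombstones end up in the memtable.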
>>>> >> > On Mon, Oct 16, 2017 at 9:50 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>>>> >> >> On 10/16/2017 08:28 AM, Sage Weil wrote:
>>>> >> >>>
>>>> >> >>> [adding ceph-devel]
>>>> >> >>>
>>>> >> >>> On Mon, 16 Oct 2017, Mark Nelson wrote:
>>>> >> >>>>
>>>> >> >>>> Hi Lisa,
>>>> >> >>>>
>>>> >> >>>> Excellent testing! This is exactly what we were trying to understand.
>>>> >> >>>>
>>>> >> >>>> On 10/16/2017 12:55 AM, Li, Xiaoyan wrote:
>>>> >> >>>>>
>>>> >> >>>>> Hi Mark,
>>>> >> >>>>>
>>>> >> >>>>> Based on my testing, when min_write_buffer_number_to_merge is set
>>>> >> >>>>> to 2, the onodes and deferred data written into L0 SST decrease a
>>>> >> >>>>> lot with my rocksdb dedup package.
>>>> >> >>>>>
>>>> >> >>>>> But omap data needs to span more memtables. I tested omap data in
>>>> >> >>>>> a separate column family. From the data, you can see that when
>>>> >> >>>>> min_write_buffer_number_to_merge is set to 4, the amount of data
>>>> >> >>>>> written into L0 SST is good. That means the memtable being flushed
>>>> >> >>>>> has to be compared recursively with the 3 later memtables.
>>>> >> >>>>> kFlushStyleDedup is the new flush style in my rocksdb dedup package.
>>>> >> >>>>> kFlushStyleMerge is the current flush style in the master branch.
>>>> >> >>>>>
>>>> >> >>>>> But this only considers the data written into L0. With more
>>>> >> >>>>> memtables to compare, it costs more CPU and computing time.
>>>> >> >>>>>
>>>> >> >>>>> Memtable size: 256MB
>>>> >> >>>>>
>>>> >> >>>>> max_write_buffer_number  min_write_buffer_number_to_merge  flush_style       Omap data written into L0 SST (MB)
>>>> >> >>>>> 16                       8                                 kFlushStyleMerge  7665
>>>> >> >>>>> 16                       8                                 kFlushStyleDedup  3770
>>>> >> >>>>> 8                        4                                 kFlushStyleMerge  11470
>>>> >> >>>>> 8                        4                                 kFlushStyleDedup  3922
>>>> >> >>>>> 6                        3                                 kFlushStyleMerge  14059
>>>> >> >>>>> 6                        3                                 kFlushStyleDedup  5001
>>>> >> >>>>> 4                        2                                 kFlushStyleMerge  18683
>>>> >> >>>>> 4                        2                                 kFlushStyleDedup  15394
>>>> >> >>>>
>>>> >> >>>> Is this only omap data? It looks like the 6/3 or 8/4 is still
>>>> >> >>>> probably the optimal point (and the improvements are quite
>>>> >> >>>> noticeable!).
>>>> >> > This is only omap data. Dedup can decrease the data written into L0
>>>> >> > SST, but it needs to compare too many memtables.
>>>> >> >
>>>> >> >>>> Sadly we were hoping we might be able to get away with smaller
>>>> >> >>>> memtables (say 64MB) with kFlushStyleDedup. It looks like that might
>>>> >> >>>> not be the case unless we increase the number very high.
>>>> >> >>>>
>>>> >> >>>> Sage, is this going to be even worse if we try to keep more pglog
>>>> >> >>>> entries around on flash OSD backends?
>>>> >> >>>
>>>> >> >>> I think there are three or more factors at play here:
>>>> >> >>>
>>>> >> >>> 1- If we reduce the memtable size, the CPU cost of insertion
>>>> >> >>> (baseline) and the dedup cost will go down.
>>>> >> >>>
>>>> >> >>> 2- If we switch to a small min pg log entry count, then most pg log
>>>> >> >>> keys *will* fall into the smaller window (of small memtables * small
>>>> >> >>> min_write_buffer_number_to_merge). The dup op keys probably won't,
>>>> >> >>> though... except maybe they will, because the values are small and
>>>> >> >>> more of them will fit into the memtables. But then:
>>>> >> >>>
>>>> >> >>> 3- If we have more keys and smaller values, the CPU overhead will be
>>>> >> >>> higher again.
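For anyone reproducing the table above, these are the stock RocksDB knobs
involved (shown for the 8/4 row). kFlushStyleDedup only exists in Lisa's
dedup branch, so it is left as a comment rather than a real field.

  #include <rocksdb/options.h>

  rocksdb::Options make_test_options() {
    rocksdb::Options opts;
    opts.write_buffer_size = 256 * 1024 * 1024;  // 256 MB memtables, as tested
    opts.max_write_buffer_number = 8;            // keep up to 8 memtables around
    opts.min_write_buffer_number_to_merge = 4;   // flush compares up to 4 of them
    // opts.flush_style = kFlushStyleDedup;      // dedup-branch-only knob
    return opts;
  }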
>>>> >> >>>
>>>> >> >>> For PG logs, I didn't really expect that the dedup style would help;
>>>> >> >>> I was only thinking about the deferred keys. I wonder if it would
>>>> >> >>> make sense to specify a handful of key prefixes to attempt dedup on,
>>>> >> >>> and not bother on the others?
>>>> >> >>
>>>> >> >> Deferred keys seem to be a much smaller part of the problem right now
>>>> >> >> than pglog, at least based on what I'm seeing at the moment with NVMe
>>>> >> >> testing. Regarding dedup, it's probably worth testing at the very least.
>>>> >> > I did the following tests: all data in the default column family, with
>>>> >> > min_write_buffer_number_to_merge set to 2, checking the size of each
>>>> >> > kind of data written into L0 SST files.
>>>> >> > From the data, onodes and deferred data can be reduced a lot with the
>>>> >> > dedup style.
>>>> >> >
>>>> >> > Data written into L0 SST files:
>>>> >> >
>>>> >> > 4k random writes (unit: MB)
>>>> >> > FlushStyle  Omap      onodes    deferred  others
>>>> >> > merge       22431.56  23224.54  1530.105  0.906106
>>>> >> > dedup       22188.28  14161.18  12.68681  0.90906
>>>> >> >
>>>> >> > 16k random writes (unit: MB)
>>>> >> > FlushStyle  Omap      onodes   deferred  others
>>>> >> > merge       19260.20  8230.02  0         1914.50
>>>> >> > dedup       19154.92  2603.90  0         2517.15
>>>> >> >
>>>> >> > Note: for the "others" type, which uses the "merge" operation, the
>>>> >> > dedup style can't make it more efficient. Later we can put it in a
>>>> >> > separate CF and use the default merge flush style.
>>>> >> >
>>>> >> >>>
>>>> >> >>> Also, there is the question of where the CPU time is spent.
>>>> >> >>
>>>> >> >> Indeed, but if we can reduce the memtable size it means we save CPU in
>>>> >> >> other areas. Like you say below, it's complicated.
>>>> >> >>>
>>>> >> >>> 1- Big memtables mean we spend more time in submit_transaction,
>>>> >> >>> called by the kv_sync_thread, which is a bottleneck.
>>>> >> >>
>>>> >> >> At least on NVMe we see it pretty regularly in the wallclock traces. I
>>>> >> >> need to retest with Radoslaw and Adam's hugepages PR to get a feel for
>>>> >> >> how bad it is after that.
>>>> >> >>>
>>>> >> >>> 2- Higher dedup-style flush CPU usage is spent in the compaction
>>>> >> >>> thread(s) (I think?), which are asynchronous.
>>>> >> >>
>>>> >> >> L0 compaction is single threaded though, so we must be careful...
>>>> >> >>>
>>>> >> >>> At the end of the day I think we need to use less CPU total, so the
>>>> >> >>> optimization of the above factors is a bit complicated. OTOH if the
>>>> >> >>> goal is IOPS at whatever cost, it'll probably mean a slightly
>>>> >> >>> different choice.
>>>> >> >>
>>>> >> >> I guess we should consider the trends: lots of cores, lots of flash
>>>> >> >> cells. How do we balance high throughput and low latency?
>>>> >> >>>
>>>> >> >>> I would *expect* that if we go from, say, 256MB tables to 64MB tables
>>>> >> >>> and dedup of <= 4 of them, then we'll see a modest net reduction of
>>>> >> >>> total CPU *and* a shift to the compaction threads.
>>>> >> >>
>>>> >> >> It seems like, based on Lisa's test results, that's too short lived?
>>>> >> >> Maybe I'm not understanding what you mean?
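To make the check Lisa proposed earlier in the thread a bit more concrete
(skip flushing a key when a newer memtable's range_del section already covers
it), here is a toy model in plain C++. It is not the RocksDB flush code, and
the key names are made up.

  #include <iostream>
  #include <map>
  #include <string>
  #include <vector>

  struct RangeDel { std::string begin, end; };   // covers [begin, end)

  struct Memtable {
    std::map<std::string, std::string> kv;       // point writes
    std::vector<RangeDel> range_dels;            // DeleteRange tombstones
  };

  // True if any range tombstone in a newer memtable covers `key`.
  bool covered_by_newer_range_del(const std::string& key,
                                  const std::vector<Memtable>& newer) {
    for (const auto& m : newer)
      for (const auto& rd : m.range_dels)
        if (key >= rd.begin && key < rd.end)
          return true;
    return false;
  }

  int main() {
    Memtable oldest;   // the memtable being flushed
    oldest.kv = {{"pglog.0001", "a"}, {"pglog.0002", "b"}, {"onode.x", "c"}};

    Memtable newer;    // a later memtable holding one range delete
    newer.range_dels.push_back({"pglog.0000", "pglog.9999"});

    for (const auto& kv : oldest.kv) {
      if (covered_by_newer_range_del(kv.first, {newer}))
        std::cout << "skip  " << kv.first << " (covered by later DeleteRange)\n";
      else
        std::cout << "flush " << kv.first << "=" << kv.second << "\n";
    }
  }

The point is that the flush would only consult the later memtables' range
tombstones instead of comparing full key sets, which is where the saving in
comparison effort would come from.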
>>>> >> >>>
>>>> >> >>> And changing the pg log min entries will counterintuitively increase
>>>> >> >>> the costs of insertion and dedup flush, because more keys will fit in
>>>> >> >>> the same amount of memtable... but if we reduce the memtable size at
>>>> >> >>> the same time, we might get a win there too? Maybe?
>>>> >> >>
>>>> >> >> There's too much variability here to theorycraft it, and your "maybe"
>>>> >> >> confirms that for me. ;) We need to get a better handle on what's
>>>> >> >> going on.
>>>> >> >>>
>>>> >> >>> Lisa, do you think limiting the dedup check during flush to specific
>>>> >> >>> prefixes would make sense as a general capability? If so, we could
>>>> >> >>> target this *just* at the high-value keys (e.g., deferred writes) and
>>>> >> >>> avoid incurring very much additional overhead for the key ranges that
>>>> >> >>> aren't sure bets.
>>>> >> > The easiest way to do that is to put the data in different CFs and use
>>>> >> > a different flush style (dedup or merge) per CF.
>>>> >> >
>>>> >> >> At least in my testing, deferred writes during rbd 4k random writes
>>>> >> >> are almost negligible:
>>>> >> >>
>>>> >> >> http://pad.ceph.com/p/performance_weekly
>>>> >> >>
>>>> >> >> I suspect it's all going to be about OMAP. We need a really big WAL
>>>> >> >> that can keep OMAP around for a long time while quickly flushing
>>>> >> >> object data into small memtables. On disk it's a big deal that this
>>>> >> >> gets laid out sequentially, but on flash I'm wondering if we'd be
>>>> >> >> better off with a separate WAL for OMAP (a different rocksdb shard or
>>>> >> >> a different data store entirely).
>>>> >> > Yes, OMAP is the main data written into L0 SST.
>>>> >> >
>>>> >> > Data written into every memtable (unit: MB):
>>>> >> > IO load  omap   onodes  deferred  others
>>>> >> > 4k RW    37584  85253   323887    250
>>>> >> > 16k RW   33687  73458   0         3500
>>>> >> >
>>>> >> > In merge flush style with min_write_buffer_number_to_merge=2.
>>>> >> > Data written into every L0 SST (unit: MB):
>>>> >> > IO load  Omap      onodes    deferred  others
>>>> >> > 4k RW    22188.28  14161.18  12.68681  0.90906
>>>> >> > 16k RW   19260.20  8230.02   0         1914.50
>>>> >> >
>>>> >> >> Mark
>>>> >> >>>
>>>> >> >>> sage
>>>> >> >>>
>>>> >> >>>>> The above KV operation sequences come from 4k random writes over 30
>>>> >> >>>>> minutes. Overall, the RocksDB dedup package can decrease the data
>>>> >> >>>>> written into L0 SST, but it needs more comparisons. In my opinion,
>>>> >> >>>>> whether to use dedup depends on the configuration of the OSD host:
>>>> >> >>>>> whether the disk or the CPU is the busier resource.
>>>> >> >>>>
>>>> >> >>>> Do you have any insight into how much CPU overhead it adds?
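Lisa's "different flush style per CF" suggestion could look roughly like this
at DB open time. The CF name, buffer sizes, and the commented-out flush-style
knob are illustrative assumptions, not the actual BlueStore configuration.

  #include <cassert>
  #include <string>
  #include <vector>
  #include <rocksdb/db.h>

  rocksdb::DB* open_with_omap_cf(
      const std::string& path,
      std::vector<rocksdb::ColumnFamilyHandle*>* handles) {
    rocksdb::ColumnFamilyOptions default_cf;          // onodes, deferred, ...
    default_cf.write_buffer_size = 64 * 1024 * 1024;

    rocksdb::ColumnFamilyOptions omap_cf;             // keep omap around longer
    omap_cf.write_buffer_size = 256 * 1024 * 1024;
    omap_cf.min_write_buffer_number_to_merge = 4;
    // omap_cf.flush_style = ...;                     // a per-CF flush style
                                                      // exists only in the dedup branch

    std::vector<rocksdb::ColumnFamilyDescriptor> cfs = {
      {rocksdb::kDefaultColumnFamilyName, default_cf},
      {"omap", omap_cf},
    };

    rocksdb::DBOptions db_opts;
    db_opts.create_if_missing = true;
    db_opts.create_missing_column_families = true;

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(db_opts, path, cfs, handles, &db);
    assert(s.ok());
    return db;
  }

Each CF gets its own memtables and flush settings, which is what would make
this kind of per-type tuning possible.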