Re: Work update related to rocksdb

On 10/16/2017 08:28 AM, Sage Weil wrote:
[adding ceph-devel]

On Mon, 16 Oct 2017, Mark Nelson wrote:
Hi Lisa,

Excellent testing!   This is exactly what we were trying to understand.

On 10/16/2017 12:55 AM, Li, Xiaoyan wrote:
Hi Mark,

Based on my testing, when min_write_buffer_number_to_merge is set to 2, the
onode and deferred data written into L0 SSTs decrease a lot with my
rocksdb dedup package.

But omap data needs to span more memtables. I tested omap data in a
separate column family. From the data, you can see that when
min_write_buffer_number_to_merge is set to 4, the amount of data written into
L0 SSTs looks good. That means the memtable being flushed has to be compared
with the following 3 memtables recursively.
kFlushStyleDedup is the new flush style in my rocksdb dedup package.
kFlushStyleMerge is the current flush style in the master branch.

But this only considers the data written into L0. With more memtables
to compare, it costs more CPU and computation time.

Memtable size: 256MB

max_write_buffer_number | min_write_buffer_number_to_merge | flush_style      | Omap data written into L0 SSTs (MB)
16                      | 8                                 | kFlushStyleMerge | 7665
16                      | 8                                 | kFlushStyleDedup | 3770
8                       | 4                                 | kFlushStyleMerge | 11470
8                       | 4                                 | kFlushStyleDedup | 3922
6                       | 3                                 | kFlushStyleMerge | 14059
6                       | 3                                 | kFlushStyleDedup | 5001
4                       | 2                                 | kFlushStyleMerge | 18683
4                       | 2                                 | kFlushStyleDedup | 15394
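
For reference, a minimal sketch (not from the thread) of how those knobs map onto
RocksDB column family options for a dedicated omap column family. The
flush_style/kFlushStyleDedup line is an assumption: that option only exists in the
dedup branch, not in upstream RocksDB.

#include <rocksdb/options.h>

// Sketch only: the memtable knobs exercised in the table above, for an omap column family.
rocksdb::ColumnFamilyOptions MakeOmapOptions() {
  rocksdb::ColumnFamilyOptions opts;
  opts.write_buffer_size = 256 * 1024 * 1024;    // 256 MB memtable, as in the runs above
  opts.max_write_buffer_number = 8;              // keep up to 8 memtables in memory
  opts.min_write_buffer_number_to_merge = 4;     // merge 4 memtables before flushing to L0
  // opts.flush_style = kFlushStyleDedup;        // assumption: option name from the dedup package
  return opts;
}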

Is this only omap data or all data?  It looks like 6/3 or 8/4 is still
probably the optimal point (and the improvements are quite noticeable!).
Sadly we were hoping we might be able to get away with smaller memtables (say
64MB) with kFlushStyleDedup.  It looks like that might not be the case unless
we increase the number of memtables to merge very high.

Sage, is this going to be even worse if we try to keep more pglog entries
around on flash OSD backends?

I think there are three or more factors at play here:

1- If we reduce the memtable size, the CPU cost of insertion (baseline)
and the dedup cost will go down.

2- If we switch to a small min pg log entry count, then most pg log keys
*will* fall into the smaller window (of small memtables * small
min_write_buffer_number_to_merge).  The dup op keys probably won't, though...
except maybe they will, because the values are small and more of them will
fit into the memtables.  But then

3- If we have more keys and smaller values, then the CPU overhead will be
higher again.

For PG logs, I didn't really expect that the dedup style would help; I was
only thinking about the deferred keys.  I wonder if it would make sense to
specify a handful of key prefixes to attempt dedup on, and not bother on
the others?
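
To make that idea concrete, here is a purely illustrative sketch of a prefix
whitelist for the dedup check. The helper name, the hook into the flush path, and
the "L" prefix (assumed to be BlueStore's deferred-write keys) are all assumptions,
not code from the dedup branch.

#include <set>
#include <string>

// Illustrative only: limit the dedup comparison during flush to keys whose
// one-byte prefix is in a whitelist (e.g. deferred writes), and skip the rest.
static const std::set<std::string> kDedupPrefixes = {"L"};  // "L" assumed: BlueStore deferred keys

bool ShouldAttemptDedup(const std::string& key) {
  if (key.empty())
    return false;
  return kDedupPrefixes.count(key.substr(0, 1)) > 0;
}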

Deferred keys seem to be a much smaller part of the problem right now than pglog, at least based on what I'm seeing at the moment with NVMe testing. Regarding dedup, though, it's probably worth testing at the very least.


Also, there is the question of where the CPU time is spent.

Indeed, but if we can reduce the memtable size it means we save CPU in other areas. Like you say below, it's complicated.

1- Big memtables means we spend more time in submit_transaction, called by
the kv_sync_thread, which is a bottleneck.

At least on NVMe we see it pretty regularly in the wallclock traces. I need to retest with Radoslav and Adam's hugepages PR to get a feel for how bad it is after that.


2- Higher dedup style flush CPU usage is spent in the compaction thread(s)
(I think?), which are asynchronous.

L0 compaction is single-threaded though, so we must be careful...
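
For what it's worth, a hedged sketch of the stock knobs for spreading that work
out (values are placeholders, not tuned recommendations); since L0 files overlap,
a single L0->L1 compaction can only be parallelized via max_subcompactions:

#include <rocksdb/options.h>

// Sketch only: background work settings; the numbers are illustrative.
rocksdb::Options MakeDbOptions() {
  rocksdb::Options opts;
  opts.max_background_jobs = 4;   // shared thread pool for flushes and compactions
  opts.max_subcompactions = 2;    // lets one large (e.g. L0->L1) compaction be split into key ranges
  return opts;
}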


At the end of the day I think we need to use less CPU total, so the
optimization of the above factors is a bit complicated.  OTOH if the goal
is IOPS at whatever cost it'll probably mean a slightly different choice.

I guess we should consider the trends. Lots of cores, lots of flash cells. How do we balance high throughput and low latency?


I would *expect* that if we go from, say, 256MB tables to 64MB tables and
dedup of <= 4 of them, then we'll see a modest net reduction of total CPU
*and* a shift to the compaction threads.

Based on Lisa's test results, it seems like that's too short-lived? Maybe I'm not understanding what you mean?


And changing the pg log min entries will counterintuitively increase the
costs of insertion and dedup flush because more keys will fit in the same
amount of memtable... but if we reduce the memtable size at the same time
we might get a win there too?  Maybe?

There's too much variability here to theorycraft it, and your "maybe" confirms that for me. ;) We need to get a better handle on what's going on.


Lisa, do you think limiting the dedup check during flush to specific
prefixes would make sense as a general capability?  If so, we could target
this *just* at the high-value keys (e.g., deferred writes) and avoid
incurring very much additional overhead for the key ranges that aren't
sure bets.

At least in my testing deferred writes during rbd 4k random writes are almost negligible:

http://pad.ceph.com/p/performance_weekly

I suspect it's all going to be about OMAP. We need a really big WAL that can keep OMAP around for a long time while quickly flushing object data into small memtables. On disk it's a big deal that this gets laid out sequentially, but on flash I'm wondering if we'd be better off with a separate WAL for OMAP (a different rocksdb shard or a different data store entirely).
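
A rough sketch (names and sizes assumed) of the "separate rocksdb shard for omap"
direction: a dedicated column family gets its own memtables and flush schedule, so
omap can sit in a large memtable while object data flushes quickly from a small
one. Note that column families share a single WAL, so a truly separate WAL for
OMAP would need a second rocksdb instance or a different store entirely.

#include <cassert>
#include <string>
#include <vector>
#include <rocksdb/db.h>

// Sketch only: open a DB with a small default CF and a large "omap" CF.
rocksdb::DB* OpenWithOmapCF(const std::string& path,
                            std::vector<rocksdb::ColumnFamilyHandle*>* handles) {
  rocksdb::ColumnFamilyOptions small_cf;   // object data etc.
  small_cf.write_buffer_size = 64 << 20;   // 64 MB, flushes quickly
  rocksdb::ColumnFamilyOptions omap_cf;    // omap keys
  omap_cf.write_buffer_size = 256 << 20;   // 256 MB, kept around longer

  std::vector<rocksdb::ColumnFamilyDescriptor> cfs = {
      {rocksdb::kDefaultColumnFamilyName, small_cf},
      {"omap", omap_cf},
  };
  rocksdb::DBOptions db_opts;
  db_opts.create_if_missing = true;
  db_opts.create_missing_column_families = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(db_opts, path, cfs, handles, &db);
  assert(s.ok());
  return db;
}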

Mark


sage


The above KV operation sequences come from 30 minutes of 4k random writes.
Overall, the RocksDB dedup package can decrease the data written into L0
SSTs, but it needs more memtable comparisons. In my opinion, whether to use
dedup depends on the configuration of the OSD host: whether the disk or the
CPU is the busier resource.

Do you have any insight into how much CPU overhead it adds?


Best wishes
Lisa


