Re: storing pg logs outside of rocksdb


 



On 04/03/2018 12:37 AM, xiaoyan li wrote:
On Tue, Apr 3, 2018 at 12:50 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
Indeed.  There was a moderate performance improvement (maybe 10-20%) but a
dramatic reduction in CPU overhead.  Note however that bluestore/rocksdb
will likely show different bottlenecks and performance implications than
petstore did.
May I ask what petstore is?

Right now it's basically just memstore with vector based objects and a couple of other modifications. Eventually I want to make it into sort of a plug-and-play objectstore where you can mix and match different data and metadata storage options for testing.

Mark



Mark



On 04/02/2018 11:03 PM, Varada Kari (System Engineer) wrote:

I think Mark tested with MemStore. Should be there in one of the
performance meetings notes with the results and link. Please check for
PetStore.

Varada

On Tue, Apr 3, 2018 at 9:15 AM, Li Wang <laurence.liwang@xxxxxxxxx> wrote:

Hi,
    Before we move forward, could someone run a test in which the pglog
is not written to rocksdb at all, to see how much performance improves
as an upper bound? It should be less than what we get by turning on
bluestore_debug_omit_kv_commit.

Cheers,
Li Wang

2018-04-02 13:29 GMT+08:00 xiaoyan li <wisher2003@xxxxxxxxx>:

Hi all,

Based on the discussion above about pglog, I have the following rough
design. Please give me your suggestions.

There will be three partitions: a raw partition for customer IOs, Bluefs
for Rocksdb, and a pglog partition.
The first two partitions are the same as today. The pglog partition is
split into 1M blocks. We allocate blocks for ring buffers per pg.
We will have the following data:

Allocation bitmap (just in memory)

The pglog partition has a bitmap to record whether each block is
allocated. We can rebuild it from pg->allocated_block_list at startup,
so there is no need to store it on persistent disk. But we will store
basic information about the pglog partition in Rocksdb, like block size,
number of blocks, etc., when the objectstore is initialized.
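
A minimal sketch of the rebuild step, assuming the per-pg block lists
have already been read back from the KV store as a plain dict. The
function names and the first-fit allocation policy are illustrative
assumptions, not part of the proposal:

```python
# Hypothetical sketch: rebuild the in-memory allocation bitmap at startup
# from the per-pg block lists that were persisted in the KV store.

def rebuild_bitmap(num_blocks, pg_block_lists):
    """Return a list of booleans, True where a block is owned by some pg.

    pg_block_lists maps a pg id to the list of block indexes it owns
    (the pg->allocated_block_list from the design above).
    """
    bitmap = [False] * num_blocks
    for pgid, blocks in pg_block_lists.items():
        for b in blocks:
            if not 0 <= b < num_blocks:
                raise ValueError(f"pg {pgid}: block {b} out of range")
            if bitmap[b]:
                raise ValueError(f"pg {pgid}: block {b} allocated twice")
            bitmap[b] = True
    return bitmap


def allocate_block(bitmap):
    """First-fit allocation (an assumed policy): mark and return a free block."""
    for i, used in enumerate(bitmap):
        if not used:
            bitmap[i] = True
            return i
    raise RuntimeError("pglog partition is full")
```

The double-allocation check is cheap at startup and would catch a
corrupted block list before any log entry is overwritten.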

Pg -> allocated_blocks_list

When a pg is created and IOs start, we can allocate one block for every
pg. Every pglog entry is less than 300 bytes, so 1M can store 3495
entries. When the total number of pglog entries exceeds that, we can
add a new block to the pg.
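
The capacity figure can be checked with a few lines of arithmetic. This
is only a sketch: it takes 300 bytes as the per-entry upper bound and
1M as 2^20 bytes, and the helper names are made up for illustration:

```python
BLOCK_SIZE = 1 << 20        # 1M block, as in the proposal
MAX_ENTRY_SIZE = 300        # assumed upper bound on an encoded pglog entry


def entries_per_block(block_size=BLOCK_SIZE, entry_size=MAX_ENTRY_SIZE):
    """Whole entries that fit in one block if each takes entry_size bytes."""
    return block_size // entry_size


def blocks_needed(num_entries, block_size=BLOCK_SIZE, entry_size=MAX_ENTRY_SIZE):
    """Blocks a pg needs for num_entries; every pg keeps at least one block."""
    per_block = entries_per_block(block_size, entry_size)
    return max(1, -(-num_entries // per_block))  # ceiling division


print(entries_per_block())   # 3495
print(blocks_needed(3495))   # 1
print(blocks_needed(3496))   # 2
```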

Pg->start_position

Record the oldest valid entry per pg.

Pg->next_position

Record the next entry to add per pg. This value will be updated
frequently, but Rocksdb suits that IO pattern, and most of the
updates will be merged.
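
To make the two positions concrete, here is a rough sketch of a per-pg
ring over fixed-size slots. It assumes every entry occupies one slot
and that positions are ever-increasing counters mapped to slots by
modulo; the class and method names are hypothetical, and growing the
ring by adding a block is not handled:

```python
class PGRing:
    """Hypothetical per-pg ring over fixed-size slots.

    Positions are monotonically increasing entry counters; the slot used
    is position % capacity. Only start_position and next_position would
    need to be persisted in the KV store.
    """

    def __init__(self, capacity):
        self.capacity = capacity        # total slots across the pg's blocks
        self.start_position = 0         # oldest valid entry
        self.next_position = 0          # where the next entry goes
        self.slots = [None] * capacity  # stands in for the on-disk blocks

    def __len__(self):
        return self.next_position - self.start_position

    def append(self, entry):
        if len(self) == self.capacity:
            raise BufferError("ring full: trim or add a block first")
        self.slots[self.next_position % self.capacity] = entry
        self.next_position += 1

    def trim_to(self, position):
        """Drop every entry older than position."""
        if not self.start_position <= position <= self.next_position:
            raise ValueError("trim position outside the valid range")
        self.start_position = position

    def entries(self):
        """Valid entries, oldest first."""
        return [self.slots[p % self.capacity]
                for p in range(self.start_position, self.next_position)]
```

Trimming only moves start_position, so no log data is rewritten; the
freed slots are reused when next_position wraps around.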

Updated Bluestore write process:

When writing data to disk (before the metadata update), we can append
the pglog entry to its ring buffer in parallel.
After that, we submit the pg ring buffer changes, like
pg->next_position, together with the other metadata changes to Rocksdb.
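
The ordering described here could look roughly like the following. It
is a sketch only: the three callables are stand-ins for the data
write, the ring buffer append and the Rocksdb transaction, and a thread
pool is just one assumed way to express "in parallel":

```python
from concurrent.futures import ThreadPoolExecutor


def submit_write(write_data, append_pglog, commit_kv):
    """Run the data write and the pglog append concurrently, then commit.

    All three callables are hypothetical stand-ins. commit_kv runs only
    after both writes have completed, so the KV store never records a
    next_position for a log entry that has not been written yet.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        data_done = pool.submit(write_data)
        log_done = pool.submit(append_pglog)
        data_done.result()               # re-raises if the data write failed
        next_position = log_done.result()
    commit_kv(next_position)


events = []
kv = {}


def write_data():
    events.append("data")


def append_pglog():
    events.append("pglog")
    return 42  # made-up next_position after the append


def commit_kv(next_position):
    events.append("kv")
    kv["next_position"] = next_position


submit_write(write_data, append_pglog, commit_kv)
```

If either write fails, the exception propagates before commit_kv is
called, so the metadata update is simply not submitted.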


On Fri, Mar 30, 2018 at 6:23 PM, Varada Kari <varada.kari@xxxxxxxxx> wrote:

On Fri, Mar 30, 2018 at 1:01 PM, Li Wang <laurence.liwang@xxxxxxxxx> wrote:

Hi,
    If we wanna store pg log in a standalone ring buffer, another
candidate is the deferred write. Why not use the ring buffer as the
journal for 4K random writes? It should be much more lightweight than
rocksdb.

It will be similar to the FileStore implementation for small writes.
That comes with the same alignment issues and write amplification.
Rocksdb nicely abstracts that and we don't make it to L0 files because
of WAL handling.

Varada

Cheers,
Li Wang


2018-03-30 4:04 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:

On Wed, 28 Mar 2018, Matt Benjamin wrote:

On Wed, Mar 28, 2018 at 1:44 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:

On 03/28/2018 12:21 PM, Adam C. Emerson wrote:

2) It sure feels like conceptually the pglog should be represented as a
per-pg ring buffer rather than key/value data.  Maybe there are really
important reasons that it shouldn't be, but I don't currently see them.
As far as the objectstore is concerned, it seems to me like there are
valid reasons to provide some kind of log interface and perhaps that
should be used for pg_log.  That sort of opens the door for different
object store implementations fulfilling that functionality in whatever
ways the author deems fit.

In the reddit lingo, pretty much this.  We should be concentrating on
this direction, or ruling it out.

Yeah, +1

It seems like step 1 is a proof of concept branch that encodes
pg_log_entry_t's and writes them to a simple ring buffer.  The first
questions to answer are (a) whether this does in fact improve things
significantly and (b) whether we want to have an independent ring
buffer for each PG or try to mix them into one big one for the whole
OSD (or maybe per shard).

The second question is how that fares on HDDs.  My guess is that the
current rocksdb strategy is better because it reduces the number of IOs
and the additional data getting compacted (and CPU usage) isn't the
limiting factor on HDD performance (IOPS are).  (But maybe we'll get
lucky and the new strategy will be best for both HDD and SSD..)

Then we have to modify PGLog to be a complete implementation.  A strict
ring buffer probably won't work because the PG log might not trim and
because log entries are variable length, so there'll probably need to be
some simple mapping table (vs a trivial start/end ring buffer position)
to deal with that.  We have to trim the log periodically, so every so
many entries we may want to realign with a min_alloc_size boundary.  We
sometimes have to back up and rewrite divergent portions of the log
(during peering) so we'll need to sort out whether that is a complete
reencode/rewrite or whether we keep encoded entries in ram (individually
or in chunks), etc etc.
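
One possible shape for the mapping table and the periodic realignment,
as a rough sketch only. The 4096-byte allocation unit, the realign
interval, the length-prefixed framing and all names below are
assumptions made for illustration, not anything PGLog does today:

```python
import struct

MIN_ALLOC_SIZE = 4096   # assumed allocation unit
REALIGN_EVERY = 4       # assumed: realign after this many entries


class VarLog:
    """Hypothetical log of variable-length entries with a mapping table."""

    def __init__(self):
        self.buf = bytearray()   # stands in for the on-disk log region
        self.index = {}          # version -> (payload offset, payload length)
        self.count = 0

    def append(self, version, payload):
        # Every REALIGN_EVERY entries, zero-pad up to the next allocation
        # boundary so that a later trim could release whole allocation units.
        if self.count and self.count % REALIGN_EVERY == 0:
            self.buf += bytes(-len(self.buf) % MIN_ALLOC_SIZE)
        self.buf += struct.pack("<I", len(payload))   # 4-byte length prefix
        self.index[version] = (len(self.buf), len(payload))
        self.buf += payload
        self.count += 1

    def read(self, version):
        off, length = self.index[version]
        return bytes(self.buf[off:off + length])

    def trim_before(self, version):
        # Only the mapping table is updated here; reclaiming the space
        # behind the trimmed entries is left out of this sketch.
        for v in [v for v in self.index if v < version]:
            del self.index[v]
```

With the table in place, rewriting a divergent tail would amount to
dropping the affected versions from the index and appending the
replacements, at the cost of keeping the index itself in ram.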

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel"
in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
Best wishes
Lisa








