Re: storing pg logs outside of rocksdb


 



On Tue, 3 Apr 2018, Li Wang wrote:
> Hi,
>   Before we move forward, could someone run a test in which the pglog is
> not written into rocksdb at all, to see how much performance improves as
> an upper bound? It should be less than the gain from turning on
> bluestore_debug_omit_kv_commit.

+1

(The PetStore behavior doesn't tell us anything about how BlueStore will 
behave without the pglog overhead.)

sage



> 
> Cheers,
> Li Wang
> 
> 2018-04-02 13:29 GMT+08:00 xiaoyan li <wisher2003@xxxxxxxxx>:
> > Hi all,
> >
> > Based on your discussion about pglog above, I have the following rough
> > design. Please share your suggestions.
> >
> > There will be three partitions: a raw partition for customer IOs, BlueFS
> > for RocksDB, and a pglog partition.
> > The former two partitions are the same as they are now. The pglog
> > partition is split into 1 MB blocks, from which we allocate per-pg ring
> > buffers. We will maintain the following data:
> >
> > Allocation bitmap (just in memory)
> >
> > The pglog partition has a bitmap recording which blocks are allocated.
> > We can rebuild it from each pg's allocated_blocks_list at startup, so
> > there is no need to persist it. We will, however, store basic
> > information about the pglog partition in RocksDB, such as the block
> > size and block count, when the objectstore is initialized.
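The startup rebuild described above can be sketched as follows. This is a minimal illustration, not actual BlueStore code; the names `PgId` and `rebuild_bitmap` and the `std::vector<bool>` representation are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using PgId = uint64_t;  // hypothetical PG identifier

// Rebuild the pglog-partition allocation bitmap from each PG's
// allocated_blocks_list at startup; the bitmap itself is never
// persisted, only the per-PG block lists are.
std::vector<bool> rebuild_bitmap(
    const std::map<PgId, std::vector<uint32_t>>& allocated_blocks,
    uint32_t total_blocks)
{
  std::vector<bool> bitmap(total_blocks, false);  // false = free block
  for (const auto& [pg, blocks] : allocated_blocks)
    for (uint32_t b : blocks)
      bitmap[b] = true;                           // true = in use
  return bitmap;
}
```

Because the lists are the source of truth, a crash can never leave the bitmap inconsistent; it is simply recomputed.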
> >
> > Pg -> allocated_blocks_list
> >
> > When a pg is created and IOs start, we allocate a block for it. Every
> > pglog entry is less than 300 bytes, so 1 MB can store 3495 entries.
> > When the total number of pglog entries grows beyond that, we add a new
> > block to the pg.
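The capacity arithmetic above (1 MB blocks, entries bounded at 300 bytes, hence 3495 entries per block) can be captured in a couple of constexpr helpers. The function names are hypothetical; the constants come straight from the proposal.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t BLOCK_SIZE = 1u << 20;  // 1 MB pglog block
constexpr uint64_t MAX_ENTRY_SIZE = 300;   // upper bound per pglog entry

// How many worst-case entries fit in one block: 1048576 / 300 = 3495.
constexpr uint64_t entries_per_block() {
  return BLOCK_SIZE / MAX_ENTRY_SIZE;
}

// How many blocks a PG needs for n log entries (rounded up), i.e. when
// a new block must be allocated to the PG.
constexpr uint64_t blocks_needed(uint64_t n) {
  return (n + entries_per_block() - 1) / entries_per_block();
}
```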
> >
> > Pg->start_position
> >
> > Record the oldest valid entry per pg.
> >
> > Pg->next_position
> >
> > Record the next entry to add per pg. This value is updated frequently,
> > but RocksDB suits this IO pattern well, and most of the updates will be
> > merged.
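The two per-PG cursors just described (oldest valid entry, next append slot) amount to a small ring-cursor structure. A minimal sketch, with hypothetical field and method names:

```cpp
#include <cassert>
#include <cstdint>

// Per-PG ring-buffer cursors; in the proposal these live in RocksDB
// and are updated in the same transaction as other pg metadata.
struct PgLogCursors {
  uint64_t start_position = 0;  // oldest valid entry (trim point)
  uint64_t next_position = 0;   // slot where the next entry is appended

  uint64_t size() const { return next_position - start_position; }

  // Appending advances only next_position.
  void append() { ++next_position; }

  // Trimming drops everything older than `upto` (exclusive).
  void trim(uint64_t upto) {
    if (upto > start_position && upto <= next_position)
      start_position = upto;
  }
};
```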
> >
> > Updated Bluestore write process:
> >
> > When writing data to disk (before the metadata update), we can append
> > the pglog entry to the pg's ring buffer in parallel.
> > After both complete, we submit the ring buffer changes, such as
> > pg->next_position, together with the other metadata changes to RocksDB.
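The ordering in the write path above can be sketched abstractly: data write and pglog append are issued concurrently, and the RocksDB transaction is submitted only after both finish. All names here are hypothetical stand-ins; real BlueStore uses aio completions and a transaction context rather than these direct calls.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records the order of operations so the sequencing is visible.
struct WriteOps {
  std::vector<std::string> log;

  void write_data()         { log.push_back("data"); }   // customer data aio
  void append_pglog_entry() { log.push_back("pglog"); }  // ring-buffer append
  void submit_kv_txn()      { log.push_back("kv"); }     // cursors + metadata
};

// Data write and pglog append can be in flight at the same time; the
// RocksDB transaction (pg->next_position plus other metadata) must be
// submitted only after both have completed.
void do_write(WriteOps& ops) {
  ops.write_data();          // issue data write
  ops.append_pglog_entry();  // issue pglog append in parallel
  // ... wait for both completions ...
  ops.submit_kv_txn();       // then commit metadata to RocksDB
}
```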
> >
> >
> > On Fri, Mar 30, 2018 at 6:23 PM, Varada Kari <varada.kari@xxxxxxxxx> wrote:
> >> On Fri, Mar 30, 2018 at 1:01 PM, Li Wang <laurence.liwang@xxxxxxxxx> wrote:
> >>> Hi,
> >>>   If we want to store the pg log in a standalone ring buffer, another
> >>> candidate is the deferred write path: why not use the ring buffer as the
> >>> journal for 4K random writes? It should be much more lightweight than rocksdb.
> >>>
> >> That would be similar to the FileStore implementation for small writes,
> >> which comes with the same alignment issues and the attendant
> >> write amplification. Rocksdb abstracts that nicely, and small writes
> >> don't make it to L0 files because of the WAL handling.
> >>
> >> Varada
> >>> Cheers,
> >>> Li Wang
> >>>
> >>>
> >>> 2018-03-30 4:04 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
> >>>> On Wed, 28 Mar 2018, Matt Benjamin wrote:
> >>>>> On Wed, Mar 28, 2018 at 1:44 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> >>>>> > On 03/28/2018 12:21 PM, Adam C. Emerson wrote:
> >>>>> >
> >>>>> > 2) It sure feels like conceptually the pglog should be represented as a
> >>>>> > per-pg ring buffer rather than key/value data.  Maybe there are really
> >>>>> > important reasons that it shouldn't be, but I don't currently see them.  As
> >>>>> > far as the objectstore is concerned, it seems to me like there are valid
> >>>>> > reasons to provide some kind of log interface and perhaps that should be
> >>>>> > used for pg_log.  That sort of opens the door for different object store
> >>>>> > implementations fulfilling that functionality in whatever ways the author
> >>>>> > deems fit.
> >>>>>
> >>>>> In the reddit lingo, pretty much this.  We should be concentrating on
> >>>>> this direction, or ruling it out.
> >>>>
> >>>> Yeah, +1
> >>>>
> >>>> It seems like step 1 is a proof-of-concept branch that encodes
> >>>> pg_log_entry_t's and writes them to a simple ring buffer.  The first
> >>>> questions to answer are (a) whether this does in fact improve things
> >>>> significantly and (b) whether we want an independent ring buffer
> >>>> for each PG or to mix them into one big one for the whole OSD (or
> >>>> maybe per shard).
> >>>>
> >>>> The second question is how that fares on HDDs.  My guess is that the
> >>>> current rocksdb strategy is better because it reduces the number of IOs,
> >>>> and the additional data getting compacted (and CPU usage) isn't the
> >>>> limiting factor on HDD performance (IOPS are).  (But maybe we'll get lucky
> >>>> and the new strategy will be best for both HDD and SSD..)
> >>>>
> >>>> Then we have to modify PGLog to be a complete implementation.  A strict
> >>>> ring buffer probably won't work, because the PG log might not trim and
> >>>> because log entries are variable length, so there'll probably need to be
> >>>> some simple mapping table (vs a trivial start/end ring buffer position) to
> >>>> deal with that.  We have to trim the log periodically, so every so many
> >>>> entries we may want to realign with a min_alloc_size boundary.  We
> >>>> sometimes have to back up and rewrite divergent portions of the log (during
> >>>> peering), so we'll need to sort out whether that is a complete
> >>>> re-encode/rewrite or whether we keep encoded entries in ram (individually
> >>>> or in chunks), etc etc.
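The realignment step mentioned above (rounding the next append offset up to a min_alloc_size boundary after a trim, so rewrites stay block-aligned) is a standard power-of-two round-up. A one-liner sketch; the function name is hypothetical and min_alloc_size is assumed to be a power of two:

```cpp
#include <cassert>
#include <cstdint>

// Round `offset` up to the next multiple of `min_alloc_size`
// (min_alloc_size must be a power of two for the mask trick to work).
constexpr uint64_t align_up(uint64_t offset, uint64_t min_alloc_size) {
  return (offset + min_alloc_size - 1) & ~(min_alloc_size - 1);
}
```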
> >>>>
> >>>> sage
> >>>> --
> >>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> >
> > --
> > Best wishes
> > Lisa
> 
> 


