Re: storing pg logs outside of rocksdb

On Tue, Apr 3, 2018 at 12:50 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> Indeed.  There was a moderate performance improvement (maybe 10-20%) but a
> dramatic reduction in CPU overhead.  Note however that bluestore/rocksdb
> will likely show different bottlenecks and performance implications than
> petstore did.
May I ask what petstore is?

>
> Mark
>
>
>
> On 04/02/2018 11:03 PM, Varada Kari (System Engineer) wrote:
>>
>> I think Mark tested with MemStore. It should be in one of the
>> performance meeting notes, with the results and a link. Please search
>> for PetStore.
>>
>> Varada
>>
>> On Tue, Apr 3, 2018 at 9:15 AM, Li Wang <laurence.liwang@xxxxxxxxx> wrote:
>>>
>>> Hi,
>>>    Before we move forward, could someone run a test in which the
>>> pglog is not written into rocksdb at all, to see how much performance
>>> improvement we get as the upper bound? It should be less than that of
>>> turning on bluestore_debug_omit_kv_commit
>>>
>>> Cheers,
>>> Li Wang
>>>
>>> 2018-04-02 13:29 GMT+08:00 xiaoyan li <wisher2003@xxxxxxxxx>:
>>>>
>>>> Hi all,
>>>>
>>>> Based on your discussion about pglog above, I have the following rough
>>>> design. Please share your suggestions.
>>>>
>>>> There will be three partitions: a raw partition for client IOs, BlueFS
>>>> for RocksDB, and a pglog partition.
>>>> The former two partitions are the same as today. The pglog partition is
>>>> split into 1 MB blocks, and we allocate blocks for a ring buffer per pg.
>>>> We will keep the following data:
>>>>
>>>> Allocation bitmap (just in memory)
>>>>
>>>> The pglog partition has a bitmap recording which blocks are
>>>> allocated. We can rebuild it from pg->allocated_blocks_list at startup,
>>>> so there is no need to persist it on disk. But we will store basic
>>>> information about the pglog partition, such as block size and block
>>>> count, in RocksDB when the objectstore is initialized.
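>>>>
>>>> For example, the persisted metadata could be a tiny struct like this
>>>> (hypothetical name and fields, just to make the idea concrete):
>>>>
>>>>   #include <cstdint>
>>>>
>>>>   // Written to RocksDB once, when the objectstore is initialized.
>>>>   struct pglog_partition_meta_t {
>>>>     uint64_t block_size;   // e.g. 1 MB per block
>>>>     uint64_t num_blocks;   // pglog partition size / block_size
>>>>   };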
>>>>
>>>> Pg->allocated_blocks_list
>>>>
>>>> When a pg is created and IO starts, we allocate a block for the pg.
>>>> Each pglog entry is less than 300 bytes, so a 1 MB block can store
>>>> about 3495 entries. When the pg's total log entries exceed that
>>>> number, we add a new block to the pg.
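>>>>
>>>> The capacity math, for concreteness (assuming the 300-byte bound above):
>>>>
>>>>   constexpr uint64_t BLOCK_SIZE = 1 << 20;  // 1 MB block
>>>>   constexpr uint64_t MAX_ENTRY = 300;       // upper bound per entry
>>>>   // 1048576 / 300 = 3495 entries per block
>>>>   constexpr uint64_t ENTRIES_PER_BLOCK = BLOCK_SIZE / MAX_ENTRY;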
>>>>
>>>> Pg->start_position
>>>>
>>>> Record the oldest valid entry per pg.
>>>>
>>>> Pg->next_position
>>>>
>>>> Record the position of the next entry to append for each pg. This
>>>> value is updated frequently, but RocksDB suits this IO pattern well,
>>>> and most of the updates will be merged during compaction.
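>>>>
>>>> Together, the two cursors could be kept as a small per-pg record, for
>>>> example (a hypothetical sketch, not existing code):
>>>>
>>>>   // Persisted in RocksDB under a per-pg key; only a few bytes,
>>>>   // so frequent updates stay cheap and merge well.
>>>>   struct pg_log_cursor_t {
>>>>     uint64_t start_position;  // oldest valid entry (advanced on trim)
>>>>     uint64_t next_position;   // where the next entry is appended
>>>>   };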
>>>>
>>>> Updated BlueStore write process:
>>>>
>>>> When writing data to disk (before the metadata update), we can append
>>>> the pglog entry to the pg's ring buffer in parallel.
>>>> After that, we submit the ring buffer changes such as pg->next_position,
>>>> together with the other current metadata changes, to RocksDB.
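>>>>
>>>> A sketch of that flow (the names below are invented for illustration;
>>>> the real BlueStore plumbing would differ):
>>>>
>>>>   // 1. Encode the entry and issue both writes in parallel.
>>>>   bufferlist ebl;
>>>>   encode(log_entry, ebl);                            // < 300 bytes
>>>>   pglog_dev->aio_append(cursor.next_position, ebl);  // ring append
>>>>   data_dev->aio_write(data_offset, data_bl);         // client data
>>>>   // 2. After both IOs complete, commit the cursor advance together
>>>>   //    with the other metadata in one RocksDB transaction.
>>>>   cursor.next_position += ebl.length();
>>>>   txn->set(PREFIX_PGLOG, cursor_key(pgid), encode_cursor(cursor));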
>>>>
>>>>
>>>> On Fri, Mar 30, 2018 at 6:23 PM, Varada Kari <varada.kari@xxxxxxxxx>
>>>> wrote:
>>>>>
>>>>> On Fri, Mar 30, 2018 at 1:01 PM, Li Wang <laurence.liwang@xxxxxxxxx>
>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>    If we want to store the pg log in a standalone ring buffer,
>>>>>> another candidate is the deferred write path: why not use the ring
>>>>>> buffer as the journal for 4K random writes? It should be much more
>>>>>> lightweight than rocksdb
>>>>>>
>>>>> That would be similar to the FileStore implementation for small
>>>>> writes, which comes with the same alignment issues and the resulting
>>>>> write amplification. RocksDB abstracts that away nicely, and the data
>>>>> often never makes it to the L0 files because of the WAL handling.
>>>>>
>>>>> Varada
>>>>>>
>>>>>> Cheers,
>>>>>> Li Wang
>>>>>>
>>>>>>
>>>>>> 2018-03-30 4:04 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
>>>>>>>
>>>>>>> On Wed, 28 Mar 2018, Matt Benjamin wrote:
>>>>>>>>
>>>>>>>> On Wed, Mar 28, 2018 at 1:44 PM, Mark Nelson <mnelson@xxxxxxxxxx>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On 03/28/2018 12:21 PM, Adam C. Emerson wrote:
>>>>>>>>>
>>>>>>>>> 2) It sure feels like conceptually the pglog should be represented
>>>>>>>>> as a
>>>>>>>>> per-pg ring buffer rather than key/value data.  Maybe there are
>>>>>>>>> really
>>>>>>>>> important reasons that it shouldn't be, but I don't currently see
>>>>>>>>> them.  As
>>>>>>>>> far as the objectstore is concerned, it seems to me like there are
>>>>>>>>> valid
>>>>>>>>> reasons to provide some kind of log interface and perhaps that
>>>>>>>>> should be
>>>>>>>>> used for pg_log.  That sort of opens the door for different object
>>>>>>>>> store
>>>>>>>>> implementations fulfilling that functionality in whatever ways the
>>>>>>>>> author
>>>>>>>>> deems fit.
>>>>>>>>
>>>>>>>> In the reddit lingo, pretty much this.  We should be concentrating
>>>>>>>> on
>>>>>>>> this direction, or ruling it out.
>>>>>>>
>>>>>>> Yeah, +1
>>>>>>>
>>>>>>> It seems like step 1 is a proof of concept branch that encodes
>>>>>>> pg_log_entry_t's and writes them to a simple ring buffer.  The first
>>>>>>> questions to answer are (a) whether this does in fact improve things
>>>>>>> significantly and (b) whether we want to have an independent ring
>>>>>>> buffer for each PG or try to mix them into one big one for the whole
>>>>>>> OSD (or maybe per shard).
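>>>>>>>
>>>>>>> (For illustration, the PoC append could be as small as the sketch
>>>>>>> below; "ring" is a hypothetical buffer type, not existing code:
>>>>>>>
>>>>>>>   bufferlist bl;
>>>>>>>   encode(entry, bl);   // pg_log_entry_t already has encode/decode
>>>>>>>   ring.append(bl);     // hypothetical fixed-size ring append
>>>>>>>
>>>>>>> plus a tail pointer persisted somewhere.)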
>>>>>>>
>>>>>>> The second question is how that fares on HDDs.  My guess is that the
>>>>>>> current rocksdb strategy is better because it reduces the number of
>>>>>>> IOs
>>>>>>> and the additional data getting compacted (and CPU usage) isn't the
>>>>>>> limiting factor on HDD performance (IOPS are).  (But maybe we'll get
>>>>>>> lucky
>>>>>>> and the new strategy will be best for both HDD and SSD..)
>>>>>>>
>>>>>>> Then we have to modify PGLog to be a complete implementation.  A
>>>>>>> strict
>>>>>>> ring buffer probably won't work because the PG log might not trim and
>>>>>>> because log entries are variable length, so there'll probably need to
>>>>>>> be
>>>>>>> some simple mapping table (vs a trivial start/end ring buffer
>>>>>>> position) to
>>>>>>> deal with that.  We have to trim the log periodically, so every so
>>>>>>> many
>>>>>>> entries we may want to realign with a min_alloc_size boundary.  We
>>>>>>> sometimes have to back up and rewrite divergent portions of the log
>>>>>>> (during
>>>>>>> peering) so we'll need to sort out whether that is a complete
>>>>>>> reencode/rewrite or whether we keep encoded entries in ram
>>>>>>> (individually
>>>>>>> or in chunks), etc etc.
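>>>>>>>
>>>>>>> (Purely illustrative, not a worked-out design: the mapping table
>>>>>>> could be as simple as
>>>>>>>
>>>>>>>   // version -> (offset, length) of each encoded entry, so
>>>>>>>   // variable-length entries and divergent rewrites can be located
>>>>>>>   std::map<eversion_t, std::pair<uint64_t, uint32_t>> index;
>>>>>>>
>>>>>>>   // pad up to a min_alloc_size boundary every so many entries;
>>>>>>>   // assumes min_alloc_size is a power of two
>>>>>>>   uint64_t align_up(uint64_t pos, uint64_t a) {
>>>>>>>     return (pos + a - 1) & ~(a - 1);
>>>>>>>   }
>>>>>>>
>>>>>>> with the index rebuilt from the buffer on startup.)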
>>>>>>>
>>>>>>> sage
>>>>
>>>>
>>>>
>>>> --
>>>> Best wishes
>>>> Lisa
>>>
>
>



-- 
Best wishes
Lisa


