Re: storing pg logs outside of rocksdb

On Tue, Apr 3, 2018 at 11:15 PM, Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote:
>
>
> On 04/03/2018 09:56 AM, Mark Nelson wrote:
>>
>>
>>
>> On 04/03/2018 08:27 AM, Sage Weil wrote:
>>>
>>> On Tue, 3 Apr 2018, Li Wang wrote:
>>>>
>>>> Hi,
>>>>    Before we move forward, could someone run a test in which the pglog is
>>>> not written into rocksdb at all, to see how much performance improvement
>>>> that gives as an upper bound? It should be less than what we get from
>>>> turning on bluestore_debug_omit_kv_commit.
>>>
>>> +1
>>>
>>> (The PetStore behavior doesn't tell us anything about how BlueStore will
>>> behave without the pglog overhead.)
>>>
>>> sage
>>
>>
>> We do have some testing of BlueStore's behavior, though it's about 6
>> months old now:
>>
>> - ~1 hour 4K random overwrites to RBD on 1 NVMe OSD
>> - 128 PGs
>> - stats are sloppy since they only appear every ~10 mins
>> - default min_pg_log_entries = 1500, trim = default, iops = 26.6K
>>    - Default CF - Size:  65.63MB, KeyIn:  22M, KeyDrop:  17M, Flush:  7.858GB
>>    - [M] CF     - Size: 118.09MB, KeyIn: 302M, KeyDrop: 269M, Flush: 15.847GB <-- with this workload this is pg log and dup op kv entries
>>    - [L] CF     - Size:   1.00MB, KeyIn: 181K, KeyDrop:  80K, Flush:  0.320GB <-- deferred writes
>> - min_pg_log_entries = 10, trim = 10, iops = 24.2K
>>    - Default CF - Size:  23.15MB, KeyIn:  21M, KeyDrop:  16M, Flush:  7.538GB
>>    - [M] CF     - Size:  60.89MB, KeyIn: 277M, KeyDrop: 250M, Flush:  8.884GB <-- with this workload this is pg log and dup op kv entries
>>    - [L] CF     - Size:   1.12MB, KeyIn: 188K, KeyDrop:  83K, Flush:  0.331GB <-- deferred writes
>> - min_pg_log_entries = 1, trim = 1, iops = 23.8K
>>    - Default CF - Size:  68.58MB, KeyIn:  22M, KeyDrop:  17M, Flush:  7.936GB
>>    - [M] CF     - Size:  96.39MB, KeyIn: 302M, KeyDrop: 262M, Flush:  9.289GB <-- with this workload this is pg log and dup op kv entries
>>    - [L] CF     - Size:   1.04MB, KeyIn: 209K, KeyDrop:  92K, Flush:  0.368GB <-- deferred writes
>> - min_pg_log_entries = 3000, trim = 1, iops = 25.8K
>>
>> The actual performance variation here I think is much less important than
>> the KeyIn behavior.  The NVMe devices in these tests are fast enough to
>> absorb a fair amount of overhead.
>
>
> Ugh, sorry.  That will teach me to talk in a meeting and paste at the same
> time.  Those were the wrong stats.  Here are the right ones:
>
>>         - ~1 hour 4K random overwrites to RBD on 1 NVMe OSD
>>         - 128 PGs
>>         - stats are sloppy since they only appear every ~10 mins
>>         - min_pg_log_entries = 3000, trim = default, pginfo hack, iops = 27.8K
>>             - Default CF - Size:  23.15MB, KeyIn:  24M, KeyDrop:  19M, Flush:  8.662GB
>>             - [M] CF     - Size: 159.97MB, KeyIn: 162M, KeyDrop: 139M, Flush: 10.335GB <-- with this workload this is pg log and dup op kv entries
>>             - [L] CF     - Size:   1.39MB, KeyIn: 201K, KeyDrop:  89K, Flush:  0.355GB <-- deferred writes
>>         - min_pg_log_entries = 3000, trim = default, iops = 28.3K
>>             - Default CF - Size:  23.13MB, KeyIn:  25M, KeyDrop:  19M, Flush:  8.762GB
>>             - [M] CF     - Size: 159.97MB, KeyIn: 202M, KeyDrop: 175M, Flush: 16.890GB <-- with this workload this is pg log and dup op kv entries
>>             - [L] CF     - Size:   0.86MB, KeyIn: 201K, KeyDrop:  89K, Flush:  0.355GB <-- deferred writes
>>         - default min_pg_log_entries = 1500, trim = default, iops = 26.6K
>>             - Default CF - Size:  65.63MB, KeyIn:  22M, KeyDrop:  17M, Flush:  7.858GB
>>             - [M] CF     - Size: 118.09MB, KeyIn: 302M, KeyDrop: 269M, Flush: 15.847GB <-- with this workload this is pg log and dup op kv entries
>>             - [L] CF     - Size:   1.00MB, KeyIn: 181K, KeyDrop:  80K, Flush:  0.320GB <-- deferred writes
>>         - min_pg_log_entries = 10, trim = 10, iops = 24.2K
>>             - Default CF - Size:  23.15MB, KeyIn:  21M, KeyDrop:  16M, Flush:  7.538GB
>>             - [M] CF     - Size:  60.89MB, KeyIn: 277M, KeyDrop: 250M, Flush:  8.884GB <-- with this workload this is pg log and dup op kv entries
>>             - [L] CF     - Size:   1.12MB, KeyIn: 188K, KeyDrop:  83K, Flush:  0.331GB <-- deferred writes
>>         - min_pg_log_entries = 1, trim = 1, iops = 23.8K
>>             - Default CF - Size:  68.58MB, KeyIn:  22M, KeyDrop:  17M, Flush:  7.936GB
>>             - [M] CF     - Size:  96.39MB, KeyIn: 302M, KeyDrop: 262M, Flush:  9.289GB <-- with this workload this is pg log and dup op kv entries
>>             - [L] CF     - Size:   1.04MB, KeyIn: 209K, KeyDrop:  92K, Flush:  0.368GB <-- deferred writes
>>         - min_pg_log_entries = 3000, trim = 1, iops = 25.8K
>
Hi Mark, did you extract the above results from the compaction stats in the RocksDB LOG?

** Compaction Stats [default] **
Level    Files   Size       Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
----------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      6/0   270.47 MB    1.1      0.0    0.0      0.0       0.2      0.2       0.0   1.0      0.0    154.3         1         4    0.329     0       0
  L1      3/0   190.94 MB    0.7      0.0    0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000     0       0
 Sum      9/0   461.40 MB    0.0      0.0    0.0      0.0       0.2      0.2       0.0   1.0      0.0    154.3         1         4    0.329     0       0
 Int      0/0     0.00 KB    0.0      0.0    0.0      0.0       0.2      0.2       0.0   1.0      0.0    154.3         1         4    0.329     0       0
Uptime(secs): 9.9 total, 9.9 interval
Flush(GB): cumulative 0.198, interval 0.198

>
> Note specifically how the KeyIn rate drops when min_pg_log_entries is
> increased (i.e., dup_ops are disabled) and pginfo is hacked out.  I suspect
> that commenting out log_operation would reduce the KeyIn rate significantly
> further.  Again, these drives can absorb a lot of this, so the improvement in
> iops is fairly modest (and setting min_pg_log_entries low actually hurts!),
> but this isn't just about performance; it's about the behavior that we
> invoke.  The PetStore results clearly show that on very fast storage we see a
> dramatic CPU usage reduction by removing log_operation and pginfo, so I think
> we should focus on what kind of behavior we want pglog/pginfo/dup_ops to
> invoke.
>
> Mark
>
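For reference, the knobs varied in the runs above correspond to the stock OSD pg log
options. A minimal ceph.conf sketch of one run, assuming "min_pg_log_entries" maps to
osd_min_pg_log_entries / osd_max_pg_log_entries and "trim" maps to osd_pg_log_trim_min
(the mapping and the values are read off the run labels, not taken from Mark's actual
test configuration):

[osd]
# e.g. the "min_pg_log_entries = 10, trim = 10" run
osd_min_pg_log_entries = 10
osd_max_pg_log_entries = 10
osd_pg_log_trim_min = 10
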
>
>>
>>
>>
>>>
>>>
>>>
>>>
>>>> Cheers,
>>>> Li Wang
>>>>
>>>> 2018-04-02 13:29 GMT+08:00 xiaoyan li <wisher2003@xxxxxxxxx>:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Based on your discussion about pglog above, I have the following rough
>>>>> design. Please give your suggestions.
>>>>>
>>>>> There will be three partitions: a raw partition for customer IOs, BlueFS
>>>>> for RocksDB, and a pglog partition.
>>>>> The former two partitions are the same as today. The pglog partition is
>>>>> split into 1 MB blocks, and we allocate blocks to a ring buffer per pg.
>>>>> We will keep the following data:
>>>>>
>>>>> Allocation bitmap (in memory only)
>>>>>
>>>>> The pglog partition has a bitmap recording which blocks are allocated.
>>>>> We can rebuild it from pg->allocated_blocks_list at startup, so there is
>>>>> no need to persist it on disk. We will, however, store basic information
>>>>> about the pglog partition in RocksDB, such as block size and block count,
>>>>> when the objectstore is initialized.
>>>>>
>>>>> Pg -> allocated_blocks_list
>>>>>
>>>>> When a pg is created and IOs start, we allocate a block for the pg. Every
>>>>> pglog entry is less than 300 bytes, so 1 MB can store about 3495 entries.
>>>>> When the pg's total pglog entries exceed that number, we add another
>>>>> block to the pg.
>>>>>
>>>>> Pg->start_position
>>>>>
>>>>> Records the oldest valid entry per pg.
>>>>>
>>>>> Pg->next_position
>>>>>
>>>>> Records the next entry to add per pg. This data is updated frequently,
>>>>> but RocksDB suits that IO pattern well, and most of the updates will be
>>>>> merged.
>>>>>
>>>>> Updated BlueStore write process:
>>>>>
>>>>> While the data is being written to disk (before the metadata update), we
>>>>> can append the pglog entry to the pg's ring buffer in parallel.
>>>>> After that, we submit the pg ring buffer changes, such as
>>>>> pg->next_position, together with the other metadata changes, to RocksDB.
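A minimal C++ sketch of the bookkeeping described above, assuming 1 MB blocks and
sub-300-byte entries as in the proposal; all type and member names are illustrative
(not from any existing branch), and persistence to the pglog partition and RocksDB
is left out:

#include <cstdint>
#include <vector>

// 1 MB blocks carved out of the pglog partition (per the proposal above).
constexpr uint64_t PGLOG_BLOCK_SIZE = 1ull << 20;

// Stored once in RocksDB when the objectstore is initialized.
struct PGLogPartitionMeta {
  uint64_t block_size = PGLOG_BLOCK_SIZE;
  uint64_t block_count = 0;
};

// A position inside a pg's ring: which allocated block, and where in it.
struct PGLogPosition {
  uint32_t block_index = 0;   // index into PGLogRing::allocated_blocks
  uint32_t block_offset = 0;  // byte offset within that block
};

// Per-pg ring buffer state. start/next and the block list are persisted in
// RocksDB; the allocator bitmap below is rebuilt from these block lists.
struct PGLogRing {
  std::vector<uint64_t> allocated_blocks;  // physical block numbers
  PGLogPosition start;  // oldest valid entry (pg->start_position)
  PGLogPosition next;   // where the next entry goes (pg->next_position)
};

// In-memory only: one bit per block in the pglog partition, rebuilt at
// startup by walking every pg's allocated_blocks.
struct PGLogAllocator {
  std::vector<bool> bitmap;

  explicit PGLogAllocator(uint64_t block_count) : bitmap(block_count, false) {}

  // First-fit allocation; a real allocator would be smarter about locality.
  int64_t allocate() {
    for (uint64_t i = 0; i < bitmap.size(); ++i) {
      if (!bitmap[i]) { bitmap[i] = true; return static_cast<int64_t>(i); }
    }
    return -1;  // pglog partition is full
  }

  void release(uint64_t block) { bitmap[block] = false; }
};
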
>>>>>
>>>>>
>>>>> On Fri, Mar 30, 2018 at 6:23 PM, Varada Kari <varada.kari@xxxxxxxxx>
>>>>> wrote:
>>>>>>
>>>>>> On Fri, Mar 30, 2018 at 1:01 PM, Li Wang <laurence.liwang@xxxxxxxxx>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>    If we want to store the pg log in a standalone ring buffer, another
>>>>>>> candidate is the deferred write path: why not use the ring buffer as the
>>>>>>> journal for 4K random writes? It should be much more lightweight than
>>>>>>> rocksdb.
>>>>>>>
>>>>>> It would be similar to the FileStore implementation for small writes, which
>>>>>> comes with the same alignment issues and the associated write amplification.
>>>>>> RocksDB abstracts that away nicely, and because of the WAL handling the data
>>>>>> often never makes it to the L0 files.
>>>>>>
>>>>>> Varada
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Li Wang
>>>>>>>
>>>>>>>
>>>>>>> 2018-03-30 4:04 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
>>>>>>>>
>>>>>>>> On Wed, 28 Mar 2018, Matt Benjamin wrote:
>>>>>>>>>
>>>>>>>>> On Wed, Mar 28, 2018 at 1:44 PM, Mark Nelson <mnelson@xxxxxxxxxx>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> On 03/28/2018 12:21 PM, Adam C. Emerson wrote:
>>>>>>>>>>
>>>>>>>>>> 2) It sure feels like conceptually the pglog should be represented
>>>>>>>>>> as a
>>>>>>>>>> per-pg ring buffer rather than key/value data.  Maybe there are
>>>>>>>>>> really
>>>>>>>>>> important reasons that it shouldn't be, but I don't currently see
>>>>>>>>>> them.  As
>>>>>>>>>> far as the objectstore is concerned, it seems to me like there are
>>>>>>>>>> valid
>>>>>>>>>> reasons to provide some kind of log interface and perhaps that
>>>>>>>>>> should be
>>>>>>>>>> used for pg_log.  That sort of opens the door for different object
>>>>>>>>>> store
>>>>>>>>>> implementations fulfilling that functionality in whatever ways the
>>>>>>>>>> author
>>>>>>>>>> deems fit.
>>>>>>>>>
>>>>>>>>> In the reddit lingo, pretty much this.  We should be concentrating
>>>>>>>>> on
>>>>>>>>> this direction, or ruling it out.
>>>>>>>>
>>>>>>>> Yeah, +1
>>>>>>>>
>>>>>>>> It seems like step 1 is a proof-of-concept branch that encodes
>>>>>>>> pg_log_entry_t's and writes them to a simple ring buffer.  The first
>>>>>>>> questions to answer are (a) whether this does in fact improve things
>>>>>>>> significantly and (b) whether we want to have an independent ring buffer
>>>>>>>> for each PG or try to mix them into one big one for the whole OSD (or
>>>>>>>> maybe per shard).
>>>>>>>>
>>>>>>>> The second question is how that fares on HDDs.  My guess is that the
>>>>>>>> current rocksdb strategy is better there, because it reduces the number of
>>>>>>>> IOs, and the additional data getting compacted (and the CPU usage) isn't
>>>>>>>> the limiting factor on HDD performance (IOPS are).  (But maybe we'll get
>>>>>>>> lucky and the new strategy will be best for both HDD and SSD.)
>>>>>>>>
>>>>>>>> Then we have to modify PGLog to be a complete implementation.  A strict
>>>>>>>> ring buffer probably won't work, because the PG log might not trim and
>>>>>>>> because log entries are variable length, so there will probably need to be
>>>>>>>> some simple mapping table (vs a trivial start/end ring buffer position) to
>>>>>>>> deal with that.  We have to trim the log periodically, so every so many
>>>>>>>> entries we may want to realign with a min_alloc_size boundary.  We
>>>>>>>> sometimes have to back up and rewrite divergent portions of the log
>>>>>>>> (during peering), so we'll need to sort out whether that is a complete
>>>>>>>> re-encode/rewrite or whether we keep encoded entries in ram (individually
>>>>>>>> or in chunks), etc.
>>>>>>>>
>>>>>>>> sage
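A toy C++ sketch of the proof of concept Sage outlines above: variable-length encoded
entries appended to a per-PG ring, with a small side table mapping versions to offsets
in place of a bare start/end pair. Everything here is invented for illustration (it is
not Ceph code); pg_log_entry_t encoding, min_alloc_size realignment and the on-disk
format are left out:

#include <cstdint>
#include <deque>
#include <string>
#include <vector>

class RingLog {
public:
  explicit RingLog(size_t bytes) : buf_(bytes) {}

  // Append one already-encoded entry. The side table records offset and
  // length, so no framing is written into the ring itself. Returns false
  // if the ring is full and the caller must trim first.
  bool append(uint64_t version, const std::string& encoded) {
    const uint64_t need = encoded.size();
    if (buf_.size() - (tail_ - head_) < need)
      return false;
    write(tail_, encoded.data(), encoded.size());
    index_.push_back({version, tail_, need});
    tail_ += need;
    return true;
  }

  // Periodic PG log trim: drop entries with version <= v, moving head forward.
  void trim_to(uint64_t v) {
    while (!index_.empty() && index_.front().version <= v) {
      head_ = index_.front().offset + index_.front().length;
      index_.pop_front();
    }
  }

  // Divergent-entry handling during peering: discard everything newer than v
  // and let the caller re-append the authoritative entries.
  void rewind_to(uint64_t v) {
    while (!index_.empty() && index_.back().version > v) {
      tail_ = index_.back().offset;
      index_.pop_back();
    }
  }

private:
  struct Entry { uint64_t version, offset, length; };

  // Logical offsets grow monotonically; the physical position wraps around.
  void write(uint64_t logical_off, const char* data, size_t len) {
    for (size_t i = 0; i < len; ++i)
      buf_[(logical_off + i) % buf_.size()] = data[i];
  }

  std::vector<char> buf_;         // the ring, modelled in memory
  uint64_t head_ = 0, tail_ = 0;  // logical byte offsets
  std::deque<Entry> index_;       // mapping table: (version, offset, length)
};
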
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best wishes
>>>>> Lisa
>>>>
>>>>
>>



-- 
Best wishes
Lisa


