Hi all,

I wrote a PoC that splits the pglog out of RocksDB and stores it in a standalone space on the block device. The changes are in the OSD and in BlueStore:

OSD:
1. Split the pglog entries and pglog info out of the omaps.

BlueStore:
1. Allocate 16M of space in the block device per PG for storing the pglog.
2. For every transaction from the OSD, combine the pglog entries and pglog info and write them into one block. The block size is set to 4k for now.

Currently only the write workflow works. With librbd+fio on a cluster with one OSD (on an Intel Optane 370G) I got the following numbers for 4k random writes; performance improved by 13.87%.

Master:
  write: IOPS=48.3k, BW=189MiB/s (198MB/s)(55.3GiB/300009msec)
    slat (nsec): min=1032, max=1683.2k, avg=4345.13, stdev=3988.69
    clat (msec): min=3, max=123, avg=10.60, stdev= 8.31
     lat (msec): min=3, max=123, avg=10.60, stdev= 8.31

Pgsplit branch:
  write: IOPS=55.0k, BW=215MiB/s (225MB/s)(62.0GiB/300010msec)
    slat (nsec): min=1068, max=1339.7k, avg=4360.58, stdev=3878.47
    clat (msec): min=2, max=120, avg= 9.30, stdev= 6.92
     lat (msec): min=2, max=120, avg= 9.31, stdev= 6.92

Here is the PoC: https://github.com/lixiaoy1/ceph/commits/pglog-split-fastinfo

The problem is that for every transaction I use a whole 4k block to save the pglog entries and pglog info, which together are only 130 + 920 = 1050 bytes, so most of each block is wasted. Any suggestions?
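To make the per-transaction write concrete, here is a minimal sketch of what I described above: combine the encoded pglog entries and pglog info, pad them to one 4k block, and append the block to the PG's 16M region. The names (PgLogRegion, append) and the length-header layout are illustrative only, not the actual code in the PoC branch.

  // Illustrative sketch only (not the PoC code): pack the encoded pglog
  // entries and pglog info for one transaction into a single 4k block and
  // append it to the PG's reserved 16M region on the block device.
  #include <cstdint>
  #include <cstring>
  #include <string>
  #include <vector>
  #include <unistd.h>   // pwrite

  constexpr uint64_t BLOCK_SIZE = 4096;              // one block per transaction
  constexpr uint64_t PG_REGION  = 16 * 1024 * 1024;  // 16M reserved per PG

  struct PgLogRegion {
    int      fd;              // block device fd (assumed already open)
    uint64_t base;            // byte offset of this PG's 16M region
    uint64_t next_block = 0;  // next free block index within the region

    // Combine the encoded pglog entries and pglog info (together only about
    // 1050 bytes per transaction here), pad to a full block, and write it
    // at the next block offset.
    bool append(const std::string& encoded_entries,
                const std::string& encoded_info) {
      std::vector<char> block(BLOCK_SIZE, 0);
      uint32_t elen = encoded_entries.size();
      uint32_t ilen = encoded_info.size();
      if (8 + elen + ilen > BLOCK_SIZE)
        return false;                       // would not fit in one block
      std::memcpy(block.data(), &elen, 4);
      std::memcpy(block.data() + 4, &ilen, 4);
      std::memcpy(block.data() + 8, encoded_entries.data(), elen);
      std::memcpy(block.data() + 8 + elen, encoded_info.data(), ilen);

      uint64_t off = base + next_block * BLOCK_SIZE;
      if (off + BLOCK_SIZE > base + PG_REGION)
        return false;                       // region full; trimming/reuse not shown
      if (pwrite(fd, block.data(), BLOCK_SIZE, off) != (ssize_t)BLOCK_SIZE)
        return false;
      ++next_block;
      return true;
    }
  };

With this layout every append consumes a full 4k block no matter how small the payload is, which is exactly where the wasted space comes from.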
Best wishes
Lisa

On Thu, Apr 5, 2018 at 12:09 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
> On 04/03/2018 09:36 PM, xiaoyan li wrote:
>>
>> On Tue, Apr 3, 2018 at 11:15 PM, Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote:
>>>
>>> On 04/03/2018 09:56 AM, Mark Nelson wrote:
>>>>
>>>> On 04/03/2018 08:27 AM, Sage Weil wrote:
>>>>>
>>>>> On Tue, 3 Apr 2018, Li Wang wrote:
>>>>>>
>>>>>> Hi,
>>>>>> Before we move forward, could someone run a test where the pglog is not written into rocksdb at all, to see how much the performance improves as an upper bound? It should be less than turning on bluestore_debug_omit_kv_commit.
>>>>>
>>>>> +1
>>>>>
>>>>> (The PetStore behavior doesn't tell us anything about how BlueStore will behave without the pglog overhead.)
>>>>>
>>>>> sage
>>>>
>>>> We do have some testing of bluestore's behavior, though it's about 6 months old now:
>>>>
>>>> - ~1 hour 4K random overwrites to RBD on 1 NVMe OSD
>>>> - 128 PGs
>>>> - stats are sloppy since they only appear every ~10 mins
>>>>
>>>> - default min_pg_log_entries = 1500, trim = default, iops = 26.6K
>>>>   - Default CF - Size: 65.63MB, KeyIn: 22M, KeyDrop: 17M, Flush: 7.858GB
>>>>   - [M] CF - Size: 118.09MB, KeyIn: 302M, KeyDrop: 269M, Flush: 15.847GB <-- with this workload this is pg log and dup op kv entries
>>>>   - [L] CF - Size: 1.00MB, KeyIn: 181K, KeyDrop: 80K, Flush: 0.320GB <-- deferred writes
>>>>
>>>> - min_pg_log_entries = 10, trim = 10, iops = 24.2K
>>>>   - Default CF - Size: 23.15MB, KeyIn: 21M, KeyDrop: 16M, Flush: 7.538GB
>>>>   - [M] CF - Size: 60.89MB, KeyIn: 277M, KeyDrop: 250M, Flush: 8.884GB <-- with this workload this is pg log and dup op kv entries
>>>>   - [L] CF - Size: 1.12MB, KeyIn: 188K, KeyDrop: 83K, Flush: 0.331GB <-- deferred writes
>>>>
>>>> - min_pg_log_entries = 1, trim = 1, iops = 23.8K
>>>>   - Default CF - Size: 68.58MB, KeyIn: 22M, KeyDrop: 17M, Flush: 7.936GB
>>>>   - [M] CF - Size: 96.39MB, KeyIn: 302M, KeyDrop: 262M, Flush: 9.289GB <-- with this workload this is pg log and dup op kv entries
>>>>   - [L] CF - Size: 1.04MB, KeyIn: 209K, KeyDrop: 92K, Flush: 0.368GB <-- deferred writes
>>>>
>>>> - min_pg_log_entries = 3000, trim = 1, iops = 25.8K
>>>>
>>>> The actual performance variation here I think is much less important than the KeyIn behavior. The NVMe devices in these tests are fast enough to absorb a fair amount of overhead.
>>>
>>> Ugh, sorry. That will teach me to talk in meeting and paste at the same time. Those were the wrong stats.
>>> Here are the right ones:
>>>
>>>> - ~1 hour 4K random overwrites to RBD on 1 NVMe OSD
>>>> - 128 PGs
>>>> - stats are sloppy since they only appear every ~10 mins
>>>>
>>>> - min_pg_log_entries = 3000, trim = default, pginfo hack, iops = 27.8K
>>>>   - Default CF - Size: 23.15MB, KeyIn: 24M, KeyDrop: 19M, Flush: 8.662GB
>>>>   - [M] CF - Size: 159.97MB, KeyIn: 162M, KeyDrop: 139M, Flush: 10.335GB <-- with this workload this is pg log and dup op kv entries
>>>>   - [L] CF - Size: 1.39MB, KeyIn: 201K, KeyDrop: 89K, Flush: 0.355GB <-- deferred writes
>>>>
>>>> - min_pg_log_entries = 3000, trim = default, iops = 28.3K
>>>>   - Default CF - Size: 23.13MB, KeyIn: 25M, KeyDrop: 19M, Flush: 8.762GB
>>>>   - [M] CF - Size: 159.97MB, KeyIn: 202M, KeyDrop: 175M, Flush: 16.890GB <-- with this workload this is pg log and dup op kv entries
>>>>   - [L] CF - Size: 0.86MB, KeyIn: 201K, KeyDrop: 89K, Flush: 0.355GB <-- deferred writes
>>>>
>>>> - default min_pg_log_entries = 1500, trim = default, iops = 26.6K
>>>>   - Default CF - Size: 65.63MB, KeyIn: 22M, KeyDrop: 17M, Flush: 7.858GB
>>>>   - [M] CF - Size: 118.09MB, KeyIn: 302M, KeyDrop: 269M, Flush: 15.847GB <-- with this workload this is pg log and dup op kv entries
>>>>   - [L] CF - Size: 1.00MB, KeyIn: 181K, KeyDrop: 80K, Flush: 0.320GB <-- deferred writes
>>>>
>>>> - min_pg_log_entries = 10, trim = 10, iops = 24.2K
>>>>   - Default CF - Size: 23.15MB, KeyIn: 21M, KeyDrop: 16M, Flush: 7.538GB
>>>>   - [M] CF - Size: 60.89MB, KeyIn: 277M, KeyDrop: 250M, Flush: 8.884GB <-- with this workload this is pg log and dup op kv entries
>>>>   - [L] CF - Size: 1.12MB, KeyIn: 188K, KeyDrop: 83K, Flush: 0.331GB <-- deferred writes
>>>>
>>>> - min_pg_log_entries = 1, trim = 1, iops = 23.8K
>>>>   - Default CF - Size: 68.58MB, KeyIn: 22M, KeyDrop: 17M, Flush: 7.936GB
>>>>   - [M] CF - Size: 96.39MB, KeyIn: 302M, KeyDrop: 262M, Flush: 9.289GB <-- with this workload this is pg log and dup op kv entries
>>>>   - [L] CF - Size: 1.04MB, KeyIn: 209K, KeyDrop: 92K, Flush: 0.368GB <-- deferred writes
>>>>
>>>> - min_pg_log_entries = 3000, trim = 1, iops = 25.8K
>>
>> Hi Mark, do you extract the above results from the compaction stats in the RocksDB LOG?
>
> Correct, except for the IOPS numbers, which were from the client benchmark.
>
>> ** Compaction Stats [default] **
>> Level  Files    Size    Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
>> ----------------------------------------------------------------------------------------------------------------------------------------------------
>>   L0    6/0   270.47 MB  1.1      0.0    0.0      0.0       0.2      0.2       0.0   1.0      0.0    154.3         1         4    0.329     0       0
>>   L1    3/0   190.94 MB  0.7      0.0    0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000     0       0
>>  Sum    9/0   461.40 MB  0.0      0.0    0.0      0.0       0.2      0.2       0.0   1.0      0.0    154.3         1         4    0.329     0       0
>>  Int    0/0     0.00 KB  0.0      0.0    0.0      0.0       0.2      0.2       0.0   1.0      0.0    154.3         1         4    0.329     0       0
>> Uptime(secs): 9.9 total, 9.9 interval
>> Flush(GB): cumulative 0.198, interval 0.198
>>
>>> Note specifically how the KeyIn rate drops with min_pg_log_entries increased (i.e. disabling dup_ops) and with pginfo hacked out. I suspect that commenting out log_operation would reduce the KeyIn rate significantly further.
>>> Again, these drives can absorb a lot of this, so the improvement in iops is fairly modest (and setting min_pg_log_entries low actually hurts!), but this isn't just about performance, it's about the behavior that we invoke. The Petstore results absolutely show us that on very fast storage we see a dramatic CPU usage reduction by removing log_operation and pginfo, so I think we should focus on what kind of behavior we want pglog/pginfo/dup_ops to invoke.
>>>
>>> Mark
>>>
>>>>>> Cheers,
>>>>>> Li Wang
>>>>>>
>>>>>> 2018-04-02 13:29 GMT+08:00 xiaoyan li <wisher2003@xxxxxxxxx>:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Based on your discussion about pglog above, I have the following rough design. Please give your suggestions.
>>>>>>>
>>>>>>> There will be three partitions: a raw partition for customer IOs, BlueFS for RocksDB, and a pglog partition.
>>>>>>> The former two partitions are the same as today. The pglog partition is split into 1M blocks, and we allocate blocks for a ring buffer per PG.
>>>>>>> We will have the following data:
>>>>>>>
>>>>>>> Allocation bitmap (in memory only)
>>>>>>>
>>>>>>> The pglog partition has a bitmap to record which blocks are allocated. We can rebuild it from pg->allocated_blocks_list when starting, so there is no need to store it on persistent disk. But we will store basic information about the pglog partition (block size, block count, etc.) in RocksDB when the objectstore is initialized.
>>>>>>>
>>>>>>> Pg->allocated_blocks_list
>>>>>>>
>>>>>>> When a PG is created and IOs start, we allocate a block for it. Every pglog entry is less than 300 bytes, so a 1M block can store about 3495 entries. When the total number of pglog entries exceeds that, we add a new block to the PG.
>>>>>>>
>>>>>>> Pg->start_position
>>>>>>>
>>>>>>> Records the oldest valid entry per PG.
>>>>>>>
>>>>>>> Pg->next_position
>>>>>>>
>>>>>>> Records the next entry to add per PG. This data is updated frequently, but RocksDB is well suited to that IO pattern, and most of the updates will be merged.
>>>>>>>
>>>>>>> Updated BlueStore write process:
>>>>>>>
>>>>>>> When writing data to disk (before the metadata update), we append the pglog entry to the PG's ring buffer in parallel.
>>>>>>> After that, we submit the ring buffer changes (like pg->next_position) together with the other metadata changes to RocksDB.
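A minimal sketch of the bookkeeping described in the quoted design above, i.e. the in-memory allocation bitmap, the per-PG block list, and the start/next positions. The type and member names (PgLogAllocator, PgLogRing) are made up for illustration and are not Ceph code.

  #include <cstdint>
  #include <vector>

  constexpr uint64_t PGLOG_BLOCK = 1024 * 1024;   // the pglog partition is split into 1M blocks

  // In-memory allocation bitmap for the pglog partition; rebuilt at startup
  // from every PG's allocated blocks list and never persisted itself.
  struct PgLogAllocator {
    std::vector<bool> used;   // one flag per 1M block
    explicit PgLogAllocator(uint64_t nblocks) : used(nblocks, false) {}

    int64_t alloc() {
      for (uint64_t i = 0; i < used.size(); ++i)
        if (!used[i]) { used[i] = true; return (int64_t)i; }
      return -1;              // partition full
    }
    void release(uint64_t b) { used[b] = false; }
  };

  // Per-PG ring buffer state: the positions are what would be persisted
  // (e.g. in RocksDB), while the entries themselves live in the blocks.
  struct PgLogRing {
    std::vector<uint64_t> blocks;   // allocated 1M block ids, in order
    uint64_t start_position = 0;    // oldest valid entry (advanced by trimming)
    uint64_t next_position  = 0;    // where the next entry will be appended

    // Entries are under ~300 bytes, so one 1M block holds roughly 3495 of
    // them; grow the ring by one block when the next append would not fit.
    void reserve_for_append(PgLogAllocator& alloc, uint64_t entry_len) {
      uint64_t capacity = blocks.size() * PGLOG_BLOCK;
      if (next_position + entry_len > capacity) {
        int64_t b = alloc.alloc();
        if (b >= 0) blocks.push_back((uint64_t)b);
      }
    }
  };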
>>>>>>>
>>>>>>> On Fri, Mar 30, 2018 at 6:23 PM, Varada Kari <varada.kari@xxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> On Fri, Mar 30, 2018 at 1:01 PM, Li Wang <laurence.liwang@xxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> If we want to store the pg log in a standalone ring buffer, another candidate is the deferred write path: why not use the ring buffer as the journal for 4K random writes? It should be much more lightweight than RocksDB.
>>>>>>>>>
>>>>>>>> That would be similar to the FileStore implementation for small writes, which comes with the same alignment issues and the resulting write amplification. RocksDB abstracts that away nicely, and we avoid writing to L0 files because of the WAL handling.
>>>>>>>>
>>>>>>>> Varada
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Li Wang
>>>>>>>>>
>>>>>>>>> 2018-03-30 4:04 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
>>>>>>>>>>
>>>>>>>>>> On Wed, 28 Mar 2018, Matt Benjamin wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 28, 2018 at 1:44 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 03/28/2018 12:21 PM, Adam C. Emerson wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> 2) It sure feels like conceptually the pglog should be represented as a per-pg ring buffer rather than key/value data. Maybe there are really important reasons that it shouldn't be, but I don't currently see them. As far as the objectstore is concerned, it seems to me like there are valid reasons to provide some kind of log interface, and perhaps that should be used for pg_log. That sort of opens the door for different object store implementations fulfilling that functionality in whatever ways the author deems fit.
>>>>>>>>>>>
>>>>>>>>>>> In the reddit lingo, pretty much this. We should be concentrating on this direction, or ruling it out.
>>>>>>>>>>
>>>>>>>>>> Yeah, +1
>>>>>>>>>>
>>>>>>>>>> It seems like step 1 is a proof-of-concept branch that encodes pg_log_entry_t's and writes them to a simple ring buffer. The first questions to answer are (a) whether this does in fact improve things significantly and (b) whether we want to have an independent ring buffer for each PG or try to mix them into one big one for the whole OSD (or maybe per shard).
>>>>>>>>>>
>>>>>>>>>> The second question is how that fares on HDDs. My guess is that the current rocksdb strategy is better there because it reduces the number of IOs, and the additional data getting compacted (and CPU usage) isn't the limiting factor on HDD performance (IOPS are). (But maybe we'll get lucky and the new strategy will be best for both HDD and SSD..)
>>>>>>>>>>
>>>>>>>>>> Then we have to modify PGLog to be a complete implementation. A strict ring buffer probably won't work, because the PG log might not trim and because log entries are variable length, so there'll probably need to be some simple mapping table (vs a trivial start/end ring buffer position) to deal with that. We have to trim the log periodically, so every so many entries we may want to realign with a min_alloc_size boundary. We sometimes have to back up and rewrite divergent portions of the log (during peering), so we'll need to sort out whether that is a complete reencode/rewrite or whether we keep encoded entries in RAM (individually or in chunks), etc.
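A minimal sketch of the "simple mapping table" idea in the quoted message above: an index from log version to (offset, length), so variable-length entries, periodic trimming, and occasional realignment to a min_alloc_size boundary can be handled without relying on a bare start/end ring position. The names are illustrative only, not Ceph code.

  #include <cstdint>
  #include <deque>

  constexpr uint64_t MIN_ALLOC = 4096;   // assumed min_alloc_size used for realignment

  struct EntryRef {
    uint64_t version;   // pg log version of the entry
    uint64_t offset;    // byte offset within the PG's log space
    uint32_t length;    // encoded length of the entry
  };

  struct PgLogIndex {
    std::deque<EntryRef> entries;   // oldest at the front, newest at the back
    uint64_t write_pos = 0;

    // Append: record where the entry landed; optionally round the write
    // position up to a min_alloc_size boundary so a later rewrite of the
    // tail (e.g. divergent entries during peering) starts block-aligned.
    void note_append(uint64_t version, uint32_t len, bool realign) {
      entries.push_back({version, write_pos, len});
      write_pos += len;
      if (realign)
        write_pos = (write_pos + MIN_ALLOC - 1) / MIN_ALLOC * MIN_ALLOC;
    }

    // Trim: drop index records up to and including a version; the space in
    // front of the new oldest entry becomes reclaimable.
    void trim_to(uint64_t version) {
      while (!entries.empty() && entries.front().version <= version)
        entries.pop_front();
    }
  };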
>>>>>>>>>>
>>>>>>>>>> sage
>>>>>>>
>>>>>>> --
>>>>>>> Best wishes
>>>>>>> Lisa

--
Best wishes
Lisa
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html