Re: storing pg logs outside of rocksdb

On Wed, Jun 20, 2018 at 1:41 AM xiaoyan li <wisher2003@xxxxxxxxx> wrote:
>
>  Hi all,
> I wrote a POC that splits the pglog out of RocksDB and stores it in
> standalone space on the block device.
> The updates are done in OSD and BlueStore:
>
> OSD parts:
> 1.       Split pglog entries and pglog info from omaps.
> BlueStore:
> 1.       Allocate 16M of space on the block device per PG for storing pglog.
> 2.       For every transaction from the OSD, combine the pglog entries and
> pglog info and write them into a block. The block size is set to 4K at
> this moment.
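>
> As a rough illustration of the per-transaction write path, the block
> packing could look like the sketch below (names and layout are
> illustrative only, not the actual POC code):
>
>     // Pack the encoded pglog entries and pglog info into one padded 4K
>     // block and append it to the PG's 16M slot (illustrative names).
>     #include <cassert>
>     #include <cstdint>
>     #include <cstring>
>     #include <string>
>     #include <vector>
>
>     constexpr uint64_t PG_SLOT_SIZE = 16ull << 20;  // 16M per PG
>     constexpr uint32_t BLOCK_SIZE   = 4096;         // one block per txn
>
>     struct PGLogRegion {
>       uint64_t base_offset;  // start of this PG's slot on the device
>       uint64_t next_block;   // index of the next 4K block to write
>     };
>
>     // Fills block_out and returns the device offset to write it to.
>     uint64_t make_block(PGLogRegion& r,
>                         const std::string& encoded_entries,
>                         const std::string& encoded_info,
>                         std::vector<char>& block_out) {
>       assert(encoded_entries.size() + encoded_info.size() <= BLOCK_SIZE);
>       block_out.assign(BLOCK_SIZE, 0);  // zero-pad to a full 4K block
>       std::memcpy(block_out.data(), encoded_entries.data(),
>                   encoded_entries.size());
>       std::memcpy(block_out.data() + encoded_entries.size(),
>                   encoded_info.data(), encoded_info.size());
>       assert(r.next_block * BLOCK_SIZE < PG_SLOT_SIZE);
>       return r.base_offset + BLOCK_SIZE * r.next_block++;
>     }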
>
> Currently, I have only made the write workflow work.
> With librbd+fio on a cluster with one OSD (on an Intel Optane 370G), I got
> the following results for 4K random writes; performance improved by 13.87%.
>
> Master:
>   write: IOPS=48.3k, BW=189MiB/s (198MB/s)(55.3GiB/300009msec)
>     slat (nsec): min=1032, max=1683.2k, avg=4345.13, stdev=3988.69
>     clat (msec): min=3, max=123, avg=10.60, stdev= 8.31
>      lat (msec): min=3, max=123, avg=10.60, stdev= 8.31
>
> Pgsplit branch:
>   write: IOPS=55.0k, BW=215MiB/s (225MB/s)(62.0GiB/300010msec)
>     slat (nsec): min=1068, max=1339.7k, avg=4360.58, stdev=3878.47
>     clat (msec): min=2, max=120, avg= 9.30, stdev= 6.92
>      lat (msec): min=2, max=120, avg= 9.31, stdev= 6.92
>
> Here is the POC: https://github.com/lixiaoy1/ceph/commits/pglog-split-fastinfo
> The problem is that for every transaction I use a 4K block to save
> the pglog entries and pglog info, which together are only 130 + 920 = 1050
> bytes. This wastes a lot of space.
> Any suggestions?

It's actually worse news than that: with your workload, you're using
1050 bytes per entry. But pglog entries contain the object's name! The
object name, IIRC, can be 4KB on its own (or maybe even unbounded, but
that's the practical limit because only RGW gets close to that and S3
has a 4K limit?). So it's possible a single entry could overflow that
block. :/
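
To make the concern concrete, here's a quick size check (hypothetical
stand-in types and sizes, just to illustrate the overflow):

    // Illustration: an entry carrying a long object name can exceed a
    // fixed 4K block all by itself.
    #include <cassert>
    #include <cstddef>
    #include <string>

    constexpr std::size_t BLOCK_SIZE = 4096;

    struct FakeLogEntry {         // stand-in for pg_log_entry_t
      std::string object_name;    // RGW names can approach 4K alone
      char fixed_fields[130];     // version, op, mtime, ... (~130 bytes)
    };

    std::size_t encoded_size(const FakeLogEntry& e) {
      return sizeof(e.fixed_fields) + e.object_name.size();
    }

    int main() {
      FakeLogEntry e;
      e.object_name.assign(4000, 'x');       // a near-limit S3 key
      assert(encoded_size(e) > BLOCK_SIZE);  // 4130 > 4096: overflow
    }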

*Maybe* we could try to strip the object name out of the entry (it
also has the hash), but we'd have to look very carefully at whether
that's feasible with our other interfaces. (Hashes aren't really
unique, or at least haven't been in the past, given the features
we've had to co-locate objects with the same hash, etc.)
-Greg

>
> Best wishes
> Lisa
>
> On Thu, Apr 5, 2018 at 12:09 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> >
> >
> > On 04/03/2018 09:36 PM, xiaoyan li wrote:
> >>
> >> On Tue, Apr 3, 2018 at 11:15 PM, Mark Nelson <mark.a.nelson@xxxxxxxxx>
> >> wrote:
> >>>
> >>>
> >>> On 04/03/2018 09:56 AM, Mark Nelson wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 04/03/2018 08:27 AM, Sage Weil wrote:
> >>>>>
> >>>>> On Tue, 3 Apr 2018, Li Wang wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>     Before we move forward, could someone run a test in which the
> >>>>>> pglog is not written into RocksDB at all, to see how much the
> >>>>>> performance improves as an upper bound? It should be less than
> >>>>>> turning on bluestore_debug_omit_kv_commit.
> >>>>>
> >>>>> +1
> >>>>>
> >>>>> (The PetStore behavior doesn't tell us anything about how BlueStore
> >>>>> will behave without the pglog overhead.)
> >>>>>
> >>>>> sage
> >>>>
> >>>>
> >>>> We do have some testing of the bluestore's behavior, though it's about 6
> >>>> months old now:
> >>>>
> >>>> - ~1 hour 4K random overwrites to RBD on 1 NVMe OSD
> >>>> - 128 PGs
> >>>> - stats are sloppy since they only appear every ~10 mins
> >>>>
> >>>> - default min_pg_log_entries = 1500, trim = default, iops = 26.6K
> >>>>     - Default CF - Size:  65.63MB, KeyIn:  22M, KeyDrop:  17M, Flush:  7.858GB
> >>>>     - [M] CF     - Size: 118.09MB, KeyIn: 302M, KeyDrop: 269M, Flush: 15.847GB <-- with this workload this is pg log and dup op kv entries
> >>>>     - [L] CF     - Size:   1.00MB, KeyIn: 181K, KeyDrop:  80K, Flush:  0.320GB <-- deferred writes
> >>>>
> >>>> - min_pg_log_entries = 10, trim = 10, iops = 24.2K
> >>>>     - Default CF - Size:  23.15MB, KeyIn:  21M, KeyDrop:  16M, Flush:  7.538GB
> >>>>     - [M] CF     - Size:  60.89MB, KeyIn: 277M, KeyDrop: 250M, Flush:  8.884GB <-- with this workload this is pg log and dup op kv entries
> >>>>     - [L] CF     - Size:   1.12MB, KeyIn: 188K, KeyDrop:  83K, Flush:  0.331GB <-- deferred writes
> >>>>
> >>>> - min_pg_log_entries = 1, trim = 1, iops = 23.8K
> >>>>     - Default CF - Size:  68.58MB, KeyIn:  22M, KeyDrop:  17M, Flush:  7.936GB
> >>>>     - [M] CF     - Size:  96.39MB, KeyIn: 302M, KeyDrop: 262M, Flush:  9.289GB <-- with this workload this is pg log and dup op kv entries
> >>>>     - [L] CF     - Size:   1.04MB, KeyIn: 209K, KeyDrop:  92K, Flush:  0.368GB <-- deferred writes
> >>>>
> >>>> - min_pg_log_entries = 3000, trim = 1, iops = 25.8K
> >>>>
> >>>> The actual performance variation here I think is much less important
> >>>> than the KeyIn behavior.  The NVMe devices in these tests are fast
> >>>> enough to absorb a fair amount of overhead.
> >>>
> >>>
> >>> Ugh, sorry.  That will teach me to talk in a meeting and paste at the
> >>> same time.  Those were the wrong stats.  Here are the right ones:
> >>>
> >>>>          - ~1 hour 4K random overwrites to RBD on 1 NVMe OSD
> >>>>          - 128 PGs
> >>>>          - stats are sloppy since they only appear every ~10 mins
> >>>>          - min_pg_log_entries = 3000, trim = default, pginfo hack, iops = 27.8K
> >>>>              - Default CF - Size:  23.15MB, KeyIn:  24M, KeyDrop:  19M, Flush:  8.662GB
> >>>>              - [M] CF     - Size: 159.97MB, KeyIn: 162M, KeyDrop: 139M, Flush: 10.335GB <-- with this workload this is pg log and dup op kv entries
> >>>>              - [L] CF     - Size:   1.39MB, KeyIn: 201K, KeyDrop:  89K, Flush:  0.355GB <-- deferred writes
> >>>>          - min_pg_log_entries = 3000, trim = default, iops = 28.3K
> >>>>              - Default CF - Size:  23.13MB, KeyIn:  25M, KeyDrop:  19M, Flush:  8.762GB
> >>>>              - [M] CF     - Size: 159.97MB, KeyIn: 202M, KeyDrop: 175M, Flush: 16.890GB <-- with this workload this is pg log and dup op kv entries
> >>>>              - [L] CF     - Size:   0.86MB, KeyIn: 201K, KeyDrop:  89K, Flush:  0.355GB <-- deferred writes
> >>>>          - default min_pg_log_entries = 1500, trim = default, iops = 26.6K
> >>>>              - Default CF - Size:  65.63MB, KeyIn:  22M, KeyDrop:  17M, Flush:  7.858GB
> >>>>              - [M] CF     - Size: 118.09MB, KeyIn: 302M, KeyDrop: 269M, Flush: 15.847GB <-- with this workload this is pg log and dup op kv entries
> >>>>              - [L] CF     - Size:   1.00MB, KeyIn: 181K, KeyDrop:  80K, Flush:  0.320GB <-- deferred writes
> >>>>          - min_pg_log_entries = 10, trim = 10, iops = 24.2K
> >>>>              - Default CF - Size:  23.15MB, KeyIn:  21M, KeyDrop:  16M, Flush:  7.538GB
> >>>>              - [M] CF     - Size:  60.89MB, KeyIn: 277M, KeyDrop: 250M, Flush:  8.884GB <-- with this workload this is pg log and dup op kv entries
> >>>>              - [L] CF     - Size:   1.12MB, KeyIn: 188K, KeyDrop:  83K, Flush:  0.331GB <-- deferred writes
> >>>>          - min_pg_log_entries = 1, trim = 1, iops = 23.8K
> >>>>              - Default CF - Size:  68.58MB, KeyIn:  22M, KeyDrop:  17M, Flush:  7.936GB
> >>>>              - [M] CF     - Size:  96.39MB, KeyIn: 302M, KeyDrop: 262M, Flush:  9.289GB <-- with this workload this is pg log and dup op kv entries
> >>>>              - [L] CF     - Size:   1.04MB, KeyIn: 209K, KeyDrop:  92K, Flush:  0.368GB <-- deferred writes
> >>>>          - min_pg_log_entries = 3000, trim = 1, iops = 25.8K
> >>
> >> Hi Mark, did you extract the above results from the compaction stats in
> >> the RocksDB LOG?
> >
> >
> > Correct, except for the IOPS numbers which were from the client benchmark.
> >
> >
> >>
> >> ** Compaction Stats [default] **
> >> Level  Files  Size       Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
> >> -----------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>   L0     6/0  270.47 MB    1.1       0.0     0.0       0.0        0.2       0.2        0.0    1.0       0.0     154.3          1          4     0.329      0        0
> >>   L1     3/0  190.94 MB    0.7       0.0     0.0       0.0        0.0       0.0        0.0    0.0       0.0       0.0          0          0     0.000      0        0
> >>  Sum     9/0  461.40 MB    0.0       0.0     0.0       0.0        0.2       0.2        0.0    1.0       0.0     154.3          1          4     0.329      0        0
> >>  Int     0/0    0.00 KB    0.0       0.0     0.0       0.0        0.2       0.2        0.0    1.0       0.0     154.3          1          4     0.329      0        0
> >> Uptime(secs): 9.9 total, 9.9 interval
> >> Flush(GB): cumulative 0.198, interval 0.198
> >>
> >>> Note specifically how the KeyIn rate drops with min_pg_log_entries
> >>> increased (i.e., disabling dup_ops) and pginfo hacked out.  I suspect
> >>> that commenting out log_operation would reduce the KeyIn rate
> >>> significantly further.  Again, these drives can absorb a lot of this,
> >>> so the improvement in iops is fairly modest (and setting
> >>> min_pg_log_entries low actually hurts!), but this isn't just about
> >>> performance, it's about the behavior that we invoke.  The PetStore
> >>> results absolutely show us that on very fast storage we see a dramatic
> >>> CPU usage reduction by removing log_operation and pginfo, so I think
> >>> we should focus on what kind of behavior we want pglog/pginfo/dup_ops
> >>> to invoke.
> >>>
> >>> Mark
> >>>
> >>>
> >>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> Cheers,
> >>>>>> Li Wang
> >>>>>>
> >>>>>> 2018-04-02 13:29 GMT+08:00 xiaoyan li <wisher2003@xxxxxxxxx>:
> >>>>>>>
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> Based on the discussion about pglog above, I have the following
> >>>>>>> rough design. Please share your suggestions.
> >>>>>>>
> >>>>>>> There will be three partitions: a raw part for customer IOs, BlueFS
> >>>>>>> for RocksDB, and a pglog partition.
> >>>>>>> The former two partitions are the same as now. The pglog partition
> >>>>>>> is split into 1M blocks, and we allocate blocks for ring buffers per
> >>>>>>> PG. We will have the following data:
> >>>>>>>
> >>>>>>> Allocation bitmap (just in memory)
> >>>>>>>
> >>>>>>> The pglog partition has a bitmap to record which blocks are
> >>>>>>> allocated. We can rebuild it from pg->allocated_blocks_list when
> >>>>>>> starting, so there is no need to store it on persistent disk. But we
> >>>>>>> will store basic information about the pglog partition in RocksDB,
> >>>>>>> like the block size and block count, when the objectstore is
> >>>>>>> initialized.
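> >>>>>>>
> >>>>>>> A minimal sketch of the startup rebuild, assuming hypothetical
> >>>>>>> names for the metadata and the per-PG block lists:
> >>>>>>>
> >>>>>>>     // Rebuild the in-memory allocation bitmap from each PG's
> >>>>>>>     // allocated_blocks_list at startup (illustrative only).
> >>>>>>>     #include <cstdint>
> >>>>>>>     #include <map>
> >>>>>>>     #include <vector>
> >>>>>>>
> >>>>>>>     struct PGLogMeta {                 // persisted in RocksDB
> >>>>>>>       uint64_t block_size  = 1 << 20;  // 1M blocks
> >>>>>>>       uint64_t block_count = 0;
> >>>>>>>     };
> >>>>>>>
> >>>>>>>     std::vector<bool> rebuild_bitmap(
> >>>>>>>         const PGLogMeta& meta,
> >>>>>>>         const std::map<int, std::vector<uint64_t>>& pg_blocks) {
> >>>>>>>       std::vector<bool> bitmap(meta.block_count, false);
> >>>>>>>       for (const auto& [pgid, blocks] : pg_blocks)
> >>>>>>>         for (uint64_t b : blocks)
> >>>>>>>           bitmap[b] = true;            // mark block as in use
> >>>>>>>       return bitmap;
> >>>>>>>     }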
> >>>>>>>
> >>>>>>> Pg -> allocated_blocks_list
> >>>>>>>
> >>>>>>> When a PG is created and IOs start, we allocate a block for the PG.
> >>>>>>> Every pglog entry is less than 300 bytes, so a 1M block can store
> >>>>>>> about 3495 entries. When the total number of pglog entries exceeds
> >>>>>>> that, we add a new block to the PG.
> >>>>>>>
> >>>>>>> Pg->start_position
> >>>>>>>
> >>>>>>> Record the oldest valid entry, per PG.
> >>>>>>>
> >>>>>>> Pg->next_position
> >>>>>>>
> >>>>>>> Record the next entry to add, per PG. This data will be updated
> >>>>>>> frequently, but RocksDB is suitable for its IO mode, and most of the
> >>>>>>> updates will be merged.
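> >>>>>>>
> >>>>>>> Together, the per-PG state could look roughly like this (a sketch;
> >>>>>>> the field names are illustrative):
> >>>>>>>
> >>>>>>>     // Per-PG ring buffer state: the block list plus two cursors.
> >>>>>>>     #include <cstdint>
> >>>>>>>     #include <vector>
> >>>>>>>
> >>>>>>>     struct PGRingBuffer {
> >>>>>>>       std::vector<uint64_t> blocks;  // 1M blocks owned by this PG
> >>>>>>>       uint64_t start_position = 0;   // oldest valid entry (trim)
> >>>>>>>       uint64_t next_position  = 0;   // where the next entry goes
> >>>>>>>     };
> >>>>>>>     // Both cursors are small, frequently rewritten RocksDB keys,
> >>>>>>>     // so most updates are merged in the memtable before flushing.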
> >>>>>>>
> >>>>>>> Updated BlueStore write process:
> >>>>>>>
> >>>>>>> When writing data to disk (before the metadata update), we can
> >>>>>>> append the pglog entry to its ring buffer in parallel.
> >>>>>>> After that, we submit the PG ring buffer changes (like
> >>>>>>> pg->next_position) together with the other current metadata changes
> >>>>>>> to RocksDB.
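> >>>>>>>
> >>>>>>> In code, that flow might look like this (stand-in types only, not
> >>>>>>> the BlueStore API):
> >>>>>>>
> >>>>>>>     #include <cstdint>
> >>>>>>>     #include <string>
> >>>>>>>     #include <utility>
> >>>>>>>     #include <vector>
> >>>>>>>
> >>>>>>>     struct PGRing { uint64_t next_position = 0; };
> >>>>>>>
> >>>>>>>     // Stand-in: append the entry at next_position on the pglog
> >>>>>>>     // partition, issued in parallel with the data write.
> >>>>>>>     void append_pglog(PGRing& rb, const std::string& entry) {
> >>>>>>>       rb.next_position += entry.size();
> >>>>>>>     }
> >>>>>>>
> >>>>>>>     // Stand-in: one RocksDB batch for all the metadata updates.
> >>>>>>>     using KvBatch = std::vector<std::pair<std::string, uint64_t>>;
> >>>>>>>     void commit_kv(const KvBatch&) { /* submit to RocksDB */ }
> >>>>>>>
> >>>>>>>     void handle_write(PGRing& rb, const std::string& data,
> >>>>>>>                       const std::string& log_entry) {
> >>>>>>>       (void)data;  // data write to the raw partition elided
> >>>>>>>       // 1. the data write and the pglog append run in parallel
> >>>>>>>       append_pglog(rb, log_entry);
> >>>>>>>       // 2. the cursor and other metadata commit in one kv batch
> >>>>>>>       commit_kv({{"pg_next_position", rb.next_position}});
> >>>>>>>     }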
> >>>>>>>
> >>>>>>>
> >>>>>>> On Fri, Mar 30, 2018 at 6:23 PM, Varada Kari <varada.kari@xxxxxxxxx>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> On Fri, Mar 30, 2018 at 1:01 PM, Li Wang <laurence.liwang@xxxxxxxxx>
> >>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>     If we want to store the pg log in a standalone ring buffer,
> >>>>>>>>> another candidate is the deferred write: why not use the ring
> >>>>>>>>> buffer as the journal for 4K random writes? It should be much more
> >>>>>>>>> lightweight than RocksDB.
> >>>>>>>>>
> >>>>>>>> It will be similar to the FileStore implementation for small
> >>>>>>>> writes. That comes with the same alignment issues and the attendant
> >>>>>>>> write amplification. RocksDB nicely abstracts that, and the data
> >>>>>>>> doesn't make it to L0 files because of the WAL handling.
> >>>>>>>>
> >>>>>>>> Varada
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>> Li Wang
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 2018-03-30 4:04 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
> >>>>>>>>>>
> >>>>>>>>>> On Wed, 28 Mar 2018, Matt Benjamin wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Mar 28, 2018 at 1:44 PM, Mark Nelson <mnelson@xxxxxxxxxx>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 03/28/2018 12:21 PM, Adam C. Emerson wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2) It sure feels like conceptually the pglog should be
> >>>>>>>>>>>> represented as a per-pg ring buffer rather than key/value data.
> >>>>>>>>>>>> Maybe there are really important reasons that it shouldn't be,
> >>>>>>>>>>>> but I don't currently see them.  As far as the objectstore is
> >>>>>>>>>>>> concerned, it seems to me like there are valid reasons to
> >>>>>>>>>>>> provide some kind of log interface and perhaps that should be
> >>>>>>>>>>>> used for pg_log.  That sort of opens the door for different
> >>>>>>>>>>>> object store implementations fulfilling that functionality in
> >>>>>>>>>>>> whatever ways the author deems fit.
> >>>>>>>>>>>
> >>>>>>>>>>> In the reddit lingo, pretty much this.  We should be
> >>>>>>>>>>> concentrating on this direction, or ruling it out.
> >>>>>>>>>>
> >>>>>>>>>> Yeah, +1
> >>>>>>>>>>
> >>>>>>>>>> It seems like step 1 is a proof of concept branch that encodes
> >>>>>>>>>> pg_log_entry_t's and writes them to a simple ring buffer.  The
> >>>>>>>>>> first questions to answer are (a) whether this does in fact
> >>>>>>>>>> improve things significantly and (b) whether we want to have an
> >>>>>>>>>> independent ring buffer for each PG or try to mix them into one
> >>>>>>>>>> big one for the whole OSD (or maybe per shard).
> >>>>>>>>>>
> >>>>>>>>>> The second question is how that fares on HDDs.  My guess is that
> >>>>>>>>>> the current rocksdb strategy is better because it reduces the
> >>>>>>>>>> number of IOs, and the additional data getting compacted (and CPU
> >>>>>>>>>> usage) isn't the limiting factor on HDD performance (IOPS are).
> >>>>>>>>>> (But maybe we'll get lucky and the new strategy will be best for
> >>>>>>>>>> both HDD and SSD..)
> >>>>>>>>>>
> >>>>>>>>>> Then we have to modify PGLog to be a complete implementation.  A
> >>>>>>>>>> strict ring buffer probably won't work because the PG log might
> >>>>>>>>>> not trim and because log entries are variable length, so there'll
> >>>>>>>>>> probably need to be some simple mapping table (vs a trivial
> >>>>>>>>>> start/end ring buffer position) to deal with that.  We have to
> >>>>>>>>>> trim the log periodically, so every so many entries we may want
> >>>>>>>>>> to realign with a min_alloc_size boundary.  We sometimes have to
> >>>>>>>>>> back up and rewrite divergent portions of the log (during
> >>>>>>>>>> peering), so we'll need to sort out whether that is a complete
> >>>>>>>>>> reencode/rewrite or whether we keep encoded entries in RAM
> >>>>>>>>>> (individually or in chunks), etc etc.
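> >>>>>>>>>>
> >>>>>>>>>> A rough sketch of what such a mapping table could look like
> >>>>>>>>>> (purely illustrative, not a design):
> >>>>>>>>>>
> >>>>>>>>>>     #include <cstdint>
> >>>>>>>>>>     #include <deque>
> >>>>>>>>>>
> >>>>>>>>>>     constexpr uint64_t MIN_ALLOC = 4096;  // min_alloc_size
> >>>>>>>>>>
> >>>>>>>>>>     struct EntryRef {
> >>>>>>>>>>       uint64_t version;  // pg log version of this entry
> >>>>>>>>>>       uint64_t offset;   // byte offset in the PG's log space
> >>>>>>>>>>       uint32_t length;   // entries are variable length
> >>>>>>>>>>     };
> >>>>>>>>>>
> >>>>>>>>>>     struct LogIndex {
> >>>>>>>>>>       std::deque<EntryRef> entries;  // oldest..newest
> >>>>>>>>>>       uint64_t write_pos = 0;
> >>>>>>>>>>
> >>>>>>>>>>       void append(uint64_t ver, uint32_t len, bool realign) {
> >>>>>>>>>>         if (realign)  // periodic min_alloc_size realignment
> >>>>>>>>>>           write_pos = (write_pos + MIN_ALLOC - 1) &
> >>>>>>>>>>                       ~(MIN_ALLOC - 1);
> >>>>>>>>>>         entries.push_back({ver, write_pos, len});
> >>>>>>>>>>         write_pos += len;
> >>>>>>>>>>       }
> >>>>>>>>>>       void trim_to(uint64_t ver) {   // periodic log trim
> >>>>>>>>>>         while (!entries.empty() && entries.front().version < ver)
> >>>>>>>>>>           entries.pop_front();
> >>>>>>>>>>       }
> >>>>>>>>>>       void rewind_to(uint64_t ver) { // divergent tail (peering)
> >>>>>>>>>>         while (!entries.empty() && entries.back().version > ver) {
> >>>>>>>>>>           write_pos = entries.back().offset;
> >>>>>>>>>>           entries.pop_back();
> >>>>>>>>>>         }
> >>>>>>>>>>       }
> >>>>>>>>>>     };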
> >>>>>>>>>>
> >>>>>>>>>> sage
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Best wishes
> >>>>>>> Lisa
> >>>>>>
> >>>>>>
> >>
> >>
> >>
> >
>
>
>
> --
> Best wishes
> Lisa


