Hi all,

I wrote a PoC that splits the pglog out of RocksDB and stores it in a standalone space on the block device. The changes are in the OSD and in BlueStore:

OSD:
1. Split the pglog entries and pglog info out of the omaps.

BlueStore:
1. Allocate 16M of space in the block device per PG for storing the pglog.
2. For every transaction from the OSD, combine the pglog entries and pglog info and write them into one block. The block size is set to 4k for now.

Currently only the write workflow works. With librbd+fio on a cluster with one OSD (on an Intel Optane 370G) I got the following numbers for 4k random writes; performance improved by 13.87%.

Master:
  write: IOPS=48.3k, BW=189MiB/s (198MB/s)(55.3GiB/300009msec)
    slat (nsec): min=1032, max=1683.2k, avg=4345.13, stdev=3988.69
    clat (msec): min=3, max=123, avg=10.60, stdev= 8.31
     lat (msec): min=3, max=123, avg=10.60, stdev= 8.31

Pgsplit branch:
  write: IOPS=55.0k, BW=215MiB/s (225MB/s)(62.0GiB/300010msec)
    slat (nsec): min=1068, max=1339.7k, avg=4360.58, stdev=3878.47
    clat (msec): min=2, max=120, avg= 9.30, stdev= 6.92
     lat (msec): min=2, max=120, avg= 9.31, stdev= 6.92

Here is the PoC: https://github.com/lixiaoy1/ceph/commits/pglog-split-fastinfo

The problem is that for every transaction I use a whole 4k block to save the pglog entries and pglog info, which together are only 130 + 920 = 1050 bytes, so most of each block is wasted. Any suggestions?
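To make the per-transaction write concrete, here is a minimal sketch of what I described above: combine the encoded pglog entries and pglog info, pad them to one 4k block, and append the block to the PG's 16M region. The names (PgLogRegion, append) and the length-header layout are illustrative only, not the actual code in the PoC branch.

  // Illustrative sketch only (not the PoC code): pack the encoded pglog
  // entries and pglog info for one transaction into a single 4k block and
  // append it to the PG's reserved 16M region on the block device.
  #include <cstdint>
  #include <cstring>
  #include <string>
  #include <vector>
  #include <unistd.h>   // pwrite

  constexpr uint64_t BLOCK_SIZE = 4096;              // one block per transaction
  constexpr uint64_t PG_REGION  = 16 * 1024 * 1024;  // 16M reserved per PG

  struct PgLogRegion {
    int      fd;              // block device fd (assumed already open)
    uint64_t base;            // byte offset of this PG's 16M region
    uint64_t next_block = 0;  // next free block index within the region

    // Combine the encoded pglog entries and pglog info (together only about
    // 1050 bytes per transaction here), pad to a full block, and write it
    // at the next block offset.
    bool append(const std::string& encoded_entries,
                const std::string& encoded_info) {
      std::vector<char> block(BLOCK_SIZE, 0);
      uint32_t elen = encoded_entries.size();
      uint32_t ilen = encoded_info.size();
      if (8 + elen + ilen > BLOCK_SIZE)
        return false;                       // would not fit in one block
      std::memcpy(block.data(), &elen, 4);
      std::memcpy(block.data() + 4, &ilen, 4);
      std::memcpy(block.data() + 8, encoded_entries.data(), elen);
      std::memcpy(block.data() + 8 + elen, encoded_info.data(), ilen);

      uint64_t off = base + next_block * BLOCK_SIZE;
      if (off + BLOCK_SIZE > base + PG_REGION)
        return false;                       // region full; trimming/reuse not shown
      if (pwrite(fd, block.data(), BLOCK_SIZE, off) != (ssize_t)BLOCK_SIZE)
        return false;
      ++next_block;
      return true;
    }
  };

With this layout every append consumes a full 4k block no matter how small the payload is, which is exactly where the wasted space comes from.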
Best wishes
Lisa

On Thu, Apr 5, 2018 at 12:09 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
> On 04/03/2018 09:36 PM, xiaoyan li wrote:
>>
>> On Tue, Apr 3, 2018 at 11:15 PM, Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote:
>>>
>>> On 04/03/2018 09:56 AM, Mark Nelson wrote:
>>>>
>>>> On 04/03/2018 08:27 AM, Sage Weil wrote:
>>>>>
>>>>> On Tue, 3 Apr 2018, Li Wang wrote:
>>>>>>
>>>>>> Hi,
>>>>>> Before we move forward, could someone run a test where the pglog is not written into rocksdb at all, to see how much the performance improves as an upper bound? It should be less than turning on bluestore_debug_omit_kv_commit.
>>>>>
>>>>> +1
>>>>>
>>>>> (The PetStore behavior doesn't tell us anything about how BlueStore will behave without the pglog overhead.)
>>>>>
>>>>> sage
>>>>
>>>> We do have some testing of bluestore's behavior, though it's about 6 months old now:
>>>>
>>>> - ~1 hour 4K random overwrites to RBD on 1 NVMe OSD
>>>> - 128 PGs
>>>> - stats are sloppy since they only appear every ~10 mins
>>>>
>>>> - default min_pg_log_entries = 1500, trim = default, iops = 26.6K
>>>>   - Default CF - Size: 65.63MB, KeyIn: 22M, KeyDrop: 17M, Flush: 7.858GB
>>>>   - [M] CF - Size: 118.09MB, KeyIn: 302M, KeyDrop: 269M, Flush: 15.847GB <-- with this workload this is pg log and dup op kv entries
>>>>   - [L] CF - Size: 1.00MB, KeyIn: 181K, KeyDrop: 80K, Flush: 0.320GB <-- deferred writes
>>>>
>>>> - min_pg_log_entries = 10, trim = 10, iops = 24.2K
>>>>   - Default CF - Size: 23.15MB, KeyIn: 21M, KeyDrop: 16M, Flush: 7.538GB
>>>>   - [M] CF - Size: 60.89MB, KeyIn: 277M, KeyDrop: 250M, Flush: 8.884GB <-- with this workload this is pg log and dup op kv entries
>>>>   - [L] CF - Size: 1.12MB, KeyIn: 188K, KeyDrop: 83K, Flush: 0.331GB <-- deferred writes
>>>>
>>>> - min_pg_log_entries = 1, trim = 1, iops = 23.8K
>>>>   - Default CF - Size: 68.58MB, KeyIn: 22M, KeyDrop: 17M, Flush: 7.936GB
>>>>   - [M] CF - Size: 96.39MB, KeyIn: 302M, KeyDrop: 262M, Flush: 9.289GB <-- with this workload this is pg log and dup op kv entries
>>>>   - [L] CF - Size: 1.04MB, KeyIn: 209K, KeyDrop: 92K, Flush: 0.368GB <-- deferred writes
>>>>
>>>> - min_pg_log_entries = 3000, trim = 1, iops = 25.8K
>>>>
>>>> The actual performance variation here I think is much less important than the KeyIn behavior. The NVMe devices in these tests are fast enough to absorb a fair amount of overhead.
>>>
>>> Ugh, sorry. That will teach me to talk in meeting and paste at the same time. Those were the wrong stats.
>>> Here are the right ones:
>>>
>>>> - ~1 hour 4K random overwrites to RBD on 1 NVMe OSD
>>>> - 128 PGs
>>>> - stats are sloppy since they only appear every ~10 mins
>>>>
>>>> - min_pg_log_entries = 3000, trim = default, pginfo hack, iops = 27.8K
>>>>   - Default CF - Size: 23.15MB, KeyIn: 24M, KeyDrop: 19M, Flush: 8.662GB
>>>>   - [M] CF - Size: 159.97MB, KeyIn: 162M, KeyDrop: 139M, Flush: 10.335GB <-- with this workload this is pg log and dup op kv entries
>>>>   - [L] CF - Size: 1.39MB, KeyIn: 201K, KeyDrop: 89K, Flush: 0.355GB <-- deferred writes
>>>>
>>>> - min_pg_log_entries = 3000, trim = default, iops = 28.3K
>>>>   - Default CF - Size: 23.13MB, KeyIn: 25M, KeyDrop: 19M, Flush: 8.762GB
>>>>   - [M] CF - Size: 159.97MB, KeyIn: 202M, KeyDrop: 175M, Flush: 16.890GB <-- with this workload this is pg log and dup op kv entries
>>>>   - [L] CF - Size: 0.86MB, KeyIn: 201K, KeyDrop: 89K, Flush: 0.355GB <-- deferred writes
>>>>
>>>> - default min_pg_log_entries = 1500, trim = default, iops = 26.6K
>>>>   - Default CF - Size: 65.63MB, KeyIn: 22M, KeyDrop: 17M, Flush: 7.858GB
>>>>   - [M] CF - Size: 118.09MB, KeyIn: 302M, KeyDrop: 269M, Flush: 15.847GB <-- with this workload this is pg log and dup op kv entries
>>>>   - [L] CF - Size: 1.00MB, KeyIn: 181K, KeyDrop: 80K, Flush: 0.320GB <-- deferred writes
>>>>
>>>> - min_pg_log_entries = 10, trim = 10, iops = 24.2K
>>>>   - Default CF - Size: 23.15MB, KeyIn: 21M, KeyDrop: 16M, Flush: 7.538GB
>>>>   - [M] CF - Size: 60.89MB, KeyIn: 277M, KeyDrop: 250M, Flush: 8.884GB <-- with this workload this is pg log and dup op kv entries
>>>>   - [L] CF - Size: 1.12MB, KeyIn: 188K, KeyDrop: 83K, Flush: 0.331GB <-- deferred writes
>>>>
>>>> - min_pg_log_entries = 1, trim = 1, iops = 23.8K
>>>>   - Default CF - Size: 68.58MB, KeyIn: 22M, KeyDrop: 17M, Flush: 7.936GB
>>>>   - [M] CF - Size: 96.39MB, KeyIn: 302M, KeyDrop: 262M, Flush: 9.289GB <-- with this workload this is pg log and dup op kv entries
>>>>   - [L] CF - Size: 1.04MB, KeyIn: 209K, KeyDrop: 92K, Flush: 0.368GB <-- deferred writes
>>>>
>>>> - min_pg_log_entries = 3000, trim = 1, iops = 25.8K
>>
>> Hi Mark, do you extract the above results from the compaction stats in the RocksDB LOG?
>
> Correct, except for the IOPS numbers, which were from the client benchmark.
>
>> ** Compaction Stats [default] **
>> Level  Files    Size    Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
>> ----------------------------------------------------------------------------------------------------------------------------------------------------
>>   L0    6/0   270.47 MB  1.1      0.0    0.0      0.0       0.2      0.2       0.0   1.0      0.0    154.3         1         4    0.329     0       0
>>   L1    3/0   190.94 MB  0.7      0.0    0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000     0       0
>>  Sum    9/0   461.40 MB  0.0      0.0    0.0      0.0       0.2      0.2       0.0   1.0      0.0    154.3         1         4    0.329     0       0
>>  Int    0/0     0.00 KB  0.0      0.0    0.0      0.0       0.2      0.2       0.0   1.0      0.0    154.3         1         4    0.329     0       0
>> Uptime(secs): 9.9 total, 9.9 interval
>> Flush(GB): cumulative 0.198, interval 0.198
>>
>>> Note specifically how the KeyIn rate drops with min_pg_log_entries increased (i.e. disabling dup_ops) and with pginfo hacked out. I suspect that commenting out log_operation would reduce the KeyIn rate significantly further.
>>> Again, these drives can absorb a lot of this, so the improvement in iops is fairly modest (and setting min_pg_log_entries low actually hurts!), but this isn't just about performance, it's about the behavior that we invoke. The Petstore results absolutely show us that on very fast storage we see a dramatic CPU usage reduction by removing log_operation and pginfo, so I think we should focus on what kind of behavior we want pglog/pginfo/dup_ops to invoke.
>>>
>>> Mark
>>>
>>>>>> Cheers,
>>>>>> Li Wang
>>>>>>
>>>>>> 2018-04-02 13:29 GMT+08:00 xiaoyan li <wisher2003@xxxxxxxxx>:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> Based on your discussion about pglog above, I have the following rough design. Please give your suggestions.
>>>>>>>
>>>>>>> There will be three partitions: a raw partition for customer IOs, BlueFS for RocksDB, and a pglog partition.
>>>>>>> The former two partitions are the same as today. The pglog partition is split into 1M blocks, and we allocate blocks for a ring buffer per PG.
>>>>>>> We will have the following data:
>>>>>>>
>>>>>>> Allocation bitmap (in memory only)
>>>>>>>
>>>>>>> The pglog partition has a bitmap to record which blocks are allocated. We can rebuild it from pg->allocated_blocks_list when starting, so there is no need to store it on persistent disk. But we will store basic information about the pglog partition (block size, block count, etc.) in RocksDB when the objectstore is initialized.
>>>>>>>
>>>>>>> Pg->allocated_blocks_list
>>>>>>>
>>>>>>> When a PG is created and IOs start, we allocate a block for it. Every pglog entry is less than 300 bytes, so a 1M block can store about 3495 entries. When the total number of pglog entries exceeds that, we add a new block to the PG.
>>>>>>>
>>>>>>> Pg->start_position
>>>>>>>
>>>>>>> Records the oldest valid entry per PG.
>>>>>>>
>>>>>>> Pg->next_position
>>>>>>>
>>>>>>> Records the next entry to add per PG. This data is updated frequently, but RocksDB is well suited to that IO pattern, and most of the updates will be merged.
>>>>>>>
>>>>>>> Updated BlueStore write process:
>>>>>>>
>>>>>>> When writing data to disk (before the metadata update), we append the pglog entry to the PG's ring buffer in parallel.
>>>>>>> After that, we submit the ring buffer changes (like pg->next_position) together with the other metadata changes to RocksDB.
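A minimal sketch of the bookkeeping described in the quoted design above, i.e. the in-memory allocation bitmap, the per-PG block list, and the start/next positions. The type and member names (PgLogAllocator, PgLogRing) are made up for illustration and are not Ceph code.

  #include <cstdint>
  #include <vector>

  constexpr uint64_t PGLOG_BLOCK = 1024 * 1024;   // the pglog partition is split into 1M blocks

  // In-memory allocation bitmap for the pglog partition; rebuilt at startup
  // from every PG's allocated blocks list and never persisted itself.
  struct PgLogAllocator {
    std::vector<bool> used;   // one flag per 1M block
    explicit PgLogAllocator(uint64_t nblocks) : used(nblocks, false) {}

    int64_t alloc() {
      for (uint64_t i = 0; i < used.size(); ++i)
        if (!used[i]) { used[i] = true; return (int64_t)i; }
      return -1;              // partition full
    }
    void release(uint64_t b) { used[b] = false; }
  };

  // Per-PG ring buffer state: the positions are what would be persisted
  // (e.g. in RocksDB), while the entries themselves live in the blocks.
  struct PgLogRing {
    std::vector<uint64_t> blocks;   // allocated 1M block ids, in order
    uint64_t start_position = 0;    // oldest valid entry (advanced by trimming)
    uint64_t next_position  = 0;    // where the next entry will be appended

    // Entries are under ~300 bytes, so one 1M block holds roughly 3495 of
    // them; grow the ring by one block when the next append would not fit.
    void reserve_for_append(PgLogAllocator& alloc, uint64_t entry_len) {
      uint64_t capacity = blocks.size() * PGLOG_BLOCK;
      if (next_position + entry_len > capacity) {
        int64_t b = alloc.alloc();
        if (b >= 0) blocks.push_back((uint64_t)b);
      }
    }
  };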
>>>>>>>
>>>>>>> On Fri, Mar 30, 2018 at 6:23 PM, Varada Kari <varada.kari@xxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> On Fri, Mar 30, 2018 at 1:01 PM, Li Wang <laurence.liwang@xxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> If we want to store the pg log in a standalone ring buffer, another candidate is the deferred write path: why not use the ring buffer as the journal for 4K random writes? It should be much more lightweight than RocksDB.
>>>>>>>>>
>>>>>>>> That would be similar to the FileStore implementation for small writes, which comes with the same alignment issues and the resulting write amplification. RocksDB abstracts that away nicely, and we avoid writing to L0 files because of the WAL handling.
>>>>>>>>
>>>>>>>> Varada
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Li Wang
>>>>>>>>>
>>>>>>>>> 2018-03-30 4:04 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
>>>>>>>>>>
>>>>>>>>>> On Wed, 28 Mar 2018, Matt Benjamin wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 28, 2018 at 1:44 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 03/28/2018 12:21 PM, Adam C. Emerson wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> 2) It sure feels like conceptually the pglog should be represented as a per-pg ring buffer rather than key/value data. Maybe there are really important reasons that it shouldn't be, but I don't currently see them. As far as the objectstore is concerned, it seems to me like there are valid reasons to provide some kind of log interface, and perhaps that should be used for pg_log. That sort of opens the door for different object store implementations fulfilling that functionality in whatever ways the author deems fit.
>>>>>>>>>>>
>>>>>>>>>>> In the reddit lingo, pretty much this. We should be concentrating on this direction, or ruling it out.
>>>>>>>>>>
>>>>>>>>>> Yeah, +1
>>>>>>>>>>
>>>>>>>>>> It seems like step 1 is a proof-of-concept branch that encodes pg_log_entry_t's and writes them to a simple ring buffer. The first questions to answer are (a) whether this does in fact improve things significantly and (b) whether we want to have an independent ring buffer for each PG or try to mix them into one big one for the whole OSD (or maybe per shard).
>>>>>>>>>>
>>>>>>>>>> The second question is how that fares on HDDs. My guess is that the current rocksdb strategy is better there because it reduces the number of IOs, and the additional data getting compacted (and CPU usage) isn't the limiting factor on HDD performance (IOPS are). (But maybe we'll get lucky and the new strategy will be best for both HDD and SSD..)
>>>>>>>>>>
>>>>>>>>>> Then we have to modify PGLog to be a complete implementation. A strict ring buffer probably won't work, because the PG log might not trim and because log entries are variable length, so there'll probably need to be some simple mapping table (vs a trivial start/end ring buffer position) to deal with that. We have to trim the log periodically, so every so many entries we may want to realign with a min_alloc_size boundary. We sometimes have to back up and rewrite divergent portions of the log (during peering), so we'll need to sort out whether that is a complete reencode/rewrite or whether we keep encoded entries in RAM (individually or in chunks), etc.
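A minimal sketch of the "simple mapping table" idea in the quoted message above: an index from log version to (offset, length), so variable-length entries, periodic trimming, and occasional realignment to a min_alloc_size boundary can be handled without relying on a bare start/end ring position. The names are illustrative only, not Ceph code.

  #include <cstdint>
  #include <deque>

  constexpr uint64_t MIN_ALLOC = 4096;   // assumed min_alloc_size used for realignment

  struct EntryRef {
    uint64_t version;   // pg log version of the entry
    uint64_t offset;    // byte offset within the PG's log space
    uint32_t length;    // encoded length of the entry
  };

  struct PgLogIndex {
    std::deque<EntryRef> entries;   // oldest at the front, newest at the back
    uint64_t write_pos = 0;

    // Append: record where the entry landed; optionally round the write
    // position up to a min_alloc_size boundary so a later rewrite of the
    // tail (e.g. divergent entries during peering) starts block-aligned.
    void note_append(uint64_t version, uint32_t len, bool realign) {
      entries.push_back({version, write_pos, len});
      write_pos += len;
      if (realign)
        write_pos = (write_pos + MIN_ALLOC - 1) / MIN_ALLOC * MIN_ALLOC;
    }

    // Trim: drop index records up to and including a version; the space in
    // front of the new oldest entry becomes reclaimable.
    void trim_to(uint64_t version) {
      while (!entries.empty() && entries.front().version <= version)
        entries.pop_front();
    }
  };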
>>>>>>>>>>
>>>>>>>>>> sage
>>>>>>>
>>>>>>> --
>>>>>>> Best wishes
>>>>>>> Lisa

--
Best wishes
Lisa
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html