On Wed, Jun 20, 2018 at 7:55 PM, Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote:
> Hi Lisa,
>
>
> On 06/20/2018 03:19 AM, xiaoyan li wrote:
>>
>> Hi all,
>> I wrote a POC to split the pglog from RocksDB and store it in
>> standalone space on the block device.
>
>
> Excellent! This is very exciting!
>
>> The updates are done in OSD and BlueStore:
>>
>> OSD parts:
>> 1. Split pglog entries and pglog info from omaps.
>> BlueStore:
>> 1. Allocate 16M space in the block device per PG for storing pglog.
>> 2. For every transaction from the OSD, combine the pglog entries and
>> pglog info, and write them into a block. The block is set to 4k at
>> this moment.
>>
>> Currently, I only make the write workflow work.
>> With librbd+fio on a cluster with one OSD (on an Intel Optane 370G), I
>> got the following performance for 4k random writes; performance
>> improved by 13.87%.
>>
>> Master:
>> write: IOPS=48.3k, BW=189MiB/s (198MB/s)(55.3GiB/300009msec)
>> slat (nsec): min=1032, max=1683.2k, avg=4345.13, stdev=3988.69
>> clat (msec): min=3, max=123, avg=10.60, stdev= 8.31
>> lat (msec): min=3, max=123, avg=10.60, stdev= 8.31
>>
>> Pgsplit branch:
>> write: IOPS=55.0k, BW=215MiB/s (225MB/s)(62.0GiB/300010msec)
>> slat (nsec): min=1068, max=1339.7k, avg=4360.58, stdev=3878.47
>> clat (msec): min=2, max=120, avg= 9.30, stdev= 6.92
>> lat (msec): min=2, max=120, avg= 9.31, stdev= 6.92
>
>
> These are better numbers than I typically get! I'll play with your
> branch, but usually I see us pegged in this workload in the
> kv_sync_thread. Did you notice any significant change in CPU
> consumption?
>
>>
>> Here is the POC:
>> https://github.com/lixiaoy1/ceph/commits/pglog-split-fastinfo
>> The problem is that for every transaction I use a 4k block to save
>> the pglog entries and pglog info, which are only 130+920 = 1050 bytes.
>> This wastes a lot of space.
>> Any suggestions?
>
>
> I guess 100*3000*4k = ~1.2GB?
>
> Josh, based on our discussion last week, the size of entries is going
> to be variable right?
>
> Another approach you could try could be to write all updates to a
> single log that you periodically rotate. You keep a reference count of
> what entries are where and when all references to a given log drop off
> you delete it. The space amplification could potentially be quite high
> in the pathological case (100*3000*sizeof(log)!), but you could do a
> really coarse grain compaction if that really proved to be a problem.
> It'd be better for HDD, but your current approach might still be better
> on flash.

Thanks for your comments, Mark.
I need to write the pglog entries and pglog info to the block device for
every transaction, and these writes are smaller than 4K. Do you mean I
should write them in 4k blocks at first and later compact them into their
final place?
The space amplification can be reduced if I use a 512-byte block size.
Alternatively, I can split the space into 4k stripes and delete logs in
units of a stripe. Within a stripe, pglog entries are appended, so there
is no need to reserve a fixed amount of space per entry, and the total
space can stay near 100*3000*sizeof(log).
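Not code from the pglog-split-fastinfo branch, just a rough sketch of the
stripe idea with hypothetical names: each transaction's encoded pglog
entries + pglog info (~1050 bytes) are appended into the PG's current 4k
stripe instead of consuming a whole 4k block, and space is reclaimed one
stripe at a time once everything in it has been trimmed.

// Sketch only: hypothetical types, not code from the POC branch.
#include <cstdint>
#include <deque>

constexpr uint32_t STRIPE_SIZE = 4096;          // reclaim unit

struct Stripe {
  uint64_t offset = 0;  // byte offset of this stripe in the PG's 16M space
  uint32_t used = 0;    // bytes appended so far
  uint32_t live = 0;    // entries written and not yet trimmed
};

struct PGLogStripes {
  uint64_t next_free = 0;        // next unallocated stripe offset
  std::deque<Stripe> stripes;    // oldest stripe at the front

  // Append one encoded pglog entry + pglog info; open a new stripe only
  // when the current one cannot hold it.
  Stripe& append(uint32_t encoded_len) {
    if (stripes.empty() ||
        stripes.back().used + encoded_len > STRIPE_SIZE) {
      Stripe s;
      s.offset = next_free;
      next_free += STRIPE_SIZE;
      stripes.push_back(s);
    }
    Stripe& s = stripes.back();
    s.used += encoded_len;
    s.live++;
    return s;
  }

  // pglog trims oldest-first, so whole stripes fall off the front; the
  // worst case stays near 100 PGs * 3000 entries * sizeof(log).
  void trim_oldest_entry() {
    if (stripes.empty() || stripes.front().live == 0)
      return;
    if (--stripes.front().live == 0)
      stripes.pop_front();       // stripe can be reused for new appends
  }
};

With ~1050 bytes per transaction, a 4k stripe absorbs three or four
transactions, so each write IO stays small while the reserved space tracks
the live log size rather than 4k per entry.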
>
> Mark
>
>>
>> Best wishes
>> Lisa
>>
>> On Thu, Apr 5, 2018 at 12:09 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>>>
>>> On 04/03/2018 09:36 PM, xiaoyan li wrote:
>>>>
>>>> On Tue, Apr 3, 2018 at 11:15 PM, Mark Nelson <mark.a.nelson@xxxxxxxxx>
>>>> wrote:
>>>>>
>>>>> On 04/03/2018 09:56 AM, Mark Nelson wrote:
>>>>>>
>>>>>> On 04/03/2018 08:27 AM, Sage Weil wrote:
>>>>>>>
>>>>>>> On Tue, 3 Apr 2018, Li Wang wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>   Before we move forward, could someone run a test where the
>>>>>>>> pglog is not written into rocksdb at all, to see how much the
>>>>>>>> performance improvement is as an upper bound? It should be less
>>>>>>>> than turning on bluestore_debug_omit_kv_commit.
>>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> (The PetStore behavior doesn't tell us anything about how BlueStore
>>>>>>> will behave without the pglog overhead.)
>>>>>>>
>>>>>>> sage
>>>>>>
>>>>>> We do have some testing of bluestore's behavior, though it's about 6
>>>>>> months old now:
>>>>>>
>>>>>> - ~1 hour 4K random overwrites to RBD on 1 NVMe OSD
>>>>>> - 128 PGs
>>>>>> - stats are sloppy since they only appear every ~10 mins
>>>>>>
>>>>>> - default min_pg_log_entries = 1500, trim = default, iops = 26.6K
>>>>>>   - Default CF - Size: 65.63MB, KeyIn: 22M, KeyDrop: 17M, Flush: 7.858GB
>>>>>>   - [M] CF - Size: 118.09MB, KeyIn: 302M, KeyDrop: 269M, Flush: 15.847GB
>>>>>>     <-- with this workload this is pg log and dup op kv entries
>>>>>>   - [L] CF - Size: 1.00MB, KeyIn: 181K, KeyDrop: 80K, Flush: 0.320GB
>>>>>>     <-- deferred writes
>>>>>> - min_pg_log_entries = 10, trim = 10, iops = 24.2K
>>>>>>   - Default CF - Size: 23.15MB, KeyIn: 21M, KeyDrop: 16M, Flush: 7.538GB
>>>>>>   - [M] CF - Size: 60.89MB, KeyIn: 277M, KeyDrop: 250M, Flush: 8.884GB
>>>>>>     <-- with this workload this is pg log and dup op kv entries
>>>>>>   - [L] CF - Size: 1.12MB, KeyIn: 188K, KeyDrop: 83K, Flush: 0.331GB
>>>>>>     <-- deferred writes
>>>>>> - min_pg_log_entries = 1, trim = 1, iops = 23.8K
>>>>>>   - Default CF - Size: 68.58MB, KeyIn: 22M, KeyDrop: 17M, Flush: 7.936GB
>>>>>>   - [M] CF - Size: 96.39MB, KeyIn: 302M, KeyDrop: 262M, Flush: 9.289GB
>>>>>>     <-- with this workload this is pg log and dup op kv entries
>>>>>>   - [L] CF - Size: 1.04MB, KeyIn: 209K, KeyDrop: 92K, Flush: 0.368GB
>>>>>>     <-- deferred writes
>>>>>> - min_pg_log_entries = 3000, trim = 1, iops = 25.8K
>>>>>>
>>>>>> The actual performance variation here I think is much less important
>>>>>> than the KeyIn behavior. The NVMe devices in these tests are fast
>>>>>> enough to absorb a fair amount of overhead.
>>>>>
>>>>> Ugh, sorry. That will teach me to talk in meeting and paste at the
>>>>> same time. Those were the wrong stats.
>>>>> Here are the right ones:
>>>>>
>>>>>> - ~1 hour 4K random overwrites to RBD on 1 NVMe OSD
>>>>>> - 128 PGs
>>>>>> - stats are sloppy since they only appear every ~10 mins
>>>>>>
>>>>>> - min_pg_log_entries = 3000, trim = default, pginfo hack, iops = 27.8K
>>>>>>   - Default CF - Size: 23.15MB, KeyIn: 24M, KeyDrop: 19M, Flush: 8.662GB
>>>>>>   - [M] CF - Size: 159.97MB, KeyIn: 162M, KeyDrop: 139M, Flush: 10.335GB
>>>>>>     <-- with this workload this is pg log and dup op kv entries
>>>>>>   - [L] CF - Size: 1.39MB, KeyIn: 201K, KeyDrop: 89K, Flush: 0.355GB
>>>>>>     <-- deferred writes
>>>>>> - min_pg_log_entries = 3000, trim = default, iops = 28.3K
>>>>>>   - Default CF - Size: 23.13MB, KeyIn: 25M, KeyDrop: 19M, Flush: 8.762GB
>>>>>>   - [M] CF - Size: 159.97MB, KeyIn: 202M, KeyDrop: 175M, Flush: 16.890GB
>>>>>>     <-- with this workload this is pg log and dup op kv entries
>>>>>>   - [L] CF - Size: 0.86MB, KeyIn: 201K, KeyDrop: 89K, Flush: 0.355GB
>>>>>>     <-- deferred writes
>>>>>> - default min_pg_log_entries = 1500, trim = default, iops = 26.6K
>>>>>>   - Default CF - Size: 65.63MB, KeyIn: 22M, KeyDrop: 17M, Flush: 7.858GB
>>>>>>   - [M] CF - Size: 118.09MB, KeyIn: 302M, KeyDrop: 269M, Flush: 15.847GB
>>>>>>     <-- with this workload this is pg log and dup op kv entries
>>>>>>   - [L] CF - Size: 1.00MB, KeyIn: 181K, KeyDrop: 80K, Flush: 0.320GB
>>>>>>     <-- deferred writes
>>>>>> - min_pg_log_entries = 10, trim = 10, iops = 24.2K
>>>>>>   - Default CF - Size: 23.15MB, KeyIn: 21M, KeyDrop: 16M, Flush: 7.538GB
>>>>>>   - [M] CF - Size: 60.89MB, KeyIn: 277M, KeyDrop: 250M, Flush: 8.884GB
>>>>>>     <-- with this workload this is pg log and dup op kv entries
>>>>>>   - [L] CF - Size: 1.12MB, KeyIn: 188K, KeyDrop: 83K, Flush: 0.331GB
>>>>>>     <-- deferred writes
>>>>>> - min_pg_log_entries = 1, trim = 1, iops = 23.8K
>>>>>>   - Default CF - Size: 68.58MB, KeyIn: 22M, KeyDrop: 17M, Flush: 7.936GB
>>>>>>   - [M] CF - Size: 96.39MB, KeyIn: 302M, KeyDrop: 262M, Flush: 9.289GB
>>>>>>     <-- with this workload this is pg log and dup op kv entries
>>>>>>   - [L] CF - Size: 1.04MB, KeyIn: 209K, KeyDrop: 92K, Flush: 0.368GB
>>>>>>     <-- deferred writes
>>>>>> - min_pg_log_entries = 3000, trim = 1, iops = 25.8K
>>>>
>>>> Hi Mark, do you extract the above results from the compaction stats in
>>>> the RocksDB LOG?
>>>
>>> Correct, except for the IOPS numbers which were from the client
>>> benchmark.
>>>
>>>> ** Compaction Stats [default] **
>>>> Level Files    Size    Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
>>>> -----------------------------------------------------------------------------------------------------------------------------------------------------
>>>>  L0    6/0  270.47 MB   1.1      0.0    0.0      0.0       0.2      0.2       0.0   1.0      0.0    154.3         1         4    0.329     0       0
>>>>  L1    3/0  190.94 MB   0.7      0.0    0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0         0         0    0.000     0       0
>>>> Sum    9/0  461.40 MB   0.0      0.0    0.0      0.0       0.2      0.2       0.0   1.0      0.0    154.3         1         4    0.329     0       0
>>>> Int    0/0    0.00 KB   0.0      0.0    0.0      0.0       0.2      0.2       0.0   1.0      0.0    154.3         1         4    0.329     0       0
>>>> Uptime(secs): 9.9 total, 9.9 interval
>>>> Flush(GB): cumulative 0.198, interval 0.198
>>>>
>>>>> Note specifically how the KeyIn rate drops with min_pg_log_entries
>>>>> increased (ie disable dup_ops) and with pginfo hacked out.
>>>>> I suspect that commenting out log_operation would reduce the KeyIn
>>>>> rate significantly further. Again, these drives can absorb a lot of
>>>>> this, so the improvement in iops is fairly modest (and setting
>>>>> min_pg_log_entries low actually hurts!), but this isn't just about
>>>>> performance, it's about the behavior that we invoke. The PetStore
>>>>> results absolutely show us that on very fast storage we see a
>>>>> dramatic CPU usage reduction by removing log_operation and pginfo,
>>>>> so I think we should focus on what kind of behavior we want
>>>>> pglog/pginfo/dup_ops to invoke.
>>>>>
>>>>> Mark
>>>>>
>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Li Wang
>>>>>>>>
>>>>>>>> 2018-04-02 13:29 GMT+08:00 xiaoyan li <wisher2003@xxxxxxxxx>:
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> Based on your above discussion about pglog, I have the following
>>>>>>>>> rough design. Please help to give your suggestions.
>>>>>>>>>
>>>>>>>>> There will be three partitions: a raw part for customer IOs,
>>>>>>>>> BlueFS for RocksDB, and a pglog partition.
>>>>>>>>> The former two partitions are the same as current. The pglog
>>>>>>>>> partition is split into 1M blocks. We allocate blocks for ring
>>>>>>>>> buffers per pg. We will have the following data:
>>>>>>>>>
>>>>>>>>> Allocation bitmap (just in memory)
>>>>>>>>>
>>>>>>>>> The pglog partition has a bitmap to record which block is
>>>>>>>>> allocated or not. We can rebuild it from pg->allocated_block_list
>>>>>>>>> when starting, so there is no need to store it on persistent
>>>>>>>>> disk. But we will store basic information about the pglog
>>>>>>>>> partition in RocksDB, like block size, block number etc., when
>>>>>>>>> the objectstore is initialized.
>>>>>>>>>
>>>>>>>>> Pg -> allocated_blocks_list
>>>>>>>>>
>>>>>>>>> When a pg is created and IOs start, we can allocate a block for
>>>>>>>>> every pg. Every pglog entry is less than 300 bytes, so 1M can
>>>>>>>>> store 3495 entries. When the total pglog entries increase and
>>>>>>>>> exceed that number, we can add a new block to the pg.
>>>>>>>>>
>>>>>>>>> Pg->start_position
>>>>>>>>>
>>>>>>>>> Records the oldest valid entry per pg.
>>>>>>>>>
>>>>>>>>> Pg->next_position
>>>>>>>>>
>>>>>>>>> Records the next entry to add per pg. This data will be updated
>>>>>>>>> frequently, but RocksDB is suitable for its IO mode, and most of
>>>>>>>>> the data will be merged.
>>>>>>>>>
>>>>>>>>> Updated BlueStore write progress:
>>>>>>>>>
>>>>>>>>> When writing data to disk (before the metadata update), we can
>>>>>>>>> append the pglog entry to its ring buffer in parallel.
>>>>>>>>> After that, submit pg ring buffer changes like pg->next_position,
>>>>>>>>> together with the other current metadata changes, to RocksDB.
>>>>>>>>>
>>>>>>>>> On Fri, Mar 30, 2018 at 6:23 PM, Varada Kari
>>>>>>>>> <varada.kari@xxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>> On Fri, Mar 30, 2018 at 1:01 PM, Li Wang
>>>>>>>>>> <laurence.liwang@xxxxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>   If we wanna store the pg log in a standalone ring buffer,
>>>>>>>>>>> another candidate is the deferred write: why not use the ring
>>>>>>>>>>> buffer as the journal for 4K random writes? It should be much
>>>>>>>>>>> more lightweight than rocksdb.
>>>>>>>>>>>
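As a rough illustration of the per-PG ring-buffer metadata in the design
quoted above, here is a minimal sketch with hypothetical names and layout;
the real structures would live in BlueStore, and the positions would be
persisted through RocksDB as described.

// Sketch only: illustrative names, not an actual Ceph API.
#include <cstdint>
#include <map>
#include <vector>

constexpr uint64_t PGLOG_BLOCK_SIZE = 1ull << 20;   // 1M blocks

struct PGLogPosition {
  uint32_t block_index = 0;   // index into allocated_blocks
  uint32_t offset = 0;        // byte offset inside that block
};

struct PGLogRing {
  std::vector<uint64_t> allocated_blocks;  // disk offsets of 1M blocks
  PGLogPosition start_position;  // oldest valid entry, advanced on trim
  PGLogPosition next_position;   // next append position, committed to
                                 // RocksDB with each transaction
};

struct PGLogPartition {
  std::vector<bool> bitmap;                    // in-memory allocation bitmap
  std::map<uint64_t /*pgid*/, PGLogRing> rings;

  // Rebuilding 'bitmap' at startup just walks every ring's
  // allocated_blocks, so it never has to be persisted.
  uint64_t allocate_block() {
    for (uint64_t i = 0; i < bitmap.size(); ++i) {
      if (!bitmap[i]) {
        bitmap[i] = true;
        return i * PGLOG_BLOCK_SIZE;   // offset within the pglog partition
      }
    }
    return UINT64_MAX;                 // partition full
  }
};

The write path would then append the encoded entry at next_position in
parallel with the data write, and commit the advanced next_position (plus
the other metadata changes) to RocksDB in the same transaction.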
>>>>>>>>>> It will be similar to the FileStore implementation for small
>>>>>>>>>> writes. That comes with the same alignment issues and the
>>>>>>>>>> attendant write amplification. RocksDB abstracts that nicely, and
>>>>>>>>>> we don't make it to L0 files because of the WAL handling.
>>>>>>>>>>
>>>>>>>>>> Varada
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Li Wang
>>>>>>>>>>>
>>>>>>>>>>> 2018-03-30 4:04 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 28 Mar 2018, Matt Benjamin wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 28, 2018 at 1:44 PM, Mark Nelson
>>>>>>>>>>>>> <mnelson@xxxxxxxxxx> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 03/28/2018 12:21 PM, Adam C. Emerson wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2) It sure feels like conceptually the pglog should be
>>>>>>>>>>>>>> represented as a per-pg ring buffer rather than key/value
>>>>>>>>>>>>>> data. Maybe there are really important reasons that it
>>>>>>>>>>>>>> shouldn't be, but I don't currently see them. As far as the
>>>>>>>>>>>>>> objectstore is concerned, it seems to me like there are
>>>>>>>>>>>>>> valid reasons to provide some kind of log interface, and
>>>>>>>>>>>>>> perhaps that should be used for pg_log. That sort of opens
>>>>>>>>>>>>>> the door for different object store implementations
>>>>>>>>>>>>>> fulfilling that functionality in whatever ways the author
>>>>>>>>>>>>>> deems fit.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In the reddit lingo, pretty much this. We should be
>>>>>>>>>>>>> concentrating on this direction, or ruling it out.
>>>>>>>>>>>>
>>>>>>>>>>>> Yeah, +1
>>>>>>>>>>>>
>>>>>>>>>>>> It seems like step 1 is a proof of concept branch that encodes
>>>>>>>>>>>> pg_log_entry_t's and writes them to a simple ring buffer. The
>>>>>>>>>>>> first questions to answer are (a) whether this does in fact
>>>>>>>>>>>> improve things significantly and (b) whether we want to have
>>>>>>>>>>>> an independent ring buffer for each PG or try to mix them into
>>>>>>>>>>>> one big one for the whole OSD (or maybe per shard).
>>>>>>>>>>>>
>>>>>>>>>>>> The second question is how that fares on HDDs. My guess is
>>>>>>>>>>>> that the current rocksdb strategy is better because it reduces
>>>>>>>>>>>> the number of IOs, and the additional data getting compacted
>>>>>>>>>>>> (and CPU usage) isn't the limiting factor on HDD performance
>>>>>>>>>>>> (IOPS are). (But maybe we'll get lucky and the new strategy
>>>>>>>>>>>> will be best for both HDD and SSD..)
>>>>>>>>>>>>
>>>>>>>>>>>> Then we have to modify PGLog to be a complete implementation.
>>>>>>>>>>>> A strict ring buffer probably won't work because the PG log
>>>>>>>>>>>> might not trim and because log entries are variable length, so
>>>>>>>>>>>> there'll probably need to be some simple mapping table (vs a
>>>>>>>>>>>> trivial start/end ring buffer position) to deal with that. We
>>>>>>>>>>>> have to trim the log periodically, so every so many entries we
>>>>>>>>>>>> may want to realign with a min_alloc_size boundary.
>>>>>>>>>>>> We sometimes have to back up and rewrite divergent portions of
>>>>>>>>>>>> the log (during peering), so we'll need to sort out whether
>>>>>>>>>>>> that is a complete reencode/rewrite or whether we keep encoded
>>>>>>>>>>>> entries in RAM (individually or in chunks), etc etc.
>>>>>>>>>>>>
>>>>>>>>>>>> sage
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best wishes
>>>>>>>>> Lisa
>

--
Best wishes
Lisa
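To make the "simple mapping table" point above concrete, here is a minimal
sketch with illustrative names only (not from any branch): variable-length
encoded entries are indexed by log version, trimmed in bulk, and the freed
space is handed back on min_alloc_size boundaries.

// Sketch only: hypothetical helper, not an actual PGLog interface.
#include <cstdint>
#include <map>

constexpr uint64_t MIN_ALLOC_SIZE = 4096;

struct LogExtent {
  uint64_t offset;   // byte offset within the PG's log space
  uint32_t length;   // encoded pg_log_entry_t length (variable)
};

struct PGLogIndex {
  std::map<uint64_t /*log version*/, LogExtent> entries;
  uint64_t tail = 0;   // first byte still referenced
  uint64_t head = 0;   // next append position

  void append(uint64_t version, uint32_t encoded_len) {
    entries[version] = {head, encoded_len};
    head += encoded_len;
  }

  // Drop every entry older than 'version'; round the new tail down to a
  // min_alloc_size boundary so space is released in whole allocation units.
  void trim_to(uint64_t version) {
    entries.erase(entries.begin(), entries.lower_bound(version));
    uint64_t new_tail =
        entries.empty() ? head : entries.begin()->second.offset;
    tail = new_tail - (new_tail % MIN_ALLOC_SIZE);
  }

  // Divergent entries found during peering are dropped and rewritten from
  // the divergence point.
  void rewrite_from(uint64_t version) {
    auto it = entries.lower_bound(version);
    if (it != entries.end())
      head = it->second.offset;
    entries.erase(it, entries.end());
  }
};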