On Tue, Apr 3, 2018 at 11:15 PM, Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote:
>
> On 04/03/2018 09:56 AM, Mark Nelson wrote:
>>
>> On 04/03/2018 08:27 AM, Sage Weil wrote:
>>>
>>> On Tue, 3 Apr 2018, Li Wang wrote:
>>>>
>>>> Hi,
>>>> Before we move forward, could someone run a test in which the pglog is not written into rocksdb at all, to see how much performance improvement that gives as an upper bound? It should be less than turning on bluestore_debug_omit_kv_commit.
>>>
>>> +1
>>>
>>> (The PetStore behavior doesn't tell us anything about how BlueStore will behave without the pglog overhead.)
>>>
>>> sage
>>
>> We do have some testing of bluestore's behavior, though it's about 6 months old now:
>>
>>   - ~1 hour 4K random overwrites to RBD on 1 NVMe OSD
>>   - 128 PGs
>>   - stats are sloppy since they only appear every ~10 mins
>>
>>   - default min_pg_log_entries = 1500, trim = default, iops = 26.6K
>>       - Default CF - Size: 65.63MB, KeyIn: 22M, KeyDrop: 17M, Flush: 7.858GB
>>       - [M] CF - Size: 118.09MB, KeyIn: 302M, KeyDrop: 269M, Flush: 15.847GB  <-- with this workload this is pg log and dup op kv entries
>>       - [L] CF - Size: 1.00MB, KeyIn: 181K, KeyDrop: 80K, Flush: 0.320GB  <-- deferred writes
>>
>>   - min_pg_log_entries = 10, trim = 10, iops = 24.2K
>>       - Default CF - Size: 23.15MB, KeyIn: 21M, KeyDrop: 16M, Flush: 7.538GB
>>       - [M] CF - Size: 60.89MB, KeyIn: 277M, KeyDrop: 250M, Flush: 8.884GB  <-- with this workload this is pg log and dup op kv entries
>>       - [L] CF - Size: 1.12MB, KeyIn: 188K, KeyDrop: 83K, Flush: 0.331GB  <-- deferred writes
>>
>>   - min_pg_log_entries = 1, trim = 1, iops = 23.8K
>>       - Default CF - Size: 68.58MB, KeyIn: 22M, KeyDrop: 17M, Flush: 7.936GB
>>       - [M] CF - Size: 96.39MB, KeyIn: 302M, KeyDrop: 262M, Flush: 9.289GB  <-- with this workload this is pg log and dup op kv entries
>>       - [L] CF - Size: 1.04MB, KeyIn: 209K, KeyDrop: 92K, Flush: 0.368GB  <-- deferred writes
>>
>>   - min_pg_log_entries = 3000, trim = 1, iops = 25.8K
>>
>> The actual performance variation here I think is much less important than the KeyIn behavior. The NVMe devices in these tests are fast enough to absorb a fair amount of overhead.
>
> Ugh, sorry. That will teach me to talk in a meeting and paste at the same time. Those were the wrong stats.
> Here are the right ones:
>
>>   - ~1 hour 4K random overwrites to RBD on 1 NVMe OSD
>>   - 128 PGs
>>   - stats are sloppy since they only appear every ~10 mins
>>
>>   - min_pg_log_entries = 3000, trim = default, pginfo hack, iops = 27.8K
>>       - Default CF - Size: 23.15MB, KeyIn: 24M, KeyDrop: 19M, Flush: 8.662GB
>>       - [M] CF - Size: 159.97MB, KeyIn: 162M, KeyDrop: 139M, Flush: 10.335GB  <-- with this workload this is pg log and dup op kv entries
>>       - [L] CF - Size: 1.39MB, KeyIn: 201K, KeyDrop: 89K, Flush: 0.355GB  <-- deferred writes
>>
>>   - min_pg_log_entries = 3000, trim = default, iops = 28.3K
>>       - Default CF - Size: 23.13MB, KeyIn: 25M, KeyDrop: 19M, Flush: 8.762GB
>>       - [M] CF - Size: 159.97MB, KeyIn: 202M, KeyDrop: 175M, Flush: 16.890GB  <-- with this workload this is pg log and dup op kv entries
>>       - [L] CF - Size: 0.86MB, KeyIn: 201K, KeyDrop: 89K, Flush: 0.355GB  <-- deferred writes
>>
>>   - default min_pg_log_entries = 1500, trim = default, iops = 26.6K
>>       - Default CF - Size: 65.63MB, KeyIn: 22M, KeyDrop: 17M, Flush: 7.858GB
>>       - [M] CF - Size: 118.09MB, KeyIn: 302M, KeyDrop: 269M, Flush: 15.847GB  <-- with this workload this is pg log and dup op kv entries
>>       - [L] CF - Size: 1.00MB, KeyIn: 181K, KeyDrop: 80K, Flush: 0.320GB  <-- deferred writes
>>
>>   - min_pg_log_entries = 10, trim = 10, iops = 24.2K
>>       - Default CF - Size: 23.15MB, KeyIn: 21M, KeyDrop: 16M, Flush: 7.538GB
>>       - [M] CF - Size: 60.89MB, KeyIn: 277M, KeyDrop: 250M, Flush: 8.884GB  <-- with this workload this is pg log and dup op kv entries
>>       - [L] CF - Size: 1.12MB, KeyIn: 188K, KeyDrop: 83K, Flush: 0.331GB  <-- deferred writes
>>
>>   - min_pg_log_entries = 1, trim = 1, iops = 23.8K
>>       - Default CF - Size: 68.58MB, KeyIn: 22M, KeyDrop: 17M, Flush: 7.936GB
>>       - [M] CF - Size: 96.39MB, KeyIn: 302M, KeyDrop: 262M, Flush: 9.289GB  <-- with this workload this is pg log and dup op kv entries
>>       - [L] CF - Size: 1.04MB, KeyIn: 209K, KeyDrop: 92K, Flush: 0.368GB  <-- deferred writes
>>
>>   - min_pg_log_entries = 3000, trim = 1, iops = 25.8K

Hi Mark, do you extract the above results from the compaction stats in the RocksDB LOG?

** Compaction Stats [default] **
Level  Files   Size       Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0     6/0   270.47 MB    1.1       0.0     0.0       0.0        0.2       0.2        0.0    1.0       0.0     154.3          1          4     0.329      0        0
  L1     3/0   190.94 MB    0.7       0.0     0.0       0.0        0.0       0.0        0.0    0.0       0.0       0.0          0          0     0.000      0        0
 Sum     9/0   461.40 MB    0.0       0.0     0.0       0.0        0.2       0.2        0.0    1.0       0.0     154.3          1          4     0.329      0        0
 Int     0/0     0.00 KB    0.0       0.0     0.0       0.0        0.2       0.2        0.0    1.0       0.0     154.3          1          4     0.329      0        0
Uptime(secs): 9.9 total, 9.9 interval
Flush(GB): cumulative 0.198, interval 0.198

>
> Note specifically how the KeyIn rate drops with min_pg_log_entries increased (i.e. dup_ops disabled) and pginfo hacked out. I suspect that commenting out log_operation would reduce the KeyIn rate significantly further. Again, these drives can absorb a lot of this, so the improvement in iops is fairly modest (and setting min_pg_log_entries low actually hurts!), but this isn't just about performance, it's about the behavior that we invoke. The PetStore results absolutely show us that on very fast storage we see a dramatic CPU usage reduction by removing log_operation and pginfo, so I think we should focus on what kind of behavior we want pglog/pginfo/dup_ops to invoke.
>
> Mark
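To make the numbers above a bit more concrete, here is a rough, purely illustrative C++ sketch of where the [M] column family KeyIn/KeyDrop traffic comes from in this workload: every write carries a pg log entry and a pg info update, while trimming deletes old log entries and (with dup op tracking enabled) adds small dup entries. The key names and encodings below are made up for illustration and are not the real BlueStore/PGLog key format.

    // Illustrative sketch only (key names and encodings are made up, not the
    // real BlueStore/PGLog format): roughly the kv batch that accompanies a
    // single client write into the [M] column family.
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct KvBatch {
      std::map<std::string, std::string> sets;  // shows up as KeyIn in the [M] CF
      std::vector<std::string> rm_keys;         // eventually shows up as KeyDrop
    };

    KvBatch per_write_metadata(uint64_t version, uint64_t trim_to) {
      KvBatch b;
      // One pg log entry and a pg info update accompany every write.
      b.sets["log." + std::to_string(version)] = "<encoded pg_log_entry_t>";
      b.sets["info"]                           = "<encoded pg_info_t>";
      // When the log is trimmed, old entries are deleted and (with dup op
      // tracking enabled) replaced by much smaller dup entries.
      if (trim_to > 0) {
        b.rm_keys.push_back("log." + std::to_string(trim_to));
        b.sets["dup." + std::to_string(trim_to)] = "<encoded pg_log_dup_t>";
      }
      return b;
    }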
>>
>>>
>>>> Cheers,
>>>> Li Wang
>>>>
>>>> 2018-04-02 13:29 GMT+08:00 xiaoyan li <wisher2003@xxxxxxxxx>:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> Based on your above discussion about pglog, I have the following rough design. Please give your suggestions.
>>>>>
>>>>> There will be three partitions: a raw partition for customer IOs, BlueFS for RocksDB, and a pglog partition.
>>>>> The former two partitions are the same as today. The pglog partition is split into 1M blocks, and we allocate blocks for a ring buffer per pg.
>>>>> We will have the following data:
>>>>>
>>>>> Allocation bitmap (just in memory)
>>>>>
>>>>> The pglog partition has a bitmap to record which blocks are allocated. We can rebuild it from pg->allocated_blocks_list at startup, so there is no need to store it on disk. But we will store basic information about the pglog partition in RocksDB, like block size, block count, etc., when the objectstore is initialized.
>>>>>
>>>>> pg->allocated_blocks_list
>>>>>
>>>>> When a pg is created and IOs start, we allocate a block for the pg. Every pglog entry is less than 300 bytes, so 1M can store 3495 entries. When the total number of pglog entries exceeds that, we add a new block to the pg.
>>>>>
>>>>> pg->start_position
>>>>>
>>>>> Records the oldest valid entry per pg.
>>>>>
>>>>> pg->next_position
>>>>>
>>>>> Records the next entry to add per pg. This data is updated frequently, but RocksDB is well suited to that io pattern, and most of the updates will be merged.
>>>>>
>>>>> Updated BlueStore write process:
>>>>>
>>>>> While writing data to disk (before the metadata update), we can append the pglog entry to its ring buffer in parallel.
>>>>> After that, submit the pg ring buffer changes like pg->next_position, together with the other current metadata changes, to RocksDB.

(A rough sketch of this bookkeeping appears further down in this mail.)

>>>>>
>>>>> On Fri, Mar 30, 2018 at 6:23 PM, Varada Kari <varada.kari@xxxxxxxxx> wrote:
>>>>>>
>>>>>> On Fri, Mar 30, 2018 at 1:01 PM, Li Wang <laurence.liwang@xxxxxxxxx> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>> If we want to store the pg log in a standalone ring buffer, another candidate is the deferred write path: why not use the ring buffer as the journal for 4K random writes? It should be much more lightweight than rocksdb.
>>>>>>>
>>>>>> That would be similar to the FileStore implementation for small writes, and it comes with the same alignment issues and the resulting write amplification. Rocksdb nicely abstracts that, and the data doesn't make it to the L0 files because of the WAL handling.
>>>>>>
>>>>>> Varada
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Li Wang
>>>>>>>
>>>>>>> 2018-03-30 4:04 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
>>>>>>>>
>>>>>>>> On Wed, 28 Mar 2018, Matt Benjamin wrote:
>>>>>>>>>
>>>>>>>>> On Wed, Mar 28, 2018 at 1:44 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>> On 03/28/2018 12:21 PM, Adam C. Emerson wrote:
>>>>>>>>>>
>>>>>>>>>> 2) It sure feels like conceptually the pglog should be represented as a per-pg ring buffer rather than key/value data. Maybe there are really important reasons that it shouldn't be, but I don't currently see them. As far as the objectstore is concerned, it seems to me like there are valid reasons to provide some kind of log interface, and perhaps that should be used for pg_log. That sort of opens the door for different object store implementations fulfilling that functionality in whatever ways the author deems fit.
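For concreteness, here is a minimal, hypothetical sketch of the per-PG ring buffer bookkeeping along the lines Lisa proposes above (1M blocks carved out of a pglog partition, with per-pg start/next positions, and a log interface of the kind Mark describes). None of these names exist in Ceph today; this is an illustration, not an implementation.

    // Hypothetical per-PG ring buffer bookkeeping for a dedicated pglog
    // partition; block size, struct and function names are illustrative only.
    #include <cstdint>
    #include <vector>

    constexpr uint64_t BLOCK_SIZE = 1ull << 20;  // 1M blocks from the pglog partition

    struct LogPos {
      uint32_t block_index;  // index into allocated_blocks
      uint32_t offset;       // byte offset within that block
    };

    struct PgLogRing {
      std::vector<uint64_t> allocated_blocks;  // physical block numbers, oldest first
      LogPos start{0, 0};                      // oldest valid entry (advanced by trim)
      LogPos next{0, 0};                       // where the next entry will be appended
    };

    // Returns the absolute partition offset at which to write the next entry,
    // growing the ring by one block when the entry does not fit in the current
    // block. 'alloc_block' stands in for consulting the in-memory allocation bitmap.
    uint64_t append_offset(PgLogRing& ring, uint32_t entry_len,
                           uint64_t (*alloc_block)()) {
      if (ring.allocated_blocks.empty() ||
          ring.next.offset + entry_len > BLOCK_SIZE) {
        ring.allocated_blocks.push_back(alloc_block());
        ring.next = {uint32_t(ring.allocated_blocks.size() - 1), 0};
      }
      uint64_t off = ring.allocated_blocks[ring.next.block_index] * BLOCK_SIZE
                     + ring.next.offset;
      ring.next.offset += entry_len;  // persisted via rocksdb with the rest of the txn
      return off;
    }

As in the proposal above, only the small start/next positions and the block list would still go through rocksdb; the encoded entries themselves land in the ring buffer blocks.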
>>>>>>>>>
>>>>>>>>> In the reddit lingo, pretty much this. We should be concentrating on this direction, or ruling it out.
>>>>>>>>
>>>>>>>> Yeah, +1
>>>>>>>>
>>>>>>>> It seems like step 1 is a proof-of-concept branch that encodes pg_log_entry_t's and writes them to a simple ring buffer. The first questions to answer are (a) whether this does in fact improve things significantly and (b) whether we want to have an independent ring buffer for each PG or try to mix them into one big one for the whole OSD (or maybe per shard).
>>>>>>>>
>>>>>>>> The second question is how that fares on HDDs. My guess is that the current rocksdb strategy is better there because it reduces the number of IOs, and the additional data getting compacted (and CPU usage) isn't the limiting factor on HDD performance (IOPS are). (But maybe we'll get lucky and the new strategy will be best for both HDD and SSD..)
>>>>>>>>
>>>>>>>> Then we have to modify PGLog to be a complete implementation. A strict ring buffer probably won't work because the PG log might not trim and because log entries are variable length, so there will probably need to be some simple mapping table (vs a trivial start/end ring buffer position) to deal with that. We have to trim the log periodically, so every so many entries we may want to realign with a min_alloc_size boundary. We sometimes have to back up and rewrite divergent portions of the log (during peering), so we'll need to sort out whether that is a complete reencode/rewrite or whether we keep encoded entries in RAM (individually or in chunks), etc.
>>>>>>>>
>>>>>>>> sage
>>>>>
>>>>> --
>>>>> Best wishes
>>>>> Lisa
>>>>
>>

--
Best wishes
Lisa
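Regarding Sage's point above that variable-length entries need a simple mapping table rather than a bare start/end ring position: a hypothetical sketch (again, not existing Ceph code) of what a proof of concept could keep in memory, with a version -> (offset, length) index over a flat log region and realignment to a min_alloc_size boundary when trimming. All names and the 4K boundary here are assumptions for illustration.

    // Hypothetical proof-of-concept sketch: append variable-length encoded pg
    // log entries to a flat log region and keep a version -> (offset, length)
    // index so individual entries can be located and divergent tails rewritten.
    #include <algorithm>
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct EntryRef { uint64_t offset; uint32_t length; };

    struct PgLogRegion {
      std::vector<char> data;              // stand-in for the on-disk log region
      std::map<uint64_t, EntryRef> index;  // version -> location of encoded entry
      uint64_t tail = 0;                   // next append offset

      static constexpr uint64_t min_alloc_size = 4096;  // assumed alignment

      void append(uint64_t version, const std::string& encoded_entry) {
        if (data.size() < tail + encoded_entry.size())
          data.resize(tail + encoded_entry.size());
        std::copy(encoded_entry.begin(), encoded_entry.end(), data.begin() + tail);
        index[version] = {tail, uint32_t(encoded_entry.size())};
        tail += encoded_entry.size();
      }

      // Drop entries up to and including 'version' from the index and realign
      // the next append to a min_alloc_size boundary, as suggested above.
      void trim_to(uint64_t version) {
        index.erase(index.begin(), index.upper_bound(version));
        tail = (tail + min_alloc_size - 1) / min_alloc_size * min_alloc_size;
        if (data.size() < tail)
          data.resize(tail);
      }
    };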