On Tue, Apr 3, 2018 at 12:50 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> Indeed. There was a moderate performance improvement (maybe 10-20%) but a
> dramatic reduction in CPU overhead. Note however that bluestore/rocksdb
> will likely show different bottlenecks and performance implications than
> petstore did.

May I ask what petstore is?

>
> Mark
>
>
> On 04/02/2018 11:03 PM, Varada Kari (System Engineer) wrote:
>>
>> I think Mark tested with MemStore. It should be in one of the
>> performance meeting notes, with the results and a link. Please check for
>> PetStore.
>>
>> Varada
>>
>> On Tue, Apr 3, 2018 at 9:15 AM, Li Wang <laurence.liwang@xxxxxxxxx> wrote:
>>>
>>> Hi,
>>> Before we move forward, could someone run a test where the pglog is
>>> not written into rocksdb at all, to see how much the performance
>>> improvement is as an upper bound? It should be less than what we get
>>> by turning on bluestore_debug_omit_kv_commit.
>>>
>>> Cheers,
>>> Li Wang
>>>
>>> 2018-04-02 13:29 GMT+08:00 xiaoyan li <wisher2003@xxxxxxxxx>:
>>>>
>>>> Hi all,
>>>>
>>>> Based on your discussion above about pglog, I have the following rough
>>>> design. Please give your suggestions.
>>>>
>>>> There will be three partitions: a raw partition for customer IOs, BlueFS
>>>> for RocksDB, and a pglog partition.
>>>> The former two partitions are the same as today. The pglog partition is
>>>> split into 1 MB blocks, and we allocate blocks to a ring buffer per pg.
>>>> We will keep the following data:
>>>>
>>>> Allocation bitmap (in memory only)
>>>>
>>>> The pglog partition has a bitmap recording which blocks are allocated.
>>>> We can rebuild it from every pg's allocated_blocks_list at startup, so
>>>> there is no need to persist it. But we will store basic information
>>>> about the pglog partition in RocksDB, like block size and block count,
>>>> when the objectstore is initialized.
>>>>
>>>> Pg -> allocated_blocks_list
>>>>
>>>> When a pg is created and IOs start, we allocate a block for the pg.
>>>> Every pglog entry is less than 300 bytes, so 1 MB can store 3495
>>>> entries. When the total number of pglog entries exceeds that, we add a
>>>> new block to the pg.
>>>>
>>>> Pg -> start_position
>>>>
>>>> Records the oldest valid entry per pg.
>>>>
>>>> Pg -> next_position
>>>>
>>>> Records the next entry to add per pg. This data is updated frequently,
>>>> but RocksDB handles its IO pattern well, and most of the updates will
>>>> be merged.
>>>>
>>>> Updated BlueStore write process:
>>>>
>>>> When writing data to disk (before the metadata update), we append the
>>>> pglog entry to the pg's ring buffer in parallel.
>>>> After that, we submit the pg ring buffer changes, like pg->next_position,
>>>> together with the other metadata changes to RocksDB.
>>>>
>>>>
>>>> On Fri, Mar 30, 2018 at 6:23 PM, Varada Kari <varada.kari@xxxxxxxxx>
>>>> wrote:
>>>>>
>>>>> On Fri, Mar 30, 2018 at 1:01 PM, Li Wang <laurence.liwang@xxxxxxxxx>
>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>> If we want to store the pg log in a standalone ring buffer, another
>>>>>> candidate is the deferred write path: why not use the ring buffer as
>>>>>> the journal for 4K random writes? It should be much more lightweight
>>>>>> than rocksdb.
>>>>>>
>>>>> That would be similar to the FileStore implementation for small writes.
>>>>> It comes with the same alignment issues and the resulting write
>>>>> amplification. Rocksdb abstracts that nicely, and the entries don't
>>>>> make it to the L0 files because of the WAL handling.
>>>>>
>>>>> Varada
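To make Lisa's layout above concrete, here is a minimal C++ sketch of the
per-PG ring carved out of a dedicated pglog partition. All names
(PgLogPartition, PgRingState, alloc_block) are made up for the sketch, a flat
in-memory vector stands in for the raw partition, and the RocksDB transaction
that would persist next_position/start_position is only hinted at in a
comment; this is not BlueStore code, just the shape of the idea.

// Hypothetical sketch: per-PG pglog rings carved out of a dedicated partition.
#include <cstdint>
#include <cstring>
#include <deque>
#include <string>
#include <vector>

constexpr uint64_t BLOCK_SIZE = 1 << 20;  // the pglog partition is split into 1 MB blocks

struct PgRingState {
  std::deque<uint64_t> blocks;  // pg -> allocated_blocks_list (block ids, in ring order)
  uint64_t start_pos = 0;       // pg -> start_position: oldest valid entry (logical offset)
  uint64_t next_pos = 0;        // pg -> next_position: where the next entry goes
};

class PgLogPartition {
public:
  explicit PgLogPartition(uint64_t nblocks)
    : bitmap_(nblocks, false), store_(nblocks * BLOCK_SIZE) {}

  // The bitmap lives only in memory; at startup it would be rebuilt by
  // walking every pg's allocated_blocks_list.
  int64_t alloc_block() {
    for (uint64_t i = 0; i < bitmap_.size(); ++i) {
      if (!bitmap_[i]) {
        bitmap_[i] = true;
        return static_cast<int64_t>(i);
      }
    }
    return -1;  // partition full
  }

  // Append one encoded pglog entry (< 300 bytes in practice) to the pg's
  // ring. Returns false if a new block was needed but none was free.
  bool append(PgRingState& pg, const std::string& encoded_entry) {
    bool need_block = pg.blocks.empty() ||
        pg.next_pos == pg.blocks.size() * BLOCK_SIZE ||                // last block exactly full
        pg.next_pos % BLOCK_SIZE + encoded_entry.size() > BLOCK_SIZE;  // entry would straddle blocks
    if (need_block) {
      int64_t b = alloc_block();
      if (b < 0)
        return false;
      pg.blocks.push_back(static_cast<uint64_t>(b));
      pg.next_pos = (pg.blocks.size() - 1) * BLOCK_SIZE;  // jump to the start of the new block
    }
    uint64_t block = pg.blocks[pg.next_pos / BLOCK_SIZE];
    uint64_t off = block * BLOCK_SIZE + pg.next_pos % BLOCK_SIZE;
    std::memcpy(store_.data() + off, encoded_entry.data(), encoded_entry.size());
    pg.next_pos += encoded_entry.size();
    // In the real write path, the new next_pos (and start_pos after a trim)
    // would ride along in the same RocksDB transaction as the rest of the
    // metadata for the IO.
    return true;
  }

private:
  std::vector<bool> bitmap_;  // in-memory only: which 1 MB blocks are allocated
  std::vector<char> store_;   // stands in for the raw pglog partition
};

Trimming is not shown; advancing start_position and returning fully trimmed
blocks to the bitmap would need a little extra bookkeeping on top of this.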
>>>>>>
>>>>>> Cheers,
>>>>>> Li Wang
>>>>>>
>>>>>>
>>>>>> 2018-03-30 4:04 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
>>>>>>>
>>>>>>> On Wed, 28 Mar 2018, Matt Benjamin wrote:
>>>>>>>>
>>>>>>>> On Wed, Mar 28, 2018 at 1:44 PM, Mark Nelson <mnelson@xxxxxxxxxx>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On 03/28/2018 12:21 PM, Adam C. Emerson wrote:
>>>>>>>>>
>>>>>>>>> 2) It sure feels like conceptually the pglog should be represented
>>>>>>>>> as a per-pg ring buffer rather than key/value data. Maybe there are
>>>>>>>>> really important reasons that it shouldn't be, but I don't currently
>>>>>>>>> see them. As far as the objectstore is concerned, it seems to me
>>>>>>>>> like there are valid reasons to provide some kind of log interface,
>>>>>>>>> and perhaps that should be used for pg_log. That sort of opens the
>>>>>>>>> door to different objectstore implementations fulfilling that
>>>>>>>>> functionality in whatever way the author deems fit.
>>>>>>>>
>>>>>>>> In the reddit lingo, pretty much this. We should be concentrating on
>>>>>>>> this direction, or ruling it out.
>>>>>>>
>>>>>>> Yeah, +1
>>>>>>>
>>>>>>> It seems like step 1 is a proof-of-concept branch that encodes
>>>>>>> pg_log_entry_t's and writes them to a simple ring buffer. The first
>>>>>>> questions to answer are (a) whether this does in fact improve things
>>>>>>> significantly and (b) whether we want an independent ring buffer for
>>>>>>> each PG or try to mix them into one big one for the whole OSD (or
>>>>>>> maybe one per shard).
>>>>>>>
>>>>>>> The second question is how that fares on HDDs. My guess is that the
>>>>>>> current rocksdb strategy is better there because it reduces the number
>>>>>>> of IOs, and the additional data getting compacted (and the CPU usage)
>>>>>>> isn't the limiting factor on HDD performance (IOPS are). (But maybe
>>>>>>> we'll get lucky and the new strategy will be best for both HDD and
>>>>>>> SSD.)
>>>>>>>
>>>>>>> Then we have to modify PGLog to be a complete implementation. A strict
>>>>>>> ring buffer probably won't work, because the PG log might not trim and
>>>>>>> because log entries are variable length, so there will probably need
>>>>>>> to be some simple mapping table (vs a trivial start/end ring buffer
>>>>>>> position) to deal with that. We have to trim the log periodically, so
>>>>>>> every so many entries we may want to realign with a min_alloc_size
>>>>>>> boundary. We sometimes have to back up and rewrite divergent portions
>>>>>>> of the log (during peering), so we'll need to sort out whether that is
>>>>>>> a complete reencode/rewrite or whether we keep encoded entries in RAM
>>>>>>> (individually or in chunks), etc.
>>>>>>>
>>>>>>> sage
>>>>
>>>> --
>>>> Best wishes
>>>> Lisa
>>>
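To make the "simple mapping table (vs a trivial start/end ring buffer
position)" idea from Sage's mail concrete, here is a rough C++ sketch under a
few assumptions: entries are keyed by version number, min_alloc_size is 4096,
and the realignment interval is arbitrary. The class and member names are
invented for illustration; this is only the shape of the thing, not the actual
PGLog/ObjectStore interface.

// Hypothetical sketch: a per-PG log area plus a small mapping table,
// rather than a strict start/end ring buffer.
#include <algorithm>
#include <cstdint>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

constexpr uint64_t MIN_ALLOC_SIZE = 4096;  // realign appends to this boundary...
constexpr unsigned REALIGN_EVERY = 128;    // ...once every this many entries

struct Extent {
  uint64_t offset;
  uint32_t length;
};

class PgLogBuffer {
public:
  // Append one encoded pg_log_entry_t, keyed by its version number.
  void append(uint64_t version, const std::string& encoded) {
    if (++appended_since_align_ >= REALIGN_EVERY) {
      // Periodic realignment, so later trims and rewrites can happen on
      // min_alloc_size boundaries without read-modify-write.
      write_pos_ = (write_pos_ + MIN_ALLOC_SIZE - 1) / MIN_ALLOC_SIZE * MIN_ALLOC_SIZE;
      appended_since_align_ = 0;
    }
    if (write_pos_ + encoded.size() > data_.size())
      data_.resize(write_pos_ + encoded.size());  // stands in for the raw log area
    std::copy(encoded.begin(), encoded.end(),
              data_.begin() + static_cast<std::ptrdiff_t>(write_pos_));
    index_[version] = Extent{write_pos_, static_cast<uint32_t>(encoded.size())};
    write_pos_ += encoded.size();
  }

  // Trim everything up to and including 'version'. The log may lag behind
  // (it "might not trim" for a while); space is only reclaimed lazily.
  void trim_to(uint64_t version) {
    index_.erase(index_.begin(), index_.upper_bound(version));
  }

  // During peering, divergent entries newer than 'version' are discarded;
  // the caller then re-encodes and re-appends the authoritative entries.
  void rewind_to(uint64_t version) {
    auto it = index_.upper_bound(version);
    if (it == index_.end())
      return;                        // nothing divergent
    write_pos_ = it->second.offset;  // reuse the space of the divergent tail
    index_.erase(it, index_.end());
  }

private:
  std::vector<char> data_;            // stands in for the on-disk log area
  std::map<uint64_t, Extent> index_;  // the "simple mapping table": version -> extent
  uint64_t write_pos_ = 0;
  unsigned appended_since_align_ = 0;
};

Whether the divergent tail is re-encoded from scratch or re-appended from
encoded entries kept in RAM is exactly the open question above; the sketch
simply reuses the space.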
--
Best wishes
Lisa
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html