On Tue, 3 Apr 2018, Li Wang wrote:
> Hi,
> Before we move forward, could someone run a test in which the pglog is
> not written into rocksdb at all, to see how much the performance
> improves as an upper bound? It should be less than the improvement from
> turning on bluestore_debug_omit_kv_commit.

+1

(The PetStore behavior doesn't tell us anything about how BlueStore will
behave without the pglog overhead.)

sage

>
> Cheers,
> Li Wang
>
> 2018-04-02 13:29 GMT+08:00 xiaoyan li <wisher2003@xxxxxxxxx>:
> > Hi all,
> >
> > Based on your discussion about the pglog above, I have the following
> > rough design. Please give your suggestions.
> >
> > There will be three partitions: a raw partition for client IOs, BlueFS
> > for RocksDB, and a pglog partition. The former two are the same as
> > today. The pglog partition is split into 1 MB blocks, and we allocate
> > blocks for a ring buffer per PG. We will keep the following data:
> >
> > Allocation bitmap (in memory only)
> >
> > The pglog partition has a bitmap recording which blocks are allocated.
> > We can rebuild it from pg->allocated_blocks_list at startup, so it does
> > not need to be stored on disk. We will, however, store basic
> > information about the pglog partition (block size, block count, etc.)
> > in RocksDB when the objectstore is initialized.
> >
> > Pg -> allocated_blocks_list
> >
> > When a PG is created and IOs start, we allocate a block for the PG.
> > Every pglog entry is less than 300 bytes, so 1 MB can hold about 3495
> > entries. When the total number of pglog entries exceeds that, we add
> > another block to the PG.
> >
> > Pg -> start_position
> >
> > Records the oldest valid entry per PG.
> >
> > Pg -> next_position
> >
> > Records the next entry to add per PG. This data is updated frequently,
> > but RocksDB is well suited to that IO pattern, and most of the updates
> > will be merged.
> >
> > Updated BlueStore write process:
> >
> > When writing data to disk (before the metadata update), we can append
> > the pglog entry to the PG's ring buffer in parallel. After that, we
> > submit the ring buffer changes (e.g. pg->next_position) together with
> > the other current metadata changes to RocksDB.
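To make the block and position bookkeeping above concrete, here is a minimal
user-space C++ sketch. The 1 MB block size, the in-memory bitmap, the per-PG
block list, and the start/next positions come from the proposal; the type and
function names (BlockAllocator, PgLogRing, append, trim_to) are illustrative
assumptions, not existing BlueStore code, and disk I/O and the RocksDB updates
are left out.

// Sketch of the proposed pglog partition layout: fixed-size blocks, an
// in-memory bitmap of free blocks, and a per-PG list of blocks that
// together form the PG's log buffer.  Illustrative only.
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

constexpr uint64_t BLOCK_SIZE = 1 << 20;   // 1 MB blocks, per the proposal
constexpr uint32_t MAX_ENTRY_SIZE = 300;   // a pglog entry is < 300 bytes

// In-memory allocation bitmap for the pglog partition; rebuilt from the
// per-PG block lists at startup, never persisted.
struct BlockAllocator {
  std::vector<bool> used;
  explicit BlockAllocator(size_t nblocks) : used(nblocks, false) {}
  uint64_t allocate() {
    for (size_t i = 0; i < used.size(); ++i)
      if (!used[i]) { used[i] = true; return i; }
    throw std::runtime_error("pglog partition full");
  }
  void release(uint64_t b) { used[b] = false; }
};

// Per-PG log: a list of allocated blocks plus the oldest-entry and
// next-entry positions, expressed as logical offsets into that list.
struct PgLogRing {
  std::vector<uint64_t> allocated_blocks;  // pg -> allocated_blocks_list
  uint64_t start_position = 0;             // oldest valid entry
  uint64_t next_position = 0;              // where the next entry goes

  // Reserve space for one encoded entry; grab another block when the
  // current ones are full.  Returns the (block, offset-in-block) pair the
  // caller would write to, in parallel with the data write.
  std::pair<uint64_t, uint64_t> append(BlockAllocator& alloc, uint32_t entry_len) {
    if (entry_len == 0 || entry_len > MAX_ENTRY_SIZE)
      throw std::invalid_argument("bad entry length");
    uint64_t in_block = next_position % BLOCK_SIZE;
    if (in_block + entry_len > BLOCK_SIZE)          // don't straddle blocks
      next_position += BLOCK_SIZE - in_block;
    if (next_position + entry_len > allocated_blocks.size() * BLOCK_SIZE)
      allocated_blocks.push_back(alloc.allocate());
    uint64_t block = allocated_blocks[next_position / BLOCK_SIZE];
    uint64_t offset = next_position % BLOCK_SIZE;
    next_position += entry_len;
    return {block, offset};
  }

  // Trim: advance start_position (must not exceed next_position) and free
  // any block that now lies wholly behind it.
  void trim_to(BlockAllocator& alloc, uint64_t new_start) {
    start_position = new_start;
    while (!allocated_blocks.empty() && start_position >= BLOCK_SIZE) {
      alloc.release(allocated_blocks.front());
      allocated_blocks.erase(allocated_blocks.begin());
      start_position -= BLOCK_SIZE;
      next_position -= BLOCK_SIZE;
    }
  }
};

In the full design, the bitmap would be rebuilt from the per-PG block lists at
startup, the (block, offset) returned by append() is where the encoded entry
would be written in parallel with the data write, and the updated
next_position would then be submitted to RocksDB with the rest of the
transaction's metadata.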
> > On Fri, Mar 30, 2018 at 6:23 PM, Varada Kari <varada.kari@xxxxxxxxx> wrote:
> >> On Fri, Mar 30, 2018 at 1:01 PM, Li Wang <laurence.liwang@xxxxxxxxx> wrote:
> >>> Hi,
> >>> If we want to store the pg log in a standalone ring buffer, another
> >>> candidate is the deferred write: why not use the ring buffer as the
> >>> journal for 4K random writes? It should be much more lightweight than
> >>> rocksdb.
> >>>
> >> That would be similar to the FileStore implementation for small writes,
> >> and it comes with the same alignment issues and the associated write
> >> amplification. RocksDB abstracts that nicely, and we don't make it to
> >> the L0 files because of the WAL handling.
> >>
> >> Varada
> >>> Cheers,
> >>> Li Wang
> >>>
> >>>
> >>> 2018-03-30 4:04 GMT+08:00 Sage Weil <sweil@xxxxxxxxxx>:
> >>>> On Wed, 28 Mar 2018, Matt Benjamin wrote:
> >>>>> On Wed, Mar 28, 2018 at 1:44 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> >>>>> > On 03/28/2018 12:21 PM, Adam C. Emerson wrote:
> >>>>> >
> >>>>> > 2) It sure feels like conceptually the pglog should be represented
> >>>>> > as a per-pg ring buffer rather than key/value data. Maybe there are
> >>>>> > really important reasons that it shouldn't be, but I don't currently
> >>>>> > see them. As far as the objectstore is concerned, it seems to me
> >>>>> > like there are valid reasons to provide some kind of log interface,
> >>>>> > and perhaps that should be used for pg_log. That sort of opens the
> >>>>> > door for different object store implementations fulfilling that
> >>>>> > functionality in whatever ways the author deems fit.
> >>>>>
> >>>>> In the reddit lingo, pretty much this. We should be concentrating on
> >>>>> this direction, or ruling it out.
> >>>>
> >>>> Yeah, +1
> >>>>
> >>>> It seems like step 1 is a proof-of-concept branch that encodes
> >>>> pg_log_entry_t's and writes them to a simple ring buffer. The first
> >>>> questions to answer are (a) whether this does in fact improve things
> >>>> significantly and (b) whether we want an independent ring buffer for
> >>>> each PG or try to mix them into one big one for the whole OSD (or
> >>>> maybe one per shard).
> >>>>
> >>>> The second question is how that fares on HDDs. My guess is that the
> >>>> current rocksdb strategy is better there, because it reduces the number
> >>>> of IOs, and the additional data getting compacted (and the CPU usage)
> >>>> isn't the limiting factor on HDD performance (IOPS are). (But maybe
> >>>> we'll get lucky and the new strategy will be best for both HDD and
> >>>> SSD...)
> >>>>
> >>>> Then we have to modify PGLog to be a complete implementation. A strict
> >>>> ring buffer probably won't work, because the PG log might not trim and
> >>>> because log entries are variable length, so there will probably need to
> >>>> be some simple mapping table (vs a trivial start/end ring buffer
> >>>> position) to deal with that. We have to trim the log periodically, so
> >>>> every so many entries we may want to realign with a min_alloc_size
> >>>> boundary. We sometimes have to back up and rewrite divergent portions
> >>>> of the log (during peering), so we'll need to sort out whether that is
> >>>> a complete re-encode/rewrite or whether we keep encoded entries in RAM
> >>>> (individually or in chunks), etc.
> >>>>
> >>>> sage
> >
> >
> > --
> > Best wishes
> > Lisa
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
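To illustrate the point about variable-length entries in Sage's mail above,
here is a hedged C++ sketch of the "simple mapping table" idea, assuming one
log segment per PG. The offset table, the periodic realignment to a
min_alloc_size boundary, and backing up over a divergent tail come from
Sage's description; the class and method names (PgLogSegment, append,
trim_front, rewind_to) and the 64-entry alignment interval are illustrative
assumptions, not PGLog or BlueStore code.

// Bookkeeping for variable-length encoded pg_log_entry_t's: a plain
// start/end pair is not enough, so a small in-memory table maps entry
// index -> byte offset.  Illustrative only; the on-disk writes are omitted.
#include <cstddef>
#include <cstdint>
#include <deque>
#include <stdexcept>

class PgLogSegment {
  std::deque<uint64_t> offsets_;   // offsets_[i] = byte offset of entry i (front = oldest)
  uint64_t tail_ = 0;              // byte position where the next entry is written
  uint64_t min_alloc_size_;
  uint32_t entries_since_align_ = 0;

public:
  explicit PgLogSegment(uint64_t min_alloc_size) : min_alloc_size_(min_alloc_size) {}

  // Append an encoded entry of 'len' bytes; every 64 entries, pad the tail
  // out to a min_alloc_size boundary so a later trim can start on one.
  uint64_t append(uint64_t len) {
    uint64_t off = tail_;
    offsets_.push_back(off);
    tail_ += len;
    if (++entries_since_align_ >= 64) {
      tail_ = (tail_ + min_alloc_size_ - 1) / min_alloc_size_ * min_alloc_size_;
      entries_since_align_ = 0;
    }
    return off;                    // caller writes the encoded entry here
  }

  // Trim the oldest 'n' entries; the space before the new front entry's
  // offset becomes reclaimable.
  void trim_front(size_t n) {
    if (n > offsets_.size()) throw std::out_of_range("trimming more entries than exist");
    offsets_.erase(offsets_.begin(), offsets_.begin() + n);
  }

  // Back up so that only the first 'keep' entries remain, discarding the
  // divergent tail so peering can rewrite it; later appends reuse the space.
  void rewind_to(size_t keep) {
    if (keep > offsets_.size()) throw std::out_of_range("rewind past end");
    if (keep < offsets_.size()) tail_ = offsets_[keep];
    offsets_.resize(keep);
  }

  size_t size() const { return offsets_.size(); }
};

A bare start/end pair would only work if entries were fixed-size; the
per-entry offsets are what let trimming and the divergent-tail rewind land
exactly on entry boundaries.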