Thanks for the feedback. I am going to start the prototype.

On Tue, Apr 3, 2018 at 2:00 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
> On 04/01/2018 10:29 PM, xiaoyan li wrote:
>>
>> Hi all,
>>
>> Based on the discussion above about pglog, I have the following rough
>> design. Please give me your suggestions.
>>
>> There will be three partitions: a raw partition for customer IOs, Bluefs
>> for Rocksdb, and a pglog partition.
>> The first two partitions are the same as today. The pglog partition is
>> split into 1M blocks. We allocate blocks for ring buffers per pg.
>> We will have the following data:
>
>
> This isn't relevant for prototyping this, but a partition just used for
> pg logs shouldn't be another piece an admin needs to set up. Since
> this optimization is only applicable for all-flash scenarios, bluestore
> can hide the internal structure and allocate the separate pg log space
> itself, within the data device.

Yes, exactly. Bluestore can handle it inside.

>
>> Allocation bitmap (just in memory)
>>
>> The pglog partition has a bitmap to record which blocks are allocated.
>> We can rebuild it from pg->allocated_blocks_list at startup, so there
>> is no need to store it on persistent disk. But we will store basic
>> information about the pglog partition in Rocksdb, such as block size
>> and block count, when the objectstore is initialized.
>>
>> Pg->allocated_blocks_list
>>
>> When a pg is created and IOs start, we allocate one block for the pg.
>> Every pglog entry is less than 300 bytes, so a 1M block can store 3495
>> entries. When the total number of pglog entries exceeds that, we
>> add a new block to the pg.
>>
>> Pg->start_position
>>
>> Records the oldest valid entry per pg.
>>
>> Pg->next_position
>>
>> Records where the next entry will be added per pg. This value will be
>> updated frequently, but Rocksdb suits this IO pattern, and most of
>> the updates will be merged.
>>
>> Updated Bluestore write process:
>>
>> When writing data to disk (before the metadata update), we can append
>> the pglog entry to its ring buffer in parallel.
>> After that, submit the pg ring buffer changes, such as pg->next_position,
>> together with the other current metadata changes to Rocksdb.
>
>
> This sounds good to me.
>
> Josh

--
Best wishes
Lisa
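For reference while prototyping, here is a minimal in-memory sketch of the bookkeeping described in the quoted design: the allocation bitmap rebuilt from each pg's block list, one 1M block allocated per pg on first use, and a further block added once the current one is full. It is only an illustration under stated assumptions, not Bluestore code. The class name PgLogSpace, its methods, and the pg_meta dictionary (standing in for what would be stored in Rocksdb) are all hypothetical, and the positions are modelled as monotonically increasing entry counters, which is a simplification of a true ring buffer.

```python
BLOCK_SIZE = 1 << 20                          # 1M blocks, as in the proposal
ENTRY_SIZE = 300                              # upper bound per pglog entry, per the proposal
ENTRIES_PER_BLOCK = BLOCK_SIZE // ENTRY_SIZE  # 3495 fixed-size slots per block


class PgLogSpace:
    """Hypothetical in-memory model of the pglog space (not Bluestore code)."""

    def __init__(self, num_blocks, pg_meta=None):
        # pg_meta stands in for the per-pg records kept in Rocksdb:
        # {pg_id: {"blocks": [block ids], "start": int, "next": int}}
        self.num_blocks = num_blocks
        self.pg_meta = pg_meta if pg_meta is not None else {}
        # The bitmap is never persisted; it is derived at startup.
        self.bitmap = self.rebuild_bitmap()

    def rebuild_bitmap(self):
        """Mark every block that appears in some pg's allocated block list."""
        bitmap = [False] * self.num_blocks
        for meta in self.pg_meta.values():
            for block in meta["blocks"]:
                bitmap[block] = True
        return bitmap

    def _allocate_block(self):
        """Return the first free block and mark it as used."""
        for block, used in enumerate(self.bitmap):
            if not used:
                self.bitmap[block] = True
                return block
        raise RuntimeError("pglog space exhausted")

    def append(self, pg_id):
        """Reserve the slot for the next entry; return (block, byte offset)."""
        meta = self.pg_meta.setdefault(
            pg_id, {"blocks": [], "start": 0, "next": 0})
        # Logical index of the first slot in blocks[0] (assumption: fully
        # trimmed head blocks are always released, see trim()).
        base = (meta["start"] // ENTRIES_PER_BLOCK) * ENTRIES_PER_BLOCK
        idx, slot = divmod(meta["next"] - base, ENTRIES_PER_BLOCK)
        if idx == len(meta["blocks"]):
            # Current blocks are full (or none exist yet): add a new block.
            meta["blocks"].append(self._allocate_block())
        meta["next"] += 1
        return meta["blocks"][idx], slot * ENTRY_SIZE

    def trim(self, pg_id, new_start):
        """Advance start_position and release blocks that hold no valid entry."""
        meta = self.pg_meta[pg_id]
        if not meta["start"] <= new_start <= meta["next"]:
            raise ValueError("new_start outside the valid range")
        freed = (new_start // ENTRIES_PER_BLOCK
                 - meta["start"] // ENTRIES_PER_BLOCK)
        for block in meta["blocks"][:freed]:
            self.bitmap[block] = False
        del meta["blocks"][:freed]
        meta["start"] = new_start
```

In the write path from the proposal, the entry itself would be written at the location returned by append() in parallel with the data write, and only the small per-pg record (next_position and, when it changes, the block list) would go to Rocksdb with the rest of the metadata.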