Re: storing pg logs outside of rocksdb

Thanks for the feedback, I am going to start the prototype.

On Tue, Apr 3, 2018 at 2:00 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
> On 04/01/2018 10:29 PM, xiaoyan li wrote:
>>
>> Hi all,
>>
>> Based on your above discussion about pglog, I have the following rough
>> design. Please help to give your suggestions.
>>
>> There will be three partitions: a raw partition for customer IOs, Bluefs
>> for Rocksdb, and a pglog partition.
>> The first two partitions are the same as today. The pglog partition is
>> split into 1M blocks. We allocate blocks for ring buffers per pg.
>> We will have the following data:
>
>
> This isn't relevant for prototyping this, but a partition just used for
> pg logs shouldn't be another piece an admin needs to set up. Since
> this optimization is only applicable for all-flash scenarios, bluestore
> can hide the internal structure and allocate the separate pg log space
> itself, within the data device.
Yes, exactly. Bluestore can handle it inside.
>
>> Allocation bitmap (just in memory)
>>
>> The pglog partition has a bitmap that records whether each block is
>> allocated. We can rebuild it from pg->allocated_blocks_list at startup,
>> so there is no need to persist it on disk. But we will store basic
>> information about the pglog partition in Rocksdb, such as the block size
>> and the number of blocks, when the objectstore is initialized.
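
A minimal sketch of the startup rebuild described above, assuming the
per-pg block lists have already been read back from Rocksdb into a dict.
The function name, the dict layout and the pg ids are hypothetical and
only illustrate the idea; this is not Bluestore code.

```python
# Hypothetical sketch: rebuild the in-memory allocation bitmap at startup
# from the persisted per-pg block lists. All names are illustrative.

def rebuild_bitmap(num_blocks, pg_allocated_blocks):
    """Return a list of booleans, True where a block is owned by some pg."""
    bitmap = [False] * num_blocks
    for pg_id, blocks in pg_allocated_blocks.items():
        for block in blocks:
            if not 0 <= block < num_blocks:
                raise ValueError(f"pg {pg_id}: block {block} out of range")
            if bitmap[block]:
                raise ValueError(f"pg {pg_id}: block {block} allocated twice")
            bitmap[block] = True
    return bitmap


# Example: 8 blocks and two pgs, as they might be read back at startup.
persisted = {"1.0": [0, 3], "1.1": [5]}
bitmap = rebuild_bitmap(8, persisted)
free_blocks = [i for i, used in enumerate(bitmap) if not used]
```

Because the bitmap is derived entirely from the per-pg lists, a crash
cannot leave the two out of sync; the cost is one pass over all pgs at
startup.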
>>
>> Pg -> allocated_blocks_list
>>
>> When a pg is created and IOs start, we can allocate one block for the
>> pg. Every pglog entry is less than 300 bytes, so a 1M block can store
>> at least 3495 entries. When the pg's total number of pglog entries
>> exceeds that number, we can add a new block to the pg.
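
The capacity figure can be checked with a few lines of arithmetic. This
sketch assumes fixed 300-byte slots (the upper bound quoted above) and no
per-block header, neither of which is specified in the design; the
helper name is hypothetical.

```python
# Hypothetical sketch of the block-capacity arithmetic.
BLOCK_SIZE = 1024 * 1024   # one 1M block, as in the proposal
MAX_ENTRY_SIZE = 300       # upper bound on a pglog entry, per the proposal

# Assumption: every entry occupies a fixed 300-byte slot, no block header.
ENTRIES_PER_BLOCK = BLOCK_SIZE // MAX_ENTRY_SIZE


def blocks_needed(num_entries):
    """Blocks a pg needs for num_entries; a pg always holds at least one."""
    full_blocks = -(-num_entries // ENTRIES_PER_BLOCK)  # ceiling division
    return max(1, full_blocks)
```

With these assumptions a pg stays within a single block until it holds
more than ENTRIES_PER_BLOCK entries, at which point a second block is
allocated.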
>>
>> Pg->start_position
>>
>> Record the oldest valid entry per pg.
>>
>> Pg->next_position
>>
>> Record the position of the next entry to add per pg. This value will
>> be updated frequently, but Rocksdb suits this IO pattern, and most of
>> the updates will be merged.
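
To make the two positions concrete, here is a minimal in-memory sketch
of a per-pg ring buffer. The class and method names are hypothetical,
slots stand in for fixed-size on-disk entries, and the extra count field
is an assumption added to tell a full ring from an empty one; the
proposal itself only names start_position and next_position.

```python
# Hypothetical sketch of a per-pg ring buffer; not Bluestore code.

class PgLogRing:
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.start = 0   # pg->start_position: oldest valid entry
        self.next = 0    # pg->next_position: slot for the next append
        self.count = 0   # assumption: distinguishes full from empty

    def append(self, entry):
        if self.count == len(self.slots):
            raise BufferError("ring full: trim or add a block first")
        self.slots[self.next] = entry
        self.next = (self.next + 1) % len(self.slots)
        self.count += 1

    def trim(self, n):
        """Drop up to n of the oldest entries by advancing start."""
        n = min(n, self.count)
        self.start = (self.start + n) % len(self.slots)
        self.count -= n

    def entries(self):
        """Valid entries, oldest first."""
        size = len(self.slots)
        return [self.slots[(self.start + i) % size] for i in range(self.count)]


ring = PgLogRing(4)
for e in ("a", "b", "c"):
    ring.append(e)
ring.trim(2)          # "a" and "b" are no longer valid
ring.append("d")
ring.append("e")      # wraps around into slot 0
```

Only start and next need to be persisted after each change; trimming
never touches the entries themselves.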
>>
>> Updated Bluestore write process:
>>
>> When writing data to disk (before the metadata update), we can append
>> the pglog entry to its ring buffer in parallel.
>> After that, we submit the pg ring buffer changes, such as
>> pg->next_position, together with the other metadata changes to Rocksdb.
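
The ordering above can be sketched as follows. This is a toy model, not
the Bluestore write path: plain dicts and a list stand in for the data
device, Rocksdb and the ring buffer, and the function name and key
layout are hypothetical. It only shows that the data write and the pglog
append run in parallel, and that the metadata is submitted after both
have completed.

```python
# Hypothetical sketch of the proposed write ordering; all names are
# illustrative stand-ins.
from concurrent.futures import ThreadPoolExecutor


def submit_write(data_dev, ring, kv, pg_id, offset, data, log_entry):
    """Write data and pglog entry in parallel, then commit metadata."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        data_write = pool.submit(data_dev.__setitem__, offset, data)
        log_append = pool.submit(ring.append, log_entry)
        # Wait for both; result() re-raises any error from the worker.
        data_write.result()
        log_append.result()
    # Only now submit the metadata, including the new next_position.
    kv[f"pg/{pg_id}/next_position"] = len(ring)
    kv[f"onode/{offset}"] = len(data)


data_dev, ring, kv = {}, [], {}
submit_write(data_dev, ring, kv, "1.0", 4096, b"hello", "entry-1")
```

If a crash happens before the metadata commit, next_position still
points before the new entry, so the partially written entry is simply
ignored on restart.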
>
>
> This sounds good to me.
>
> Josh



-- 
Best wishes
Lisa