Re: Bluestore OSD_DATA, WAL & DB

On 09/25/2017 02:59 PM, Mark Nelson wrote:
> On 09/25/2017 03:31 AM, TYLin wrote:
>> Hi,
>>
>> To my understanding, the bluestore write workflow is:
>>
>> For a normal big write:
>> 1. Write the data to block
>> 2. Commit the metadata update to rocksdb
>> 3. Rocksdb writes to memory and block.wal
>> 4. Once a threshold is reached, flush the entries in block.wal to block.db
>>
>> For overwrites and small writes:
>> 1. Write the data and metadata to rocksdb
>> 2. Apply the data to block
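>>
>> Roughly, the decision between the two paths looks something like this
>> sketch (just my understanding; the threshold name and value here are
>> made up for illustration, not the actual code):
>>
>>     # hypothetical threshold; in bluestore this role is played by
>>     # min_alloc_size / the deferred-write settings
>>     MIN_ALLOC_SIZE = 64 * 1024
>>
>>     def is_deferred_write(offset, length):
>>         """Small or unaligned writes go through rocksdb first (deferred);
>>         big aligned writes go straight to the block device."""
>>         aligned = (offset % MIN_ALLOC_SIZE == 0
>>                    and length % MIN_ALLOC_SIZE == 0)
>>         return length < MIN_ALLOC_SIZE or not aligned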
>>
>> We don’t seem to have a formula or recommendation for the size of
>> block.db. It depends on the object size and the number of objects in
>> your pool. You can just give block.db a big partition to ensure all the
>> database files stay on that fast partition. If block.db fills up, the
>> db files spill over onto block, which slows down db performance. So
>> make the db as big as you can.
> 
> This is basically correct.  What's more, it's not just the object size,
> but the number of extents, checksums, RGW bucket indices, and
> potentially other random stuff.  I'm skeptical about how well we can
> estimate all of this in the long run.  I wonder if we would be better
> served by just focusing on making it easy to understand how the DB
> device is being used and how much is spilling over to the block device,
> and on making it easy to upgrade to a new device once it gets full.
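>
> FWIW, the bluefs perf counters already give a rough picture of DB usage
> and spillover per OSD.  Something like this untested sketch (run on the
> OSD host; assumes the admin socket is available and that the bluefs
> section of perf dump exposes db_used_bytes, db_total_bytes and
> slow_used_bytes):
>
>     # rough sketch: report bluefs DB usage and spillover for one OSD
>     import json, subprocess, sys
>
>     osd_id = sys.argv[1] if len(sys.argv) > 1 else "0"
>     out = subprocess.check_output(
>         ["ceph", "daemon", "osd." + osd_id, "perf", "dump"])
>     bluefs = json.loads(out.decode("utf-8"))["bluefs"]
>
>     gb = 1024.0 ** 3
>     print("db used:   %.1f / %.1f GB" % (bluefs["db_used_bytes"] / gb,
>                                          bluefs["db_total_bytes"] / gb))
>     print("spillover: %.1f GB on the slow device"
>           % (bluefs["slow_used_bytes"] / gb))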
> 
>>
>> If you want to put the wal and db on the same SSD, you don’t need to
>> create block.wal; the wal will implicitly live in block.db. The only
>> case where you need block.wal is when you want to put the wal on a
>> separate device.
> 
> I always make explicit partitions, but only because I (potentially
> illogically) like it that way.  There may actually be some benefits to
> using a single partition for both if sharing a single device.

is this "Single db/wal partition" then to be used for all OSDs on a node
or do you need to create a seperate "Single  db/wal partition" for each
OSD  on the node?

> 
>>
>> I’m also studying bluestore; this is what I know so far. Any
>> corrections are welcome.
>>
>> Thanks
>>
>>
>>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
>>> <richard.hesketh@xxxxxxxxxxxx> wrote:
>>>
>>> I asked the same question a couple of weeks ago. No response I got
>>> contradicted the documentation, but nobody actively confirmed that the
>>> documentation was correct on this subject either. My end state was
>>> that I was relatively confident I wasn't making some horrible mistake
>>> by simply specifying a big DB partition and letting bluestore work
>>> itself out (in my case, I've just got HDDs and SSDs that were
>>> journals under filestore), but I could not be sure there wasn't some
>>> sort of performance tuning I was missing out on by not specifying
>>> them separately.
>>>
>>> Rich
>>>
>>> On 21/09/17 20:37, Benjeman Meekhof wrote:
>>>> Some of this thread seems to contradict the documentation and confuses
>>>> me.  Is the statement below correct?
>>>>
>>>> "The BlueStore journal will always be placed on the fastest device
>>>> available, so using a DB device will provide the same benefit that the
>>>> WAL device would while also allowing additional metadata to be stored
>>>> there (if it will fix)."
>>>>
>>>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
>>>>
>>>>
>>>> It seems to be saying that there's no reason to create separate WAL
>>>> and DB partitions if they are on the same device; specifying one
>>>> large DB partition per OSD will cover both uses.
>>>>
>>>> thanks,
>>>> Ben
>>>>
>>>> On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
>>>> <dietmar.rieder@xxxxxxxxxxx> wrote:
>>>>> On 09/21/2017 05:03 PM, Mark Nelson wrote:
>>>>>>
>>>>>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>>>>>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>>>>>>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm still looking for the answer of these questions. Maybe
>>>>>>>>> someone can
>>>>>>>>> share their thought on these. Any comment will be helpful too.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
>>>>>>>>> <mrxlazuardin@xxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>>     Hi,
>>>>>>>>>
>>>>>>>>>     1. Is it possible to configure osd_data not as a small
>>>>>>>>>     partition on the OSD but as a folder (e.g. on the root disk)?
>>>>>>>>>     If yes, how to do that with ceph-disk, and what are the
>>>>>>>>>     pros/cons of doing that?
>>>>>>>>>     2. Is the WAL & DB size calculated based on OSD size or on
>>>>>>>>>     expected throughput, like the journal device of filestore? If
>>>>>>>>>     not, what is the default value and what are the pros/cons of
>>>>>>>>>     adjusting it?
>>>>>>>>>     3. Does partition alignment matter on Bluestore, including for
>>>>>>>>>     WAL & DB if using a separate device for them?
>>>>>>>>>
>>>>>>>>>     Best regards,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I am also looking for recommendations on wal/db partition sizes.
>>>>>>>> Some hints:
>>>>>>>>
>>>>>>>> ceph-disk defaults, used in case it does not find
>>>>>>>> bluestore_block_wal_size or bluestore_block_db_size in the config
>>>>>>>> file:
>>>>>>>>
>>>>>>>> wal = 512MB
>>>>>>>>
>>>>>>>> db = 1/100 of bluestore_block_size (the data size) if that is set
>>>>>>>> in the config file, otherwise 1GB.
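>>>>>>>>
>>>>>>>> In other words, roughly (my reading of the defaults, not the
>>>>>>>> actual ceph-disk code):
>>>>>>>>
>>>>>>>>     def default_wal_size(conf):
>>>>>>>>         # 512MB unless bluestore_block_wal_size is set
>>>>>>>>         return conf.get("bluestore_block_wal_size", 512 * 1024**2)
>>>>>>>>
>>>>>>>>     def default_db_size(conf):
>>>>>>>>         if "bluestore_block_db_size" in conf:
>>>>>>>>             return conf["bluestore_block_db_size"]
>>>>>>>>         if "bluestore_block_size" in conf:            # data size
>>>>>>>>             return conf["bluestore_block_size"] // 100  # 1/100
>>>>>>>>         return 1 * 1024**3                            # else 1GB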
>>>>>>>>
>>>>>>>> There is also a presentation by Sage back in March, see page 16:
>>>>>>>>
>>>>>>>> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> wal: 512 MB
>>>>>>>>
>>>>>>>> db: "a few" GB
>>>>>>>>
>>>>>>>> The wal size is probably not debatable: it will act like a journal
>>>>>>>> for small block sizes, which are constrained by iops, hence 512 MB
>>>>>>>> is more than enough. Probably we will see more on the db size in
>>>>>>>> the future.
>>>>>>> This is what I understood so far.
>>>>>>> I wonder if it makes sense to set the db size as big as possible,
>>>>>>> i.e. divide the entire db device by the number of OSDs it will serve.
>>>>>>>
>>>>>>> E.g. 10 OSDs / 1 NVMe (800GB):
>>>>>>>
>>>>>>> (800GB - 10x1GB wal) / 10 = ~79GB db size per OSD
>>>>>>>
>>>>>>> Is this smart/stupid?
>>>>>> Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write
>>>>>> amp but mean larger memtables and potentially higher overhead
>>>>>> scanning
>>>>>> through memtables).  4x256MB buffers works pretty well, but it means
>>>>>> memory overhead too.  Beyond that, I'd devote the entire rest of the
>>>>>> device to DB partitions.
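>>>>>>
>>>>>> Back-of-the-envelope, the WAL mostly needs to cover the memtables,
>>>>>> so with the 4x256MB buffers mentioned above you land in that
>>>>>> 512MB-2GB range (rough sketch, not a hard rule):
>>>>>>
>>>>>>     write_buffer_size = 256 * 1024**2    # 256MB per memtable
>>>>>>     max_write_buffer_number = 4          # 4 buffers
>>>>>>     wal_bytes = write_buffer_size * max_write_buffer_number
>>>>>>     print("~%d GB of WAL space" % (wal_bytes // 1024**3))   # ~1 GB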
>>>>>>
>>>>> thanks for your suggestion Mark!
>>>>>
>>>>> So, just to make sure I understood this right:
>>>>>
>>>>> You'd use a separate 512MB-2GB WAL partition for each OSD and the
>>>>> entire rest for DB partitions.
>>>>>
>>>>> In the example case with 10 HDD OSDs and 1 NVMe, it would then be 10
>>>>> WAL partitions of 512MB-2GB each and 10 equal-sized DB partitions
>>>>> consuming the rest of the NVMe.
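>>>>>
>>>>> Just the arithmetic for that layout:
>>>>>
>>>>>     nvme_gb = 800
>>>>>     osds = 10
>>>>>     wal_gb = 1        # per-OSD WAL partition (512MB-2GB range)
>>>>>     db_gb = (nvme_gb - osds * wal_gb) / osds
>>>>>     print("%d x %dGB WAL + %d x %.0fGB DB" % (osds, wal_gb, osds, db_gb))
>>>>>     # -> 10 x 1GB WAL + 10 x 79GB DB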
>>>>>
>>>>>
>>>>> Thanks
>>>>>   Dietmar
>>>>> -- 
>>>>> _________________________________________
>>>>> D i e t m a r  R i e d e r, Mag.Dr.
>>>>> Innsbruck Medical University
>>>>> Biocenter - Division for Bioinformatics
>>>>>
>>>>>
>>>>>


-- 
_________________________________________
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
