Re: Bluestore OSD_DATA, WAL & DB

thanks David,

that confirms what I was assuming. Too bad that there is no
estimate or method for calculating the db partition size.

Dietmar

On 09/25/2017 05:10 PM, David Turner wrote:
> db/wal partitions are per OSD.  DB partitions need to be made as big as
> you need them.  If they run out of space, they will fall back to the
> block device.  If the DB and block are on the same device, then there's
> no reason to partition them and figure out the best size.  If they are
> on separate devices, then you need to make the DB partition big enough
> that it won't spill over (or, if it does, that you're OK with the
> degraded performance while the db partition is full).  I haven't come
> across an equation to judge what size should be used for either
> partition yet.
> 
> On Mon, Sep 25, 2017 at 10:53 AM Dietmar Rieder
> <dietmar.rieder@xxxxxxxxxxx> wrote:
> 
>     On 09/25/2017 02:59 PM, Mark Nelson wrote:
>     > On 09/25/2017 03:31 AM, TYLin wrote:
>     >> Hi,
>     >>
>     >> To my understanding, the bluestore write workflow is:
>     >>
>     >> For a normal big write:
>     >> 1. Write data to block
>     >> 2. Update metadata in rocksdb
>     >> 3. Rocksdb writes to memory and block.wal
>     >> 4. Once a threshold is reached, flush entries in block.wal to block.db
>     >>
>     >> For an overwrite or small write:
>     >> 1. Write data and metadata to rocksdb
>     >> 2. Apply the data to block
>     >>
>     >> It seems we don’t have a formula or recommendation for the size of
>     >> block.db. It depends on the object size and the number of objects in
>     >> your pool. You can just give block.db a big partition to ensure all
>     >> the database files are on that fast partition. If block.db is full,
>     >> db files will be placed on block instead; however, this will slow
>     >> down db performance. So make the db as large as you can.
>     >
>     > This is basically correct.  What's more, it's not just the object
>     > size, but the number of extents, checksums, RGW bucket indices, and
>     > potentially other random stuff.  I'm skeptical how well we can
>     > estimate all of this in the long run.  I wonder if we would be better
>     > served by just focusing on making it easy to understand how the DB
>     > device is being used, how much is spilling over to the block device,
>     > and make it easy to upgrade to a new device once it gets full.
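
A rough sketch of how the "how much is spilling over" question could be checked from an OSD's perf counters. It assumes the JSON from `ceph daemon osd.<id> perf dump` has a `bluefs` section with `db_total_bytes`, `db_used_bytes` and `slow_used_bytes`; counter names may differ between releases, so verify them on your cluster. The function name and the sample values are made up for illustration.

```python
import json


def bluefs_spillover(perf_dump: dict) -> dict:
    """Summarise BlueFS DB usage from one OSD's parsed perf dump.

    Assumes a "bluefs" section with db_total_bytes, db_used_bytes and
    slow_used_bytes (names are an assumption; check your release).
    """
    bluefs = perf_dump.get("bluefs", {})
    db_total = int(bluefs.get("db_total_bytes", 0))
    db_used = int(bluefs.get("db_used_bytes", 0))
    slow_used = int(bluefs.get("slow_used_bytes", 0))
    return {
        # Fraction of the fast DB partition currently in use.
        "db_used_fraction": db_used / db_total if db_total else 0.0,
        # Bytes of DB data that landed on the slow (block) device.
        "spilled_bytes": slow_used,
        "spilled": slow_used > 0,
    }


# Hypothetical sample, not real cluster output.
sample = json.loads(
    '{"bluefs": {"db_total_bytes": 1000, "db_used_bytes": 950,'
    ' "slow_used_bytes": 200}}'
)
report = bluefs_spillover(sample)
print(report)
```

In practice the dict would come from parsing the perf dump of each OSD rather than from a literal string.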
>     >
>     >>
>     >> If you want to put wal and db on the same ssd, you don’t need to
>     >> create block.wal. The wal will implicitly be placed in block.db. The
>     >> only case where you need block.wal is when you want to put the wal on
>     >> a separate disk.
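
The placement rule described here can be written down as a small decision function. This is only a sketch of the rule as stated in this thread, not of what ceph-disk actually does; the function name and device names are hypothetical.

```python
def plan_partitions(block_dev, db_dev=None, wal_dev=None):
    """Return which extra BlueStore partitions to create for one OSD.

    Rule of thumb from the thread:
    - DB on the same device as block: no separate block.db needed.
    - WAL on the same device as DB: no separate block.wal needed,
      because the WAL is placed inside block.db implicitly.
    """
    db_dev = db_dev or block_dev   # DB defaults to the data device
    wal_dev = wal_dev or db_dev    # WAL defaults to wherever the DB lives
    parts = []
    if db_dev != block_dev:
        parts.append("block.db")
    if wal_dev != db_dev:
        parts.append("block.wal")
    return parts


# Hypothetical device names, for illustration only.
print(plan_partitions("sdb"))                        # []
print(plan_partitions("sdb", db_dev="nvme0n1"))      # ['block.db']
print(plan_partitions("sdb", "sdc", "nvme0n1"))      # ['block.db', 'block.wal']
```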
>     >
>     > I always make explicit partitions, but only because I (potentially
>     > illogically) like it that way.  There may actually be some benefits to
>     > using a single partition for both if sharing a single device.
> 
>     Is this "single db/wal partition" then to be used for all OSDs on a node,
>     or do you need to create a separate "single db/wal partition" for each
>     OSD on the node?
> 
>     >
>     >>
>     >> I’m also studying bluestore; this is what I know so far. Any
>     >> correction is welcome.
>     >>
>     >> Thanks
>     >>
>     >>
>     >>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
>     >>> <richard.hesketh@xxxxxxxxxxxx> wrote:
>     >>>
>     >>> I asked the same question a couple of weeks ago. No response I got
>     >>> contradicted the documentation but nobody actively confirmed the
>     >>> documentation was correct on this subject, either; my end state was
>     >>> that I was relatively confident I wasn't making some horrible
>     >>> mistake by simply specifying a big DB partition and letting
>     >>> bluestore work itself out (in my case, I've just got HDDs and SSDs
>     >>> that were journals under filestore), but I could not be sure there
>     >>> wasn't some sort of performance tuning I was missing out on by not
>     >>> specifying them separately.
>     >>>
>     >>> Rich
>     >>>
>     >>> On 21/09/17 20:37, Benjeman Meekhof wrote:
>     >>>> Some of this thread seems to contradict the documentation and
>     >>>> confuses me.  Is the statement below correct?
>     >>>>
>     >>>> "The BlueStore journal will always be placed on the fastest device
>     >>>> available, so using a DB device will provide the same benefit that
>     >>>> the WAL device would while also allowing additional metadata to be
>     >>>> stored there (if it will fit)."
>     >>>>
>     >>>>
>     >>>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
>     >>>>
>     >>>> It seems to be saying that there's no reason to create separate
>     >>>> WAL and DB partitions if they are on the same device.  Specifying
>     >>>> one large DB partition per OSD will cover both uses.
>     >>>>
>     >>>> thanks,
>     >>>> Ben
>     >>>>
>     >>>> On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
>     >>>> <dietmar.rieder@xxxxxxxxxxx> wrote:
>     >>>>> On 09/21/2017 05:03 PM, Mark Nelson wrote:
>     >>>>>>
>     >>>>>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>     >>>>>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>     >>>>>>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
>     >>>>>>>>
>     >>>>>>>>> Hi,
>     >>>>>>>>>
>     >>>>>>>>> I'm still looking for answers to these questions. Maybe
>     >>>>>>>>> someone can share their thoughts on these. Any comment will
>     >>>>>>>>> be helpful too.
>     >>>>>>>>>
>     >>>>>>>>> Best regards,
>     >>>>>>>>>
>     >>>>>>>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
>     >>>>>>>>> <mrxlazuardin@xxxxxxxxx> wrote:
>     >>>>>>>>>
>     >>>>>>>>>     Hi,
>     >>>>>>>>>
>     >>>>>>>>>     1. Is it possible to configure osd_data not as a small
>     >>>>>>>>>     partition on the OSD but as a folder (e.g. on the root
>     >>>>>>>>>     disk)? If yes, how do I do that with ceph-disk, and what
>     >>>>>>>>>     are the pros/cons of doing that?
>     >>>>>>>>>     2. Is the WAL & DB size calculated based on OSD size or
>     >>>>>>>>>     expected throughput, like the journal device of filestore?
>     >>>>>>>>>     If not, what is the default value and what are the
>     >>>>>>>>>     pros/cons of adjusting it?
>     >>>>>>>>>     3. Does partition alignment matter on Bluestore, including
>     >>>>>>>>>     for WAL & DB if using a separate device for them?
>     >>>>>>>>>
>     >>>>>>>>>     Best regards,
>     >>>>>>>>>
>     >>>>>>>>>
>     >>>>>>>>> _______________________________________________
>     >>>>>>>>> ceph-users mailing list
>     >>>>>>>>> ceph-users@xxxxxxxxxxxxxx
>     >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>     >>>>>>>>
>     >>>>>>>>
>     >>>>>>>> I am also looking for recommendations on wal/db partition sizes.
>     >>>>>>>> Some hints:
>     >>>>>>>>
>     >>>>>>>> ceph-disk defaults, used if it does not find
>     >>>>>>>> bluestore_block_wal_size or bluestore_block_db_size in the
>     >>>>>>>> config file:
>     >>>>>>>>
>     >>>>>>>> wal = 512MB
>     >>>>>>>>
>     >>>>>>>> db = if bluestore_block_size (data size) is in the config file,
>     >>>>>>>> it uses 1/100 of it; otherwise it uses 1G.
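
The defaults described above can be restated as a short function. This only mirrors the rule as given in this message and was not checked against the ceph-disk source; the function name is hypothetical, and treating "MB" and "G" as binary units (MiB, GiB) is an assumption.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def ceph_disk_default_sizes(conf: dict) -> dict:
    """Return WAL and DB sizes in bytes, following the rule quoted above.

    conf maps option names to byte values, as they would appear in the
    config file. Unit interpretation (MiB/GiB) is an assumption.
    """
    # WAL: explicit setting if present, otherwise 512 MB.
    wal = conf.get("bluestore_block_wal_size", 512 * MIB)

    if "bluestore_block_db_size" in conf:
        db = conf["bluestore_block_db_size"]
    elif "bluestore_block_size" in conf:
        # DB: 1/100 of the configured data size.
        db = conf["bluestore_block_size"] // 100
    else:
        db = 1 * GIB
    return {"wal": wal, "db": db}


print(ceph_disk_default_sizes({}))
print(ceph_disk_default_sizes({"bluestore_block_size": 4000 * GIB}))
```

With nothing configured this gives a 512 MB WAL and a 1 GB DB, which is small for a large OSD; that is part of why the sizing question keeps coming up.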
>     >>>>>>>>
>     >>>>>>>> There is also a presentation by Sage back in March, see page 16:
>     >>>>>>>>
>     >>>>>>>> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
>     >>>>>>>>
>     >>>>>>>> wal: 512 MB
>     >>>>>>>>
>     >>>>>>>> db: "a few" GB
>     >>>>>>>>
>     >>>>>>>> The wal size is probably not debatable: it will be like a
>     >>>>>>>> journal for small block sizes, which are constrained by iops,
>     >>>>>>>> hence 512 MB is more than enough. Probably we will see more on
>     >>>>>>>> the db size in the future.
>     >>>>>>> This is what I understood so far.
>     >>>>>>> I wonder if it makes sense to set the db size as big as possible
>     >>>>>>> and divide the entire db device by the number of OSDs it will
>     >>>>>>> serve.
>     >>>>>>>
>     >>>>>>> E.g. 10 OSDs / 1 NVME (800GB)
>     >>>>>>>
>     >>>>>>>  (800GB - 10x1GB wal) / 10 = 79GB db size per OSD
>     >>>>>>>
>     >>>>>>> Is this smart/stupid?
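
The arithmetic in this example, written out so other device sizes or WAL sizes can be tried. A minimal sketch only: the function name is hypothetical, and it ignores partition-table overhead and alignment.

```python
def db_size_per_osd(device_gb, num_osds, wal_gb_per_osd):
    """Split a shared fast device: reserve a WAL per OSD, then divide
    the remainder equally into DB partitions. All sizes in GB."""
    if num_osds <= 0:
        raise ValueError("num_osds must be positive")
    remaining = device_gb - num_osds * wal_gb_per_osd
    if remaining <= 0:
        raise ValueError("WAL partitions alone exceed the device size")
    return remaining / num_osds


# The example from the thread: 10 OSDs sharing one 800GB NVME, 1GB WAL each.
print(db_size_per_osd(800, 10, 1))    # 79.0
# Same device with 512MB or 2GB WAL partitions instead.
print(db_size_per_osd(800, 10, 0.5))  # 79.5
print(db_size_per_osd(800, 10, 2))    # 78.0
```

The WAL size barely moves the result, so on a device this large the choice between 512MB and 2GB costs about 1.5GB of DB space per OSD.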
>     >>>>>> Personally I'd use 512MB-2GB for the WAL (larger buffers reduce
>     >>>>>> write amp but mean larger memtables and potentially higher
>     >>>>>> overhead scanning through memtables).  4x256MB buffers works
>     >>>>>> pretty well, but it means memory overhead too.  Beyond that, I'd
>     >>>>>> devote the entire rest of the device to DB partitions.
>     >>>>>>
>     >>>>> thanks for your suggestion Mark!
>     >>>>>
>     >>>>> So, just to make sure I understood this right:
>     >>>>>
>     >>>>> You'd use a separate 512MB-2GB WAL partition for each OSD and the
>     >>>>> entire rest for DB partitions.
>     >>>>>
>     >>>>> In the example case with 10xHDD OSDs and 1 NVME it would then be
>     >>>>> 10 WAL partitions of 512MB-2GB each and 10 equal-sized DB
>     >>>>> partitions consuming the rest of the NVME.
>     >>>>>
>     >>>>>
>     >>>>> Thanks
>     >>>>>   Dietmar
>     >>>>> --
>     >>>>> _________________________________________
>     >>>>> D i e t m a r  R i e d e r, Mag.Dr.
>     >>>>> Innsbruck Medical University
>     >>>>> Biocenter - Division for Bioinformatics
> 


-- 
_________________________________________
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
