Re: Bluestore OSD_DATA, WAL & DB

> Op 26 september 2017 om 16:39 schreef Mark Nelson <mnelson@xxxxxxxxxx>:
> 
> 
> 
> 
> On 09/26/2017 01:10 AM, Dietmar Rieder wrote:
> > thanks David,
> >
> > that's confirming what I was assuming. Too bad that there is no
> > estimate/method to calculate the db partition size.
> 
> It's possible that we might be able to get ranges for certain kinds of 
> scenarios.  Maybe if you do lots of small random writes on RBD, you can 
> expect a typical metadata size of X per object.  Or maybe if you do lots 
> of large sequential object writes in RGW, it's more like Y.  I think 
> it's probably going to be tough to make it accurate for everyone though.
> 

So I did a quick test. I wrote 75,000 objects to a BlueStore OSD:

root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
75085
root@alpha:~# 

I then saw that the RocksDB database was about 438MB in size:

root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
459276288
root@alpha:~#

459276288 / 75085 ≈ 6116 bytes

So that is roughly 6kB of RocksDB data per object.
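
For anyone who wants to check their own OSDs, the same ratio can be computed in one go by letting jq do the division (just a quick sketch against my test OSD osd.0; it prints db_used_bytes divided by the onode count):

root@alpha:~# ceph daemon osd.0 perf dump | jq '.bluefs.db_used_bytes / .bluestore.bluestore_onodes'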

Let's say I want to store 1M objects on a single OSD: at roughly 6kB per object that works out to ~6GB of DB space.

Is this a safe assumption? Do you think that 6kB per object is normal? Low? High?

There aren't many of these numbers out there for BlueStore right now, so I'm trying to gather some data points.
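
If others want to contribute numbers, something like the loop below should print them for every OSD on a host. It is only a rough sketch and assumes the default admin socket location /var/run/ceph/ceph-osd.*.asok:

root@alpha:~# for sock in /var/run/ceph/ceph-osd.*.asok; do echo -n "$sock: "; ceph daemon $sock perf dump | jq -c '{onodes: .bluestore.bluestore_onodes, db_used_bytes: .bluefs.db_used_bytes}'; done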

Wido

> Mark
> 
> >
> > Dietmar
> >
> > On 09/25/2017 05:10 PM, David Turner wrote:
> >> db/wal partitions are per OSD.  DB partitions need to be made as big as
> >> you need them.  If they run out of space, they will fall back to the
> >> block device.  If the DB and block are on the same device, then there's
> >> no reason to partition them and figure out the best size.  If they are
> >> on separate devices, then you need to make it as big as necessary to
> >> ensure that it won't spill over (or, if it does, that you're OK with the
> >> degraded performance while the db partition is full).  I haven't come
> >> across an equation to judge what size should be used for either
> >> partition yet.
> >>
> >> On Mon, Sep 25, 2017 at 10:53 AM Dietmar Rieder
> >> <dietmar.rieder@xxxxxxxxxxx> wrote:
> >>
> >>     On 09/25/2017 02:59 PM, Mark Nelson wrote:
> >>     > On 09/25/2017 03:31 AM, TYLin wrote:
> >>     >> Hi,
> >>     >>
> >>     >> To my understanding, the bluestore write workflow is
> >>     >>
> >>     >> For normal big write
> >>     >> 1. Write data to block
> >>     >> 2. Update metadata to rocksdb
> >>     >> 3. Rocksdb write to memory and block.wal
> >>     >> 4. Once reach threshold, flush entries in block.wal to block.db
> >>     >>
> >>     >> For overwrite and small write
> >>     >> 1. Write data and metadata to rocksdb
> >>     >> 2. Apply the data to block
> >>     >>
> >>     >> It seems we don’t have a formula or suggestion for the size of block.db.
> >>     >> It depends on the object size and number of objects in your pool. You
> >>     >> can just give block.db a big partition to ensure all the database
> >>     >> files are on that fast partition. If block.db is full, it will use block
> >>     >> to store db files; however, this will slow down db performance. So
> >>     >> make the db as large as you can.
> >>     >
> >>     > This is basically correct.  What's more, it's not just the object
> >>     size,
> >>     > but the number of extents, checksums, RGW bucket indices, and
> >>     > potentially other random stuff.  I'm skeptical how well we can
> >>     estimate
> >>     > all of this in the long run.  I wonder if we would be better served by
> >>     > just focusing on making it easy to understand how the DB device is
> >>     being
> >>     > used, how much is spilling over to the block device, and make it
> >>     easy to
> >>     > upgrade to a new device once it gets full.
> >>     >
> >>     >>
> >>     >> If you want to put the wal and db on the same ssd, you don’t need to create
> >>     >> block.wal. It will implicitly use block.db for the wal. The only case where
> >>     >> you need block.wal is when you want to put the wal on a separate disk.
> >>     >
> >>     > I always make explicit partitions, but only because I (potentially
> >>     > illogically) like it that way.  There may actually be some benefits to
> >>     > using a single partition for both if sharing a single device.
> >>
> >>     Is this "single db/wal partition" then to be used for all OSDs on a node,
> >>     or do you need to create a separate "single db/wal partition" for each
> >>     OSD on the node?
> >>
> >>     >
> >>     >>
> >>     >> I’m also studying bluestore, this is what I know so far. Any
> >>     >> correction is welcomed.
> >>     >>
> >>     >> Thanks
> >>     >>
> >>     >>
> >>     >>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
> >>     >>> <richard.hesketh@xxxxxxxxxxxx> wrote:
> >>     >>>
> >>     >>> I asked the same question a couple of weeks ago. No response I got
> >>     >>> contradicted the documentation but nobody actively confirmed the
> >>     >>> documentation was correct on this subject, either; my end state was
> >>     >>> that I was relatively confident I wasn't making some horrible
> >>     mistake
> >>     >>> by simply specifying a big DB partition and letting bluestore work
> >>     >>> itself out (in my case, I've just got HDDs and SSDs that were
> >>     >>> journals under filestore), but I could not be sure there wasn't some
> >>     >>> sort of performance tuning I was missing out on by not specifying
> >>     >>> them separately.
> >>     >>>
> >>     >>> Rich
> >>     >>>
> >>     >>> On 21/09/17 20:37, Benjeman Meekhof wrote:
> >>     >>>> Some of this thread seems to contradict the documentation and
> >>     confuses
> >>     >>>> me.  Is the statement below correct?
> >>     >>>>
> >>     >>>> "The BlueStore journal will always be placed on the fastest device
> >>     >>>> available, so using a DB device will provide the same benefit that the
> >>     >>>> WAL device would while also allowing additional metadata to be stored
> >>     >>>> there (if it will fit)."
> >>     >>>>
> >>     >>>>
> >>     http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
> >>     >>>>
> >>     >>>>
> >>     >>>> It seems to be saying that there's no reason to create separate WAL
> >>     >>>> and DB partitions if they are on the same device.  Specifying one
> >>     >>>> large DB partition per OSD will cover both uses.
> >>     >>>>
> >>     >>>> thanks,
> >>     >>>> Ben
> >>     >>>>
> >>     >>>> On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
> >>     >>>> <dietmar.rieder@xxxxxxxxxxx> wrote:
> >>     >>>>> On 09/21/2017 05:03 PM, Mark Nelson wrote:
> >>     >>>>>>
> >>     >>>>>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
> >>     >>>>>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
> >>     >>>>>>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
> >>     >>>>>>>>
> >>     >>>>>>>>> Hi,
> >>     >>>>>>>>>
> >>     >>>>>>>>> I'm still looking for the answer of these questions. Maybe
> >>     >>>>>>>>> someone can
> >>     >>>>>>>>> share their thought on these. Any comment will be helpful too.
> >>     >>>>>>>>>
> >>     >>>>>>>>> Best regards,
> >>     >>>>>>>>>
> >>     >>>>>>>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
> >>     >>>>>>>>> <mrxlazuardin@xxxxxxxxx> wrote:
> >>     >>>>>>>>>
> >>     >>>>>>>>>     Hi,
> >>     >>>>>>>>>
> >>     >>>>>>>>>     1. Is it possible to configure osd_data not as a small partition on the
> >>     >>>>>>>>>     OSD but as a folder (e.g. on the root disk)? If yes, how to do that with
> >>     >>>>>>>>>     ceph-disk, and what are the pros/cons of doing that?
> >>     >>>>>>>>>     2. Is the WAL & DB size calculated based on OSD size or expected
> >>     >>>>>>>>>     throughput, like the journal device of filestore? If not, what is the
> >>     >>>>>>>>>     default value and what are the pros/cons of adjusting it?
> >>     >>>>>>>>>     3. Does partition alignment matter on Bluestore, including WAL & DB
> >>     >>>>>>>>>     if using a separate device for them?
> >>     >>>>>>>>>
> >>     >>>>>>>>>     Best regards,
> >>     >>>>>>>>>
> >>     >>>>>>>>>
> >>     >>>>>>>>
> >>     >>>>>>>>
> >>     >>>>>>>> I am also looking for recommendations on wal/db partition
> >>     sizes.
> >>     >>>>>>>> Some
> >>     >>>>>>>> hints:
> >>     >>>>>>>>
> >>     >>>>>>>> ceph-disk defaults used in case it does not find
> >>     >>>>>>>> bluestore_block_wal_size or bluestore_block_db_size in
> >>     config file:
> >>     >>>>>>>>
> >>     >>>>>>>> wal =  512MB
> >>     >>>>>>>>
> >>     >>>>>>>> db = if bluestore_block_size (data size) is in the config file it
> >>     >>>>>>>> uses 1/100 of it, else it uses 1G.
> >>     >>>>>>>>
> >>     >>>>>>>> There is also a presentation by Sage back in March, see
> >>     page 16:
> >>     >>>>>>>>
> >>     >>>>>>>>
> >>     https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
> >>     >>>>>>>>
> >>     >>>>>>>>
> >>     >>>>>>>>
> >>     >>>>>>>> wal: 512 MB
> >>     >>>>>>>>
> >>     >>>>>>>> db: "a few" GB
> >>     >>>>>>>>
> >>     >>>>>>>> the wal size is probably not debatable; it will act like a journal for
> >>     >>>>>>>> small block sizes, which are constrained by iops, hence 512 MB is more
> >>     >>>>>>>> than enough. Probably we will see more on the db size in the
> >>     >>>>>>>> future.
> >>     >>>>>>> This is what I understood so far.
> >>     >>>>>>> I wonder if it makes sense to set the db size as big as possible and
> >>     >>>>>>> divide the entire db device by the number of OSDs it will serve.
> >>     >>>>>>>
> >>     >>>>>>> E.g. 10 OSDs / 1 NVME (800GB)
> >>     >>>>>>>
> >>     >>>>>>>  (800GB - 10x1GB wal) / 10 = ~79GB db size per OSD
> >>     >>>>>>>
> >>     >>>>>>> Is this smart/stupid?
> >>     >>>>>> Personally I'd use 512MB-2GB for the WAL (larger buffers
> >>     reduce write
> >>     >>>>>> amp but mean larger memtables and potentially higher overhead
> >>     >>>>>> scanning
> >>     >>>>>> through memtables).  4x256MB buffers works pretty well, but
> >>     it means
> >>     >>>>>> memory overhead too.  Beyond that, I'd devote the entire rest
> >>     of the
> >>     >>>>>> device to DB partitions.
> >>     >>>>>>
> >>     >>>>> thanks for your suggestion Mark!
> >>     >>>>>
> >>     >>>>> So, just to make sure I understood this right:
> >>     >>>>>
> >>     >>>>> You'd use a separate 512MB-2GB WAL partition for each OSD and the
> >>     >>>>> entire rest for DB partitions.
> >>     >>>>>
> >>     >>>>> In the example case with 10x HDD OSDs and 1 NVMe it would then be 10 WAL
> >>     >>>>> partitions of 512MB-2GB each and 10 equally sized DB partitions
> >>     >>>>> consuming the rest of the NVMe.
> >>     >>>>>
> >>     >>>>>
> >>     >>>>> Thanks
> >>     >>>>>   Dietmar
> >>     >>>>> --
> >>     >>>>> _________________________________________
> >>     >>>>> D i e t m a r  R i e d e r, Mag.Dr.
> >>     >>>>> Innsbruck Medical University
> >>     >>>>> Biocenter - Division for Bioinformatics
> >>     >>>>>
> >>     >>>>>
> >>     >>>>>
> >>
> >>
> >>     --
> >>     _________________________________________
> >>     D i e t m a r  R i e d e r, Mag.Dr.
> >>     Innsbruck Medical University
> >>     Biocenter - Division for Bioinformatics
> >>
> >>
> >
> >
> >
> >
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



