Thanks David,

that confirms what I was assuming. Too bad that there is no
estimate/method to calculate the db partition size.

Dietmar

On 09/25/2017 05:10 PM, David Turner wrote:
> db/wal partitions are per OSD. DB partitions need to be made as big as
> you need them. If they run out of space, they will fall back to the
> block device. If the DB and block are on the same device, then there's
> no reason to partition them and figure out the best size. If they are
> on separate devices, then you need to make it as big as you need to
> ensure that it won't spill over (or if it does that you're ok with the
> degraded performance while the db partition is full). I haven't come
> across an equation to judge what size should be used for either
> partition yet.
>
> On Mon, Sep 25, 2017 at 10:53 AM Dietmar Rieder
> <dietmar.rieder@xxxxxxxxxxx> wrote:
>
> On 09/25/2017 02:59 PM, Mark Nelson wrote:
> > On 09/25/2017 03:31 AM, TYLin wrote:
> >> Hi,
> >>
> >> To my understanding, the bluestore write workflow is
> >>
> >> For normal big write
> >> 1. Write data to block
> >> 2. Update metadata to rocksdb
> >> 3. Rocksdb write to memory and block.wal
> >> 4. Once the threshold is reached, flush entries in block.wal to block.db
> >>
> >> For overwrite and small write
> >> 1. Write data and metadata to rocksdb
> >> 2. Apply the data to block
> >>
> >> Seems we don’t have a formula or suggestion for the size of block.db.
> >> It depends on the object size and number of objects in your pool. You
> >> can just give a big partition to block.db to ensure all the database
> >> files are on that fast partition. If block.db is full, it will use block
> >> to put db files, however, this will slow down the db performance. So
> >> give db as much size as you can.
> >
> > This is basically correct. What's more, it's not just the object size,
> > but the number of extents, checksums, RGW bucket indices, and
> > potentially other random stuff.
> > I'm skeptical how well we can estimate
> > all of this in the long run. I wonder if we would be better served by
> > just focusing on making it easy to understand how the DB device is being
> > used, how much is spilling over to the block device, and make it easy to
> > upgrade to a new device once it gets full.
> >
> >>
> >> If you want to put wal and db on the same ssd, you don’t need to create
> >> block.wal. It will implicitly use block.db to put wal. The only case
> >> you need block.wal is that you want to separate wal to another disk.
> >
> > I always make explicit partitions, but only because I (potentially
> > illogically) like it that way. There may actually be some benefits to
> > using a single partition for both if sharing a single device.
>
> Is this "Single db/wal partition" then to be used for all OSDs on a node
> or do you need to create a separate "Single db/wal partition" for each
> OSD on the node?
>
> >
> >>
> >> I’m also studying bluestore, this is what I know so far. Any
> >> correction is welcomed.
> >>
> >> Thanks
> >>
> >>
> >>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
> >>> <richard.hesketh@xxxxxxxxxxxx> wrote:
> >>>
> >>> I asked the same question a couple of weeks ago. No response I got
> >>> contradicted the documentation but nobody actively confirmed the
> >>> documentation was correct on this subject, either; my end state was
> >>> that I was relatively confident I wasn't making some horrible mistake
> >>> by simply specifying a big DB partition and letting bluestore work
> >>> itself out (in my case, I've just got HDDs and SSDs that were
> >>> journals under filestore), but I could not be sure there wasn't some
> >>> sort of performance tuning I was missing out on by not specifying
> >>> them separately.
> >>>
> >>> Rich
> >>>
> >>> On 21/09/17 20:37, Benjeman Meekhof wrote:
> >>>> Some of this thread seems to contradict the documentation and confuses
> >>>> me.
> >>>> Is the statement below correct?
> >>>>
> >>>> "The BlueStore journal will always be placed on the fastest device
> >>>> available, so using a DB device will provide the same benefit that the
> >>>> WAL device would while also allowing additional metadata to be stored
> >>>> there (if it will fit)."
> >>>>
> >>>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
> >>>>
> >>>>
> >>>> it seems to be saying that there's no reason to create separate WAL
> >>>> and DB partitions if they are on the same device. Specifying one
> >>>> large DB partition per OSD will cover both uses.
> >>>>
> >>>> thanks,
> >>>> Ben
> >>>>
> >>>> On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
> >>>> <dietmar.rieder@xxxxxxxxxxx> wrote:
> >>>>> On 09/21/2017 05:03 PM, Mark Nelson wrote:
> >>>>>>
> >>>>>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
> >>>>>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
> >>>>>>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I'm still looking for the answer to these questions. Maybe
> >>>>>>>>> someone can share their thoughts on these. Any comment will be
> >>>>>>>>> helpful too.
> >>>>>>>>>
> >>>>>>>>> Best regards,
> >>>>>>>>>
> >>>>>>>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
> >>>>>>>>> <mrxlazuardin@xxxxxxxxx> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> 1. Is it possible to configure osd_data not as a small partition
> >>>>>>>>> on the OSD but as a folder (e.g. on the root disk)? If yes, how
> >>>>>>>>> to do that with ceph-disk and any pros/cons of doing that?
> >>>>>>>>> 2. Is WAL & DB size calculated based on OSD size or expected
> >>>>>>>>> throughput like on the journal device of filestore? If no, what
> >>>>>>>>> is the default value and pros/cons of adjusting that?
> >>>>>>>>> 3. Does partition alignment matter on Bluestore, including
> >>>>>>>>> WAL & DB if using a separate device for them?
> >>>>>>>>>
> >>>>>>>>> Best regards,
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> _______________________________________________
> >>>>>>>>> ceph-users mailing list
> >>>>>>>>> ceph-users@xxxxxxxxxxxxxx
> >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I am also looking for recommendations on wal/db partition sizes.
> >>>>>>>> Some hints:
> >>>>>>>>
> >>>>>>>> ceph-disk defaults used in case it does not find
> >>>>>>>> bluestore_block_wal_size or bluestore_block_db_size in config file:
> >>>>>>>>
> >>>>>>>> wal = 512MB
> >>>>>>>>
> >>>>>>>> db = if bluestore_block_size (data size) is in config file it
> >>>>>>>> uses 1/100 of it else it uses 1G.
> >>>>>>>>
> >>>>>>>> There is also a presentation by Sage back in March, see page 16:
> >>>>>>>>
> >>>>>>>> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
> >>>>>>>>
> >>>>>>>> wal: 512 MB
> >>>>>>>>
> >>>>>>>> db: "a few" GB
> >>>>>>>>
> >>>>>>>> the wal size is probably not debatable, it will be like a
> >>>>>>>> journal for small block sizes which are constrained by iops hence
> >>>>>>>> 512 MB is more than enough. Probably we will see more on the db
> >>>>>>>> size in the future.
> >>>>>>> This is what I understood so far.
> >>>>>>> I wonder if it makes sense to set the db size as big as possible and
> >>>>>>> divide the entire db device by the number of OSDs it will serve.
> >>>>>>>
> >>>>>>> E.g. 10 OSDs / 1 NVME (800GB)
> >>>>>>>
> >>>>>>> (800GB - 10x1GB wal) / 10 = 79GB db size per OSD
> >>>>>>>
> >>>>>>> Is this smart/stupid?
> >>>>>> Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write
> >>>>>> amp but mean larger memtables and potentially higher overhead
> >>>>>> scanning through memtables). 4x256MB buffers works pretty well, but
> >>>>>> it means memory overhead too. Beyond that, I'd devote the entire rest
> >>>>>> of the device to DB partitions.
> >>>>>>
> >>>>> thanks for your suggestion Mark!
> >>>>>
> >>>>> So, just to make sure I understood this right:
> >>>>>
> >>>>> You'd use a separate 512MB-2GB WAL partition for each OSD and the
> >>>>> entire rest for DB partitions.
> >>>>>
> >>>>> In the example case with 10xHDD OSD and 1 NVME it would then be 10 WAL
> >>>>> partitions with each 512MB-2GB and 10 equal sized DB partitions
> >>>>> consuming the rest of the NVME.
> >>>>>
> >>>>>
> >>>>> Thanks
> >>>>> Dietmar
> >>>>> --
> >>>>> _________________________________________
> >>>>> D i e t m a r R i e d e r, Mag.Dr.
> >>>>> Innsbruck Medical University
> >>>>> Biocenter - Division for Bioinformatics
--
_________________________________________
D i e t m a r R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
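
For reference, the sizing arithmetic discussed in this thread can be written down as a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical (they are not Ceph or ceph-disk APIs), sizes are plain GB figures, and the default-size rule is taken from the description quoted above (1/100 of bluestore_block_size if set, else 1G), not from the ceph-disk source.

```python
# Sketch only: helper names are hypothetical, not part of Ceph or ceph-disk.
# All sizes are in GB.

def db_size_per_osd(device_gb, num_osds, wal_gb=1.0):
    """Give each OSD one WAL partition of wal_gb on a shared fast device,
    then split whatever is left evenly into one DB partition per OSD."""
    if num_osds <= 0:
        raise ValueError("num_osds must be positive")
    remaining = device_gb - num_osds * wal_gb
    if remaining <= 0:
        raise ValueError("WAL partitions alone do not fit on the device")
    return remaining / num_osds


def fallback_db_size_gb(block_size_gb=None):
    """DB size fallback as described in the thread (an assumption here):
    1/100 of the data size if it is configured, otherwise 1 GB."""
    if block_size_gb is None:
        return 1.0
    return block_size_gb / 100.0


if __name__ == "__main__":
    # The example from the thread: 10 OSDs sharing one 800 GB NVMe.
    print(db_size_per_osd(800, 10, wal_gb=1))  # 1 GB WAL per OSD
    print(db_size_per_osd(800, 10, wal_gb=2))  # upper end of the 512MB-2GB range
    print(fallback_db_size_gb(4000))           # hypothetical 4 TB data device
```

The sketch says nothing about whether the resulting DB size is large enough to avoid spillover; as noted above, that depends on object counts, extents, checksums and RGW bucket indices.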