On 09/25/2017 02:59 PM, Mark Nelson wrote:
> On 09/25/2017 03:31 AM, TYLin wrote:
>> Hi,
>>
>> To my understanding, the BlueStore write workflow is:
>>
>> For a normal big write:
>> 1. Write data to block
>> 2. Update metadata in RocksDB
>> 3. RocksDB writes to memory and block.wal
>> 4. Once a threshold is reached, flush entries from block.wal to block.db
>>
>> For overwrites and small writes:
>> 1. Write data and metadata to RocksDB
>> 2. Apply the data to block
>>
>> It seems we don't have a formula or suggestion for the size of
>> block.db. It depends on the object size and the number of objects in
>> your pool. You can just give a big partition to block.db to ensure
>> all the database files are on that fast partition. If block.db is
>> full, it will use block to hold db files; however, this will slow
>> down db performance. So give the db as much space as you can.
>
> This is basically correct. What's more, it's not just the object size,
> but the number of extents, checksums, RGW bucket indices, and
> potentially other random stuff. I'm skeptical how well we can estimate
> all of this in the long run. I wonder if we would be better served by
> just focusing on making it easy to understand how the DB device is
> being used, how much is spilling over to the block device, and making
> it easy to upgrade to a new device once it gets full.
>
>> If you want to put the wal and db on the same SSD, you don't need to
>> create block.wal; it will implicitly use block.db to hold the wal.
>> The only case where you need block.wal is when you want to put the
>> wal on a separate disk.
>
> I always make explicit partitions, but only because I (potentially
> illogically) like it that way. There may actually be some benefits to
> using a single partition for both if sharing a single device.

Is this "single db/wal partition" then to be used for all OSDs on a
node, or do you need to create a separate "single db/wal partition" for
each OSD on the node?

>> I'm also studying BlueStore; this is what I know so far. Any
>> corrections are welcome.
>>
>> Thanks
>>
>>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
>>> <richard.hesketh@xxxxxxxxxxxx> wrote:
>>>
>>> I asked the same question a couple of weeks ago. No response I got
>>> contradicted the documentation, but nobody actively confirmed the
>>> documentation was correct on this subject either; my end state was
>>> that I was relatively confident I wasn't making some horrible
>>> mistake by simply specifying a big DB partition and letting
>>> bluestore work itself out (in my case, I've just got HDDs and SSDs
>>> that were journals under filestore), but I could not be sure there
>>> wasn't some sort of performance tuning I was missing out on by not
>>> specifying them separately.
>>>
>>> Rich
>>>
>>> On 21/09/17 20:37, Benjeman Meekhof wrote:
>>>> Some of this thread seems to contradict the documentation and
>>>> confuses me. Is the statement below correct?
>>>>
>>>> "The BlueStore journal will always be placed on the fastest device
>>>> available, so using a DB device will provide the same benefit that
>>>> the WAL device would while also allowing additional metadata to be
>>>> stored there (if it will fit)."
>>>>
>>>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices
>>>>
>>>> It seems to be saying that there's no reason to create separate WAL
>>>> and DB partitions if they are on the same device. Specifying one
>>>> large DB partition per OSD will cover both uses.
>>>>
>>>> thanks,
>>>> Ben
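
Coming back to Mark's point above about making it easy to see how the
DB device is being used and how much is spilling over to the block
device: the bluefs counters in the OSD's perf dump should already give
a rough picture. Below is a minimal Python sketch of that check. It
assumes the bluefs counter names (db_total_bytes, db_used_bytes,
slow_used_bytes) reported by "ceph daemon osd.<id> perf dump" on the
OSD host; verify the names on your Ceph version before relying on it.

#!/usr/bin/env python
# Rough sketch: report how much RocksDB data sits on the fast DB device
# and how much has spilled over onto the slow block device, using the
# bluefs section of the OSD admin socket perf dump. Counter names are
# assumed from the bluefs perf counters; adjust if your version differs.
import json
import subprocess
import sys

def bluefs_usage(osd_id):
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"])
    bluefs = json.loads(out)["bluefs"]
    gib = 1024.0 ** 3
    return (bluefs["db_used_bytes"] / gib,
            bluefs["db_total_bytes"] / gib,
            bluefs["slow_used_bytes"] / gib)  # DB data on the slow device

if __name__ == "__main__":
    osd_id = int(sys.argv[1])
    db_used, db_total, slow_used = bluefs_usage(osd_id)
    print("osd.%d: db %.1f/%.1f GiB used, %.1f GiB spilled to slow device"
          % (osd_id, db_used, db_total, slow_used))

Run it on the OSD node (e.g. "python check_spillover.py 0"); a non-zero
spillover figure means block.db was too small and BlueFS is already
placing DB files on the data device.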
>>>> On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
>>>> <dietmar.rieder@xxxxxxxxxxx> wrote:
>>>>> On 09/21/2017 05:03 PM, Mark Nelson wrote:
>>>>>> On 09/21/2017 03:17 AM, Dietmar Rieder wrote:
>>>>>>> On 09/21/2017 09:45 AM, Maged Mokhtar wrote:
>>>>>>>> On 2017-09-21 07:56, Lazuardi Nasution wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm still looking for the answers to these questions. Maybe
>>>>>>>>> someone can share their thoughts on these. Any comment will
>>>>>>>>> be helpful too.
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>>
>>>>>>>>> On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
>>>>>>>>> <mrxlazuardin@xxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> 1. Is it possible to configure osd_data not as a small
>>>>>>>>> partition on the OSD but as a folder (e.g. on the root disk)?
>>>>>>>>> If yes, how is that done with ceph-disk, and what are the
>>>>>>>>> pros/cons of doing it?
>>>>>>>>> 2. Are the WAL & DB sizes calculated based on OSD size or on
>>>>>>>>> expected throughput, as with the journal device of filestore?
>>>>>>>>> If not, what are the default values and the pros/cons of
>>>>>>>>> adjusting them?
>>>>>>>>> 3. Does partition alignment matter on BlueStore, including for
>>>>>>>>> the WAL & DB when using a separate device for them?
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> I am also looking for recommendations on wal/db partition
>>>>>>>> sizes. Some hints:
>>>>>>>>
>>>>>>>> ceph-disk defaults, used in case it does not find
>>>>>>>> bluestore_block_wal_size or bluestore_block_db_size in the
>>>>>>>> config file:
>>>>>>>>
>>>>>>>> wal = 512MB
>>>>>>>>
>>>>>>>> db = if bluestore_block_size (data size) is in the config file
>>>>>>>> it uses 1/100 of it, else it uses 1GB.
>>>>>>>>
>>>>>>>> There is also a presentation by Sage back in March, see page 16:
>>>>>>>>
>>>>>>>> https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in
>>>>>>>>
>>>>>>>> wal: 512 MB
>>>>>>>>
>>>>>>>> db: "a few" GB
>>>>>>>>
>>>>>>>> The wal size is probably not debatable; it will act like a
>>>>>>>> journal for small block sizes, which are constrained by IOPS,
>>>>>>>> hence 512 MB is more than enough. Probably we will see more on
>>>>>>>> the db size in the future.
>>>>>>> This is what I understood so far.
>>>>>>> I wonder if it makes sense to set the db size as big as possible
>>>>>>> and divide the entire db device by the number of OSDs it will
>>>>>>> serve.
>>>>>>>
>>>>>>> E.g. 10 OSDs / 1 NVMe (800GB)
>>>>>>>
>>>>>>> (800GB - 10x1GB wal) / 10 = ~79GB db size per OSD
>>>>>>>
>>>>>>> Is this smart/stupid?
>>>>>> Personally I'd use 512MB-2GB for the WAL (larger buffers reduce
>>>>>> write amp but mean larger memtables and potentially higher
>>>>>> overhead scanning through memtables). 4x256MB buffers works
>>>>>> pretty well, but it means memory overhead too. Beyond that, I'd
>>>>>> devote the entire rest of the device to DB partitions.
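
To make that arithmetic concrete, here is a tiny Python sketch of the
sizing rule being discussed: a fixed WAL partition per OSD plus an
equal share of the remaining fast device for each DB partition,
alongside the ceph-disk defaults Maged describes. The helper names
(plan_partitions, ceph_disk_defaults) are made up for illustration;
this is not code that ceph-disk itself runs.

#!/usr/bin/env python
# Back-of-the-envelope sizing sketch, not anything ceph-disk executes.

def ceph_disk_defaults(block_size_gb=None):
    """Defaults as described above: 512 MB wal; db = 1/100 of
    bluestore_block_size if it is set, else 1 GB."""
    wal_gb = 0.5
    db_gb = block_size_gb / 100.0 if block_size_gb else 1.0
    return wal_gb, db_gb

def plan_partitions(device_gb, num_osds, wal_gb=1.0):
    """Fixed WAL per OSD, equal split of the remainder for the DBs."""
    db_gb = (device_gb - num_osds * wal_gb) / float(num_osds)
    if db_gb <= 0:
        raise ValueError("fast device too small for %d OSDs" % num_osds)
    return wal_gb, db_gb

if __name__ == "__main__":
    # The 10 OSD / 800 GB NVMe example: 1 GB WAL + ~79 GB DB per OSD.
    wal, db = plan_partitions(device_gb=800, num_osds=10, wal_gb=1.0)
    print("per OSD: wal = %.1f GB, db = %.1f GB" % (wal, db))
    # ceph-disk defaults with a 4 TB data device: 0.5 GB wal, 40 GB db.
    print("defaults: wal = %.1f GB, db = %.1f GB" % ceph_disk_defaults(4000))

With Mark's suggestion you would simply set wal_gb somewhere between
0.5 and 2 and give everything left over to the DB partitions, one per
OSD.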
>>>>> thanks for your suggestion Mark!
>>>>>
>>>>> So, just to make sure I understood this right:
>>>>>
>>>>> You'd use a separate 512MB-2GB WAL partition for each OSD and the
>>>>> entire rest for DB partitions.
>>>>>
>>>>> In the example case with 10x HDD OSDs and 1 NVMe it would then be
>>>>> 10 WAL partitions of 512MB-2GB each and 10 equally sized DB
>>>>> partitions consuming the rest of the NVMe.
>>>>>
>>>>> Thanks
>>>>> Dietmar

--
_________________________________________
D i e t m a r R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com