Re: Bluestore disk colocation using NVRAM, SSD and SATA

Mark Nelson <mnelson@xxxxxxxxxx> · Thu, 21 Sep 2017 09:17:43 -0500

On 09/21/2017 03:19 AM, Maged Mokhtar wrote:
On 2017-09-21 10:01, Dietmar Rieder wrote:

Hi,

I'm in the same situation (NVMes, SSDs, SAS HDDs) and have been asking
myself the same questions.
For now I have decided to use the NVMes as wal and db devices for the SAS
HDDs, and to colocate wal and db on the SSDs themselves.

However, I'm still wondering whether I should change the default sizes of
wal and db and, if so, to what size.

Dietmar

On 09/21/2017 01:18 AM, Alejandro Comisario wrote:
But, for example, on the same server I have three disk technologies to
deploy pools on: SSD, SAS and SATA.
The NVMes were bought only with the journals for SATA and SAS in mind,
since the journals for the SSDs were colocated.

But now, in exactly the same scenario, should I rely on the NVMe for the SSD
pool? Is there that much of a gain compared with colocating block.* on
the same SSD?

best.

On Wed, Sep 20, 2017 at 6:36 PM, Nigel Williams
<nigel.williams@xxxxxxxxxxx> wrote:

    On 21 September 2017 at 04:53, Maximiliano Venesio
    <massimo@xxxxxxxxxxx> wrote:

        Hi guys, I'm reading different documents about bluestore, and none
        of them recommends using NVRAM to store the bluefs db.
        Nevertheless, the official documentation says that it is better to
        put the block.db on the faster device.

    Likely not mentioned since no one yet has had the opportunity to
    test it.

        So how should I deploy using bluestore, and where should I put
        block.wal and block.db?

    block.* would be best on your NVRAM device, like this:

    ceph-deploy osd create --bluestore c0osd-136:/dev/sda \
        --block-wal /dev/nvme0n1 --block-db /dev/nvme0n1

    _______________________________________________
    ceph-users mailing list
    ceph-users@xxxxxxxxxxxxxx
    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
*Alejandro Comisario*
*CTO | NUBELIU*
E-mail: alejandro@xxxxxxxxxxx
Cell: +54 9 11 3770 1857
www.nubeliu.com

My guess for the wal: you are dealing with a two-step I/O operation, so if
it is colocated on your SSDs, their IOPS for small writes will be
halved. The decision is this: if you add a small NVMe as the wal device
for 4 or 5 (large) SSDs, you will double their IOPS for small writes. This
is not the case for the db.
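
The arithmetic behind that guess can be sketched as follows. This is only a rough model: it assumes every small client write costs one wal write plus one data write of similar cost, and the IOPS figure is a made-up placeholder, not a measurement.

```python
def small_write_iops(device_iops: float, wal_colocated: bool) -> float:
    """Rough small-write IOPS of one SSD OSD under the two-step write model."""
    # With the wal on the same SSD, each client write turns into two device
    # writes (wal, then data). With the wal on a separate device, the SSD
    # only absorbs the data write.
    device_writes_per_client_write = 2 if wal_colocated else 1
    return device_iops / device_writes_per_client_write


ssd_iops = 20_000  # hypothetical raw small-write IOPS of one SSD

colocated = small_write_iops(ssd_iops, wal_colocated=True)
separate = small_write_iops(ssd_iops, wal_colocated=False)
print(f"wal on the SSD itself: {colocated:.0f} IOPS")
print(f"wal on a separate NVMe: {separate:.0f} IOPS")
```

The model ignores whether the shared NVMe can keep up with the wal traffic of all the SSDs behind it, which would need to be checked separately.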

For wal size: 512 MB is recommended (the ceph-disk default).

For db size: a "few" GB; 10 GB is probably a good number. I guess we will
hear more in the future.
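
As a quick sanity check of what those sizes mean for the shared device, here is a minimal sketch that adds up the wal and db partitions for a number of OSDs. The 512 MB and 10 GB figures are the ones suggested above; the OSD count is a hypothetical example.

```python
WAL_MB = 512  # wal size suggested above
DB_GB = 10    # db size suggested above


def fast_space_needed_gb(num_osds: int, wal_mb: float = WAL_MB,
                         db_gb: float = DB_GB) -> float:
    """Total space (in GB) needed on the fast device for wal + db partitions."""
    per_osd_gb = wal_mb / 1024 + db_gb
    return num_osds * per_osd_gb


# Hypothetical example: one NVMe serving the wal and db of 5 OSDs.
print(fast_space_needed_gb(5))
```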

There's a pretty good chance that if you are writing out lots of small
RGW or rados objects you'll blow past 10GB of metadata once rocksdb
space-amp is factored in.  I can pretty routinely do it when writing out
millions of rados objects per OSD.  Bluestore will switch to writing
metadata out to the block disk, and in this case it might not be that bad
of a transition (NVMe to SSD).  If you have spare room, you might as
well give the DB partition whatever you have available on the device.  A
harder question is how much fast storage to buy for the WAL/DB.  It's
not straightforward, and rocksdb can be tuned in various ways to favor
reducing space/write/read amplification, but not all 3 at once.  Right
now we are likely favoring reducing write-amplification over space/read
amp, but one could imagine that with a small amount of incredibly fast
storage it might be better to favor reducing space-amp.
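
A back-of-the-envelope sketch of the two points above: how large a DB partition each OSD gets if the whole fast device is handed out, and whether an estimated metadata footprint would outgrow a given DB partition. The per-object metadata size and the space-amp factor are purely hypothetical inputs chosen for illustration; real values depend on the workload and the rocksdb tuning.

```python
GIB = 1024 ** 3


def db_partition_gb(device_gb: float, num_osds: int, wal_gb: float = 0.5) -> float:
    """DB partition size per OSD when the whole fast device is split evenly."""
    return device_gb / num_osds - wal_gb


def metadata_footprint_gb(num_objects: int, bytes_per_object: int,
                          space_amp: float) -> float:
    """Estimated on-disk metadata size once space amplification is applied."""
    return num_objects * bytes_per_object * space_amp / GIB


# Hypothetical inputs: 5 million objects per OSD, 2 KiB of metadata each,
# and a space-amp factor of 2.
footprint = metadata_footprint_gb(5_000_000, 2048, 2.0)
for db_gb in (10, db_partition_gb(400, 5)):
    verdict = "spills over to the block device" if footprint > db_gb else "fits"
    print(f"db partition {db_gb:.1f} GB, metadata ~{footprint:.1f} GB: {verdict}")
```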

Mark

Maged Mokhtar
