Hey Mark :)

On 16 August 2017 21:43:34 CEST, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>Hi Mehmet!
>
>On 08/16/2017 11:12 AM, Mehmet wrote:
>> :( no suggestions or recommendations on this?
>>
>> On 14 August 2017 16:50:15 CEST, Mehmet <ceph@xxxxxxxxxx> wrote:
>>
>> Hi friends,
>>
>> my current hardware setup per OSD node is as follows:
>>
>> # 3 OSD nodes with
>> - 2x Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz ==> 12 cores, no
>>   Hyper-Threading
>> - 64GB RAM
>> - 12x 4TB HGST 7K4000 SAS2 (6Gb/s) disks as OSDs
>> - 1x INTEL SSDPEDMD400G4 (Intel DC P3700 NVMe) as journaling device
>>   for 12 disks (20G journal size)
>> - 1x Samsung SSD 840/850 Pro only for the OS
>>
>> # and 1x OSD node with
>> - 1x Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz (10 cores, 20 threads)
>> - 64GB RAM
>> - 23x 2TB TOSHIBA MK2001TRKB SAS2 (6Gb/s) disks as OSDs
>> - 1x SEAGATE ST32000445SS SAS2 (6Gb/s) disk as OSD
>> - 1x INTEL SSDPEDMD400G4 (Intel DC P3700 NVMe) as journaling device
>>   for 24 disks (15G journal size)
>> - 1x Samsung SSD 850 Pro only for the OS
>
>The single P3700 for 23 spinning disks is pushing it. They have high
>write durability, but based on the model that is the 400GB version?

Yes, it is the 400GB version.

>If you are doing a lot of writes you might wear it out pretty fast and

Actually, the Intel isdct tool (EnduranceAnalyzer) says this one should
live for 40 years ^^ But that remains to be proven ;)

>it's a single point of failure for the entire node (if it dies you have
>a lot of data dying with it). Unbalanced setups like this are generally
>also trickier to get performing well.
>

Yes, that is true, and it could happen on all 4 of my nodes. Perhaps the
boss has to see what happens before I can get the money to optimise the
nodes...

>>
>> As you can see, I am using one NVMe device (Intel DC P3700, 400GB),
>> partitioned, for all spinning disks on each OSD node.
>>
>> When Luminous is available (as the next LTS) I plan to switch from
>> filestore to bluestore 😊
>>
>> As far as I have read, bluestore consists of
>> - "the device"
>> - "block-DB": a device that stores RocksDB metadata
>> - "block-WAL": a device that stores the RocksDB write-ahead journal
>>
>> Which setup would be useful in my case?
>> I would set up the disks via "ceph-deploy".
>
>So typically we recommend something like a 1-2GB WAL partition on the
>NVMe drive per OSD and use the remaining space for DB. If you run out
>of DB space, bluestore will start using the spinning disks to store KV
>data instead. I suspect this will still be the advice you will want to
>follow, though at some point having so many WAL and DB partitions on
>the NVMe may start becoming a bottleneck. Something like 63K sequential
>writes to heavily fragmented objects might be worth testing, but in
>most cases I suspect DB and WAL on NVMe is still going to be faster.
>

Thanks, that is what I expected. Another idea would be to replace one
spinning disk per node with an Intel SSD for WAL/DB... perhaps just for
the DBs? (A rough sketch of the NVMe layout is in the P.S. below.)

- Mehmet
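P.S.: To make Mark's suggestion concrete for myself, here is a rough,
untested sketch of how the NVMe on one of the 12-disk nodes could be
carved up (2GB WAL + 28GB DB per OSD, which roughly uses up the 400GB
P3700). The device names, the hostname and the ceph-deploy
--block-db/--block-wal options are assumptions on my part and would
have to be checked against the ceph-deploy version that actually ships
alongside Luminous:

#!/bin/bash
# Rough sketch only -- device names, hostname, and sizes are assumptions
# and must be adapted to the actual node. Double-check before running.
# (sgdisk/partprobe run on the OSD host itself; ceph-deploy is normally
# run from the admin node.)
#
# Layout for one 12-OSD node with a single 400GB P3700 (/dev/nvme0n1):
# 2GB WAL + 28GB DB per OSD => 12 x 30GiB = 360GiB, which leaves a bit
# of headroom on the ~373GiB usable capacity of the 400GB device.

NVME=/dev/nvme0n1            # assumption: the P3700
HOST=ceph-osd-node1          # assumption: hostname as known to ceph-deploy
DATA_DISKS=(/dev/sd{b..m})   # assumption: the 12 HGST spinners

WAL_SIZE=2G
DB_SIZE=28G

part=1
for i in "${!DATA_DISKS[@]}"; do
    # One DB and one WAL partition per OSD on the NVMe.
    sgdisk --new=${part}:0:+${DB_SIZE}  --change-name=${part}:"osd-${i}-db"  "$NVME"
    db_part=$part; part=$((part + 1))
    sgdisk --new=${part}:0:+${WAL_SIZE} --change-name=${part}:"osd-${i}-wal" "$NVME"
    wal_part=$part; part=$((part + 1))
    partprobe "$NVME"   # make sure the kernel sees the new partitions

    # ceph-deploy 2.x accepts --block-db/--block-wal for bluestore OSDs;
    # older 1.5.x releases use a different syntax, so check your version.
    ceph-deploy osd create \
        --bluestore \
        --data "${DATA_DISKS[$i]}" \
        --block-db  "${NVME}p${db_part}" \
        --block-wal "${NVME}p${wal_part}" \
        "$HOST"
done

On the 24-disk node the same split would leave only about 13GB of DB
per OSD (24 x 2GB WAL already takes ~48GB of the ~373GiB), which is
where Mark's point about the single NVMe becoming a bottleneck and a
single point of failure weighs even more.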
>>
>> Thanks in advance for your suggestions!
>> - Mehmet