On 21/03/2018 at 11:48, Ronny Aasen wrote:
On 21 March 2018 at 11:27, Hervé Ballans wrote:
Hi all,
I have a question regarding a possible scenario for putting both the WAL
and the DB on a separate SSD device, for an OSD node composed of 22 OSDs
(SAS 10k HDDs, 1.8 TB each).
I'm thinking of 2 options (at about the same price):
- add 2 write-intensive SAS SSDs (10 DWPD)
- or add a single 800 GB NVMe SSD (the minimum capacity currently on the
market!)
In both cases, that's a lot of partitions on each SSD, especially with
the second option, where we would have 44 partitions (22 WAL and 22 DB)!
Is this solution workable (I mean in terms of I/O speed), or is it
unsafe despite the high PCIe bus transfer rate?
I just want to talk here about throughput performance, not about data
integrity on the node in case of SSD crashes...
Thanks in advance for your advice,
If you put the WAL and DB on the same device anyway, there is no real
benefit to having a partition for each. The reason you can split them
up is for when you have them on different devices, e.g. DB on SSD but
WAL on NVRAM. It is easier to just colocate WAL and DB into the same
partition, since they live on the same device in your case anyway.
If you have too many OSDs' DBs on the same SSD, you may end up with the
SSD being the bottleneck. Four OSDs' DBs per SSD has been a "golden
rule" on the mailing list for a while; for NVRAM you can possibly have
some more.
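Just to make that concrete, here is a back-of-the-envelope sketch. The
throughput figures and the WAL/DB write fraction below are assumptions
for illustration only, not measurements from your hardware:

# Sketch: can one SSD absorb the WAL/DB traffic of 22 HDD OSDs?
# All figures are assumptions for illustration, not measurements.
num_osds = 22
hdd_write_mbps = 150        # assumed sustained write per 10k SAS HDD
wal_db_fraction = 0.5       # assumed share of client writes hitting WAL/DB
ssd_write_mbps = 1800       # assumed sustained write of one write-intensive SSD

wal_db_load = num_osds * hdd_write_mbps * wal_db_fraction
print(f"WAL/DB load from {num_osds} OSDs: ~{wal_db_load:.0f} MB/s")
print(f"Single SSD write budget: ~{ssd_write_mbps} MB/s")
print("SSD is the bottleneck" if wal_db_load > ssd_write_mbps
      else "SSD keeps up (on paper)")

With different workload assumptions the balance tips quickly, which is
why the "4 OSDs per SSD" rule of thumb is conservative.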
But the bottleneck is only one part of the problem. When the NVMe device
holding the 22 DB partitions dies, it brings down 22 OSDs at once, which
will be a huge pain on your cluster (depending on how large it is...).
I would spread the DBs across more devices to reduce both the bottleneck
and the failure domain in this situation.
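A rough illustration of the failure-domain side, using your disk sizes
(again just arithmetic, no cluster specifics assumed beyond what you
gave):

# Sketch: OSD capacity taken down at once when a shared DB device dies.
osd_size_tb = 1.8
for osds_per_db_device in (22, 11, 4):
    lost_tb = osds_per_db_device * osd_size_tb
    print(f"{osds_per_db_device} OSDs per DB device -> one device failure "
          f"takes out ~{lost_tb:.1f} TB of OSD capacity at once")

All of that capacity has to be re-replicated or rebuilt elsewhere before
the cluster is fully healthy again.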
Hi Ronny,
Thank you for your clear answer.
OK, I'll put both WAL and DB on the same partition; I didn't have this
information, but indeed it seems more attractive in my case (in
particular if I choose the fastest device, i.e. NVMe*).
I plan to have 6 OSD nodes (same configuration for each), but I don't
know yet whether I will use replication (x3) or erasure coding (4+2?)
pools.
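For what it's worth, here is the quick capacity comparison I did for the
two pool types (it ignores near-full ratios and BlueStore overhead, so
the real usable numbers will be lower):

# Sketch: usable capacity of 6 nodes x 22 x 1.8 TB, replica 3 vs EC 4+2.
nodes, osds_per_node, osd_tb = 6, 22, 1.8
raw_tb = nodes * osds_per_node * osd_tb
replica = 3
ec_k, ec_m = 4, 2
print(f"raw capacity: {raw_tb:.0f} TB")
print(f"replica x{replica}:   {raw_tb / replica:.0f} TB usable")
print(f"EC {ec_k}+{ec_m}:       {raw_tb * ec_k / (ec_k + ec_m):.0f} TB usable")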
Also, in both cases, I could possibly accept the loss of a node for a
limited time (replacement of the journal disk + OSD reconfiguration).
But you're right, I will start with a configuration where I spread the
DBs across at least 2 fast disks.
Regards,
Hervé
* Just for information, I am looking closely at the Samsung PM1725 NVMe
PCIe SSD. The (theoretical) technical specifications seem interesting,
especially regarding IOPS: up to 750K IOPS for random read and 120K IOPS
for random write...
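Divided across 22 OSDs, those datasheet figures would still leave a
decent per-OSD margin (simple arithmetic on the datasheet numbers, not a
benchmark):

# Sketch: per-OSD share of the PM1725 datasheet IOPS if 22 OSDs share it.
read_iops, write_iops, num_osds = 750_000, 120_000, 22
print(f"per-OSD share: ~{read_iops // num_osds} read IOPS, "
      f"~{write_iops // num_osds} write IOPS")
# For comparison, a 10k SAS HDD typically delivers on the order of
# a few hundred IOPS.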
kind regards
Ronny Aasen
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com