Re: Ceph performances

Hello Hugo,

Yes, you're right. With Sebastien Han's fio command I managed to see that my disks can actually handle 100k IOPS, so the theoretical value is then: 2 x 2 x 100k / 2 = 200k.
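For reference, something along these lines (a minimal sketch, Python wrapping fio in the spirit of Sebastien Han's journal test with direct, synchronous 4k writes; the flags, job count and device path here are assumptions, not the exact command used):

import subprocess

def journal_write_test(device, numjobs=4, runtime=60):
    """Small synchronous-write benchmark on a raw device (destructive!)."""
    cmd = [
        "fio",
        f"--filename={device}",
        "--direct=1", "--sync=1",          # bypass page cache, sync each write
        "--rw=write", "--bs=4k",           # 4k writes, journal-like workload
        f"--numjobs={numjobs}", "--iodepth=1",
        f"--runtime={runtime}", "--time_based",
        "--group_reporting", "--name=journal-test",
    ]
    subprocess.run(cmd, check=True)

# journal_write_test("/dev/sdX")  # only against a disk with no data on it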

I put the journals on the SSDSC2BX016T4R disks, which is then supposed to double my IOPS, but that's not the case.

Rémi


On 2015-11-08 07:06, Hugo Slabbert wrote:
On Sat 2015-Nov-07 09:24:06 +0100, Rémi BUISSON <remi-buisson@xxxxxxxxx> wrote:

Hi guys,

I need your help to figure out performance issues on my Ceph cluster. I've read pretty much every thread on the net concerning this topic, but I haven't managed to get acceptable performance. In my company, we are planning to replace our existing virtualization infrastructure NAS with a Ceph cluster in order to improve overall platform performance, scalability and security. The current NAS handles about 50k IOPS.

For this we bought:
2 x NFS servers: 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 32 GB RAM, 2 x 10Gbps network interfaces (bonding)
3 x MON servers: 1 x Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz, 16 GB RAM, 2 x 10Gbps network interfaces (bonding)
2 x MDS servers: 2 x Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz, 32 GB RAM, 2 x 10Gbps network interfaces (bonding)
2 x OSD servers (cache): 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 256 GB RAM, 2 x SSD INTEL SSDSC2BX200G4 (200 GB) for journal, 6 x SSD INTEL SSDSC2BX016T4R (1.4 TB) for data, 2 x 10Gbps network interfaces (bonding)
4 x OSD servers (storage): 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 256 GB RAM, 4 x SSD TOSHIBA PX02SMF020 (200 GB) for journal, 18 x HGST Ultrastar HUC101818CS4204 (1.8 TB) for data, 2 x 10Gbps network interfaces (bonding)

The total of this is 84 OSDs.

I created two pools with 4096 PGs each, one called rbd-cold-storage and the other rbd-hot-storage. As you may guess, rbd-cold-storage is composed of the 4 OSD servers with spinning disks and rbd-hot-storage is composed of the 2 OSD servers with SSDs. On rbd-cold-storage, I created an RBD device which is mapped on the NFS server.

I benched each of the SSDs we have and they can handle 40k IOPS each. As my replication factor is 2, the theoretical performance of the cluster is (2 x 6 (OSD cache) x 40k) / 2 = 240k IOPS.

Aside from the other more detailed replies re: tuning, isn't the
layout of the caching tier journals sub-optimal in this scenario?
Given the similar model numbers there, I'm assuming the performance
(throughput, IOPS) of the journal & data disks is similar, but please
correct me if I'm wrong there.

My understanding of ceph's design (newer to ceph; please excuse
misunderstandings) is that writes pass through the journals, the OSD
will ACK writes when they are committed to the journal(s) of the OSDs
in that PG (so, one other OSD in this case with a replication factor
of 2), and journals are then flushed to OSDs asynchronously.
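To spell that out (a minimal sketch of the model I have in mind, not Ceph code):

# Conceptual model only: a write is ACKed once every replica's journal has
# committed it; the journal -> data-partition flush happens asynchronously.
def ack_latency_ms(journal_commit_ms_per_replica):
    # client-visible latency is gated by the slowest journal commit
    return max(journal_commit_ms_per_replica)

print(ack_latency_ms([0.05, 0.07]))  # replication factor 2: two journal commits
# The async flush still loads the disks, but it is not on the ACK path,
# which is why the journals bound client write IOPS.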

Rather than "(2 x 6 (OSD cache) x 40k) / 2 = 240k iops", isn't the
calculation actually:

(# hosts) x (# journal disks) x (IOPS per journal disk) / (replication factor) ?

IOW:
(2 x 2 (OSD cache journal SSDs) x 40K) / 2 = 80K
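Or, as a rough sketch of that formula with the numbers from this thread:

def journal_bound_iops(hosts, journals_per_host, iops_per_journal, replication):
    # if client writes are gated by the journals, the data-disk count drops out
    return hosts * journals_per_host * iops_per_journal / replication

# 2 cache hosts x 2 journal SSDs (SSDSC2BX200G4) x 40k IOPS, replication 2
print(journal_bound_iops(2, 2, 40_000, 2))  # -> 80000.0, not 240k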

Yes, putting journals on the same disk as the OSD's data halves your
write performance because data has to flush from the journal partition
to the data partition on the same SSD, but in this case would it not
be more optimal to just chuck the 2x SSDSC2BX200G4 per cache host,
replace them with 2x more data disks (SSDSC2BX016T4R) for 8 total per
cache OSD host, and then go with journals on the same disk?

In that case we're looking at:

(# hosts) x (# journal disks) x (IOPS per journal disk) / (replication
factor) / 2

...where the final division by 2 is our write penalty for sharing
journal and data on the same disk.

So, in this scenario:

2 x 8 (OSD cache SSDs) x 40K / 2 (replication factor) / 2 = 160K

Yes/no?
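A rough sketch comparing the two layouts (the figures are the ones quoted above; the trailing /2 is the colocated-journal write penalty):

def discrete_journals(hosts, journal_ssds, iops_per_ssd, replication):
    return hosts * journal_ssds * iops_per_ssd / replication

def colocated_journals(hosts, data_ssds, iops_per_ssd, replication):
    # extra /2: journal and data share the same SSD
    return hosts * data_ssds * iops_per_ssd / replication / 2

print(discrete_journals(2, 2, 40_000, 2))   # current layout   -> 80000.0
print(colocated_journals(2, 8, 40_000, 2))  # 8 data SSDs/host -> 160000.0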

In a regular "SSD journals + spinners for data" setup, journals on
discrete/partitioned SSDs makes sense in e.g. a 3:1 ratio as your
performance (well, throughput; IOPS is another story) on the SSD will
generally be ~3x what your SAS/SATA spinners can do.  So: 1 SSD has 3x
partitions and serves journals for 3x OSDs backed by spinners, the
numbers are matched up so that it has the capacity to absorb (write)
data as fast as it can flush it down to the spinners and it can pretty
much max the spinners' write capacity.  Overload the SSD with too
many journals and it will be maxed with spinners sitting waiting/idle.
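Roughly (a sketch with illustrative throughput figures, not measurements from this thread):

def journals_per_ssd(ssd_write_mbps, backing_disk_write_mbps):
    # how many OSD journals one SSD can feed before it becomes the bottleneck
    return ssd_write_mbps // backing_disk_write_mbps

print(journals_per_ssd(450, 150))  # ~3 spinners per journal SSD
print(journals_per_ssd(450, 450))  # ~1 when journal and data disks match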

But in scenarios where the performance of the journal SSDs matches the
performance of the backing disks for the OSDs and with a 3:1 ratio on
data disks to journal disks, the data SSDs will still have more write
performance capacity to spare while the journal SSD is maxed.

Don't we need something with greater throughput/IOPS in the journal
than in our data partition in order to make discrete journals be of
benefit?

I guess the alternative to swapping the 2x SSDSC2BX200G4 journals in
the cache for simply more data disks (SSDSC2BX016T4R) would be to go
PCIe/NVMe for the journals in the cache layer, at which point the
discrete journals could be a net plus again?
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



