Re: [SSD NVM FOR JOURNAL] Performance issues

Hello,

On Wed, 23 Aug 2017 09:11:18 -0300 Guilherme Steinmüller wrote:

> Hello!
> 
> I recently installed INTEL SSD 400GB 750 SERIES PCIE 3.0 X4 in 3 of my OSD
> nodes.
> 
Well, you know what's coming now, don't you?

That's a consumer device, with a 70GB-writes-per-day endurance rating.
Unless your cluster is essentially read-only, you're throwing away
money. 
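
For the record, a back-of-the-envelope on that endurance figure (the 70GB/day
rating and the 5-year warranty period are assumed from Intel's spec sheet):

```shell
# Back-of-the-envelope endurance math for the 750: rated at 70 GB of host
# writes per day, over an assumed 5-year warranty period.
GB_PER_DAY=70
DAYS=$((5 * 365))
TOTAL_GB=$((GB_PER_DAY * DAYS))
echo "Rated lifetime writes: ${TOTAL_GB} GB (~$((TOTAL_GB / 1024)) TiB)"
```

A journal absorbing every write for 3-5 OSDs will chew through that in no
time on a busy cluster.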

> First of all, here is a schema describing how my cluster is laid out:
> 
> [image: Inline image 1]
> 
> [image: Inline image 2]
> 
> I primarily use my Ceph cluster as a backend for OpenStack nova, glance,
> swift and cinder. My crushmap is configured to have rulesets for SAS disks,
> SATA disks and another ruleset that resides on HPE nodes, also using SATA
> disks.
> 
> Before installing the new journal in the HPE nodes, I was using one of the
> disks that today are OSDs (osd.35, osd.34 and osd.33). After upgrading the
> journal, I noticed that a dd command writing 1GB blocks in OpenStack nova
> instances doubled its throughput, but I expected a 400% or 500% improvement,
> since that is roughly the throughput we see on the Dell nodes where we have
> another nova pool.
> 
Apples, oranges and bananas. 
You're comparing different HW (and no, I'm not going to look this up)
which may or may not have vastly different capabilities (like HW cache),
RAM and (unlikely relevant) CPU. 
Your NVMe may also be plugged into a different, insufficient PCIe slot for
all we know.
You're also using very different HDDs, which definitely will be a factor.

But most importantly, you're comparing 2 pools of vastly different OSD
count; no wonder a pool with 15 OSDs is faster in sequential writes than
one with 9. 
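
Roughly, aggregate sequential write throughput scales with spindle count.
The per-disk MB/s figures below are assumptions for illustration, not
measurements from your cluster, and filestore's journal double-write is
ignored since your journals sit on the NVMe:

```shell
# Rough ceiling estimate: (number of OSDs x per-disk MB/s) / replica count.
# Per-disk throughput numbers are assumed, not measured.
SAS_OSDS=15;  SAS_MBS=150    # 3 hosts x 5 OSDs, assumed ~150 MB/s per SAS disk
SATA_OSDS=9;  SATA_MBS=110   # 3 hosts x 3 OSDs, assumed ~110 MB/s per SATA disk
REPL=3                       # size=3: every client byte gets written three times
echo "SAS pool ceiling:  $((SAS_OSDS  * SAS_MBS  / REPL)) MB/s"
echo "SATA pool ceiling: $((SATA_OSDS * SATA_MBS / REPL)) MB/s"
```

Even with identical disks, the 15-OSD pool would come out well ahead.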

> Here is a demonstration of the scenario and the difference in performance
> between Dell nodes and HPE nodes:
> 
> 
> 
> Scenario:
> 
> 
>    -    Using pools to store instance disks for OpenStack
> 
> 
>    -     Pool nova in "ruleset SAS" placed on c4-osd201, c4-osd202 and
>    c4-osd203 with 5 OSDs per host
> 
SAS
> 
>    -     Pool nova_hpedl180 in "ruleset NOVA_HPEDL180" placed on c4-osd204,
>    c4-osd205, c4-osd206 with 3 OSDs per host
> 
SATA
> 
>    -     Every OSD has one partition of 35GB on an INTEL SSD 400GB 750
>    SERIES PCIE 3.0 X4
> 
Overkill, but since your NVMe will die shortly anyway...

With large sequential tests, the journal will have nearly NO impact on the
result, even if tuned to that effect.

> 
>    -     Internal link for cluster and public network of 10Gbps
> 
> 
>    -     Deployment via ceph-ansible. Same configuration defined in ansible
>    for every host in the cluster
> 
> 
> 
> *Instance on pool nova in ruleset SAS:*
> 
> 
>    # dd if=/dev/zero of=/mnt/bench bs=1G count=1 oflag=direct
>        1+0 records in
>        1+0 records out
>        1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.56255 s, 419 MB/s
> 
This is a very small test for what you're trying to determine and not
going to be very representative. 
If for example there _is_ a HW cache of 2GB on the Dell nodes, it would
fit nicely in there.
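
A longer run makes it much harder for a couple of GB of controller cache to
absorb the whole write. A minimal sketch; BENCH_FILE and BLOCKS are
placeholders here, for the real test point BENCH_FILE at the RBD-backed
mount and use something like BLOCKS=8192 (8 GiB at 1 MiB blocks):

```shell
# Longer direct-style write test than a single 1 GB dd. conv=fsync flushes
# at the end so the result includes the actual write-out; on the instance
# itself, oflag=direct (as in your tests above) is the better flag.
BENCH_FILE="${BENCH_FILE:-./bench.tmp}"   # placeholder; e.g. /mnt/bench on the instance
BLOCKS="${BLOCKS:-8}"                     # placeholder; e.g. 8192 for an 8 GiB run
OUT=$(dd if=/dev/zero of="$BENCH_FILE" bs=1M count="$BLOCKS" conv=fsync 2>&1)
echo "$OUT"
rm -f "$BENCH_FILE"
```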

> 
> *Instance on pool nova in ruleset NOVA_HPEDL180:*
> 
>      #  dd if=/dev/zero of=/mnt/bench bs=1G count=1 oflag=direct
>      1+0 records in
>      1+0 records out
>      1073741824 bytes (1.1 GB, 1.0 GiB) copied, 11.8243 s, 90.8 MB/s
> 
> 
> I made some FIO benchmarks as suggested by Sebastien (
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> ) and the command with 1 job returned about 180MB/s of throughput on the
> recently installed nodes (HPE nodes). I made some hdparm benchmarks on all
> SSDs and everything seems normal.
> 
I'd consider a 180MB/s result from a device that supposedly does 900MB/s a
fail, but then again those tests above do NOT reflect journal usage
reality; they're more of a hint as to whether something is totally broken.

> 
> I can't see what is causing this difference in throughput, since the network
> is not a problem, and I think CPU and memory are not crucial since I was
> monitoring the cluster with the atop command and didn't notice any resource
> saturation. My only thought is that I have less workload on the nova_hpedl180
> pool in the HPE nodes and fewer disks per node, and this can influence the
> throughput of the journal.
>
How busy are your NVMe journals during that test on the Dells and HPs
respectively?
Same for the HDDs.

Again, run longer, larger tests to get something that will actually
register; also run atop with shorter intervals.

Christian
> 
> Any clue about what is missing or what is happening?
> 
> Thanks in advance.


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



