Re: CEPH IOPS Baseline Measurements with MemStore

>>I am not sure if it is actually possible to completely disable all log messages. For benchmarking I did this at compile time by changing the logging macro in common/dout.h ==> #define dout_impl(cct, sub, v) ....

I think it can be done in ceph.conf
https://ceph.com/docs/master/rados/troubleshooting/log-and-debug/#subsystem-log-and-debug-settings

I remember an old mail from Stefan Priebe from 2012, also reporting a performance decrease with logging:

https://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg09976.html

with a CPU trace here:
https://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg09974/out.pdf


The ceph.conf settings to disable them were:

debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0
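
(The log-and-debug page linked above also covers changing these subsystem levels at runtime on a running daemon with 'ceph tell ... injectargs', so the OSDs don't need to be restarted between benchmark runs.)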



----- Original Message -----

From: "Andreas Joachim Peters" <Andreas.Joachim.Peters@xxxxxxx>
To: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>
Cc: ceph-devel@xxxxxxxxxxxxxxx
Sent: Thursday, 19 June 2014 11:29:27
Subject: RE: CEPH IOPS Baseline Measurements with MemStore

I am not sure if it is actually possible to completely disable all log messages. For benchmarking I did this at compile time by changing the logging macro in common/dout.h ==> #define dout_impl(cct, sub, v) ....
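
For illustration, a minimal self-contained sketch of that compile-time trick (hypothetical LOG/DISABLE_LOGGING/g_log_level names, not the real dout_impl machinery from common/dout.h): put the log statement behind a condition the compiler can prove false, so the stream operands are never evaluated and the whole statement is optimized away.

#include <iostream>

// Hypothetical logging macro, NOT Ceph's dout machinery: it only shows how a
// log statement can be filtered at run time or removed entirely at compile time.

#ifdef DISABLE_LOGGING
// Dead 'else' branch: the '<<' operands are never evaluated and the optimizer
// eliminates the whole statement.
#define LOG(level) if (true) { } else std::cerr
#else
static int g_log_level = 5;   // assumed runtime verbosity threshold
// Runtime filter: messages above the threshold are skipped without formatting.
#define LOG(level) if ((level) > g_log_level) { } else std::cerr
#endif

int main() {
  LOG(1)  << "level 1: printed unless built with -DDISABLE_LOGGING" << std::endl;
  LOG(20) << "level 20: filtered at runtime (or compiled out entirely)" << std::endl;
  return 0;
}

With the runtime check the message is not formatted when it is filtered, but every call site still pays for the branch; compiling the statements out removes even that, which is the effect the dout_impl change aims for.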

I changed 'osd op threads' but that had no visible impact. 

Cheers Andreas. 

________________________________________ 
From: Alexandre DERUMIER [aderumier@xxxxxxxxx] 
Sent: 19 June 2014 11:21 
To: Andreas Joachim Peters 
Cc: ceph-devel@xxxxxxxxxxxxxxx 
Subject: Re: CEPH IOPS Baseline Measurements with MemStore 

Hi, 

Thanks for your benchmark ! 

>>If you have some ideas for parameters to tune or see some mistakes in this measurement - let me know! 

>>1) Default logging has a significant impact on IOPS & latency [0.1-0.2 ms]
How do you enable/disable the logging? (in ceph.conf?)


>>2) The OSD implementation without journaling does not scale linearly with concurrent IOs - several OSDs are needed to scale IOPS - lock contention/threading model?
It's quite possible. I have seen a lot of benchmarks with SSDs, and the OSD daemon was always the bottleneck: more OSDs, more scaling.

>>3) a writing OSD never fills more than 4 cores 
>>4) a reading OSD never fills more than 5 cores 

maybe "osd op threads" could improve this?
The default is 2 (I don't know whether, with hyperthreading, that ends up on 4 cores instead of 2?)


----- Original Message -----

From: "Andreas Joachim Peters" <Andreas.Joachim.Peters@xxxxxxx>
To: ceph-devel@xxxxxxxxxxxxxxx
Sent: Thursday, 19 June 2014 11:05:18
Subject: CEPH IOPS Baseline Measurements with MemStore

Hi, 

I made some benchmarks/tests using the firefly branch built with GCC 4.9. The hardware is 2 CPUs, 6-core Intel(R) Xeon(R) CPU E5-2630L 0 @ 2.00GHz with hyperthreading, and 256 GB of memory (kernel 2.6.32-431.17.1.el6.x86_64).

In my tests I run two OSD configurations on a single box: 

[A] 4 OSDs running with MemStore 
[B] 1 OSD running with MemStore 

I use a pool with 'size=1' and read and write 1-byte objects, all via localhost.

The local RTT reported by ping is 15 microseconds; the RTT measured with ZMQ is 100 microseconds (10 kHz synchronous 1-byte messages).
The equivalent rate measured with another file IO daemon we use at CERN (XRootD, 31-byte messages) is 9.9 kHz.
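
For reference, the single-IO-in-flight numbers below correspond roughly to a loop like the following (a minimal librados sketch, not the rados bench code itself; the pool name 'bench' and a readable default ceph.conf are assumptions):

#include <rados/librados.hpp>
#include <chrono>
#include <iostream>
#include <string>

int main() {
  librados::Rados cluster;
  if (cluster.init(NULL) < 0) return 1;                 // default client.admin identity
  cluster.conf_read_file(NULL);                         // default ceph.conf search path
  if (cluster.connect() < 0) return 1;

  librados::IoCtx io;
  if (cluster.ioctx_create("bench", io) < 0) return 1;  // 'bench' pool is a placeholder

  librados::bufferlist bl;
  bl.append("x", 1);                                    // 1-byte payload

  const int n = 10000;
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < n; ++i) {
    // one synchronous 1-byte write per iteration == exactly 1 IO in flight
    io.write_full("obj-" + std::to_string(i), bl);
  }
  auto t1 = std::chrono::steady_clock::now();

  double secs = std::chrono::duration<double>(t1 - t0).count();
  std::cout << (n / secs) / 1000.0 << " kHz, "
            << (1000.0 * secs) / n << " ms average latency" << std::endl;

  io.close();
  cluster.shutdown();
  return 0;
}

Built with something like 'g++ -std=c++11 -O2 bench.cc -o bench -lrados'; reads can be timed the same way using io.read() on the objects written above.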

------------------------------------------------------------------------------------------------------------------------- 
4 OSDs 
------------------------------------------------------------------------------------------------------------------------- 

{1} [A] 
******* 
I measure IOPS with 1-byte objects for separate write and read operations, with logging disabled for all subsystems:

Type : IOPS[kHz] : Latency [ms] : ConcurIO [#] 
=================================== 
Write : 01.7 : 0.50 : 1 
Write : 11.2 : 0.88 : 10
Write : 11.8 : 1.69 : 10 x 2 [ 2 rados bench processes ]
Write : 11.2 : 3.57 : 10 x 4 [ 4 rados bench processes ]
Read : 02.6 : 0.33 : 1 
Read : 22.4 : 0.43 : 10 
Read : 40.0 : 0.97 : 20 x 2 [ 2 rados bench processes ] 
Read : 46.0 : 0.88 : 10 x 4 [ 4 rados bench processes ] 
Read : 40.0 : 1.60 : 20 x 4 [ 4 rados bench processes ] 
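
(Sanity check on the concurrency scaling: for synchronous IO, IOPS ≈ in-flight IOs / latency. E.g. 10 / 0.88 ms ≈ 11.4 kHz for the 10-way write case vs. the measured 11.2 kHz, and 40 / 0.88 ms ≈ 45 kHz for the 10 x 4 read case vs. the measured 46.0 kHz.)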

{2} [A] 
******* 
I measure IOPS with the CEPH firefly branch as is (default logging):

Type : IOPS[kHz] : Latency [ms] : ConcurIO [#] 
=================================== 
Write : 01.2 : 0.78 : 1 
Write : 09.1 : 1.00 : 10 
Read : 01.8 : 0.50 : 1 
Read : 14.0 : 1.00 : 10 
Read : 18.0 : 2.00 : 20 x 2 [ 2 rados bench processes ]
Read : 18.0 : 2.20 : 10 x 4 [ 4 rados bench processes ] 

------------------------------------------------------------------------------------------------------------------------- 
1 OSD 
------------------------------------------------------------------------------------------------------------------------- 

{1} [B] (subsys logging disabled, 1 OSD) 
******* 
Write : 02.0 : 0.46 : 1 
Write : 10.0 : 0.95 : 10 
Write : 11.1 : 1.74 : 20 
Write : 12.0 : 1.80 : 10 x 2 [ 2 rados bench processes ] 
Write : 10.8 : 3.60 : 10 x 4 [ 4 rados bench processes ] 
Read : 03.6 : 0.27 : 1 
Read : 16.9 : 0.50 : 10 
Read : 28.0 : 0.70 : 10 x 2 [ 2 rados bench processes ] 
Read : 29.6 : 1.37 : 20 x 2 [ 2 rados bench processes ] 
Read : 27.2 : 1.50 : 10 x 4 [ 4 rados bench processes ] 

{2} [B] (default logging, 1 OSD)
******* 
Write : 01.4 : 0.68 : 1 
Write : 04.0 : 2.35 : 10 
Write : 04.0 : 4.69 : 10 x 2 [ 2 rados bench processes ] 

I also played with the OSD thread number (no change) and used an in-memory filesystem + journaling (filestore backend). Here the {1} [A] result is 1.4 kHz write with 1 IO in flight, and the peak write performance, with many IOs in flight and several rados bench processes, is 2.3 kHz!


Some summarizing remarks: 

1) Default logging has a significant impact on IOPS & latency [0.1-0.2 ms]
2) The OSD implementation without journaling does not scale linearly with concurrent IOs - several OSDs are needed to scale IOPS - lock contention/threading model?
3) a writing OSD never fills more than 4 cores
4) a reading OSD never fills more than 5 cores
5) running 'rados bench' on a remote machine gives similar or slightly worse results (up to -20%)
6) CEPH delivering 20k read IOPS uses 4 cores on the server side, while identical operations with a higher payload (XRootD) use one core for 3x higher performance (60k IOPS) - see the rough per-op CPU estimate after this list
7) I can scale the other IO daemon (XRootD) to use 10 cores and deliver 300,000 IOPS on the same box.
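
A rough per-operation CPU estimate from (6): 4 cores delivering 20k read IOPS is about 4 / 20,000 s ≈ 200 microseconds of CPU per operation, while XRootD at 60k IOPS on a single core spends roughly 1 / 60,000 s ≈ 17 microseconds per operation, i.e. on the order of 10x less CPU per IO.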

Looking forward to SSDs and volatile-memory backend stores, I see some improvements to be made in the OSD/communication layer.

If you have some ideas for parameters to tune or see some mistakes in this measurement - let me know! 

Cheers Andreas. 
