Re: rbd performance issue - can't find bottleneck

Mark Nelson <mnelson@xxxxxxxxxx> · Fri, 19 Jun 2015 11:08:31 -0500

On 06/19/2015 10:29 AM, Andrei Mikhailovsky wrote:
Mark,

Thanks, I do understand that there is a risk of data loss by doing this. Having said this, ceph is designed to be fault tolerant and self-repairing should something happen to individual journals, osds and server nodes. Isn't this still a good compromise between data integrity and speed? So, by faking dsync and not actually doing it, you have a window of opportunity for data loss should a failure happen between the last flush and the moment of failure.

Thus, if an ssd disk failure happens, regardless of whether dsync is used or not, would ceph still consider the osds behind the journal to be unavailable/lost, migrate the data around anyway, and perform the necessary checks to make sure the data integrity is not compromised? If this is true, I would still consider using the dsync bypass in favour of the extra speed benefit. Unless I am missing the bigger picture or have miscalculated something.

Could someone please elaborate on this a bit further so I can understand the real-world threat of bypassing dsync?

Hi Andrei,

Basically the entire point of the Ceph journal is to guarantee that data 
hits a persistent medium before the write gets acknowledged.  Imagine a 
scenario where you lose power just as the write happens.

Scenario A:  You have proper O_DSYNC writes.  In this case, assuming the 
SSD is behaving properly, you can be fairly confident that the write to 
the local journal succeeded (or not).

Scenario B: You bypass O_DSYNC.  The journal write "completes" quickly, 
but it's not actually written out to flash, just to the drive cache.  If 
the SSD has power loss protection it can theoretically write that data 
out to the flash before it loses power.  For this reason, drives with 
PLP can often perform O_DSYNC writes very quickly even without this hack 
(i.e. they can ignore ATA_CMD_FLUSH).

For a drive like the 530 without PLP, there's no guarantee that the data 
in cache will hit the flash.  Ceph will *think* it did though, and the 
risk is worse because the write "completes" so fast.  Now you have a 
scenario where ceph thinks something exists but it really doesn't (or 
exists in a corrupted state).  This leads to all sorts of problems.  If 
another OSD goes down and you have two copies of the data that disagree 
with each other, what do you do?  What if not all of the replica writes 
succeeded but you have a copy of the data on the primary?  Can you trust 
it?  Everything starts breaking down.
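
To see the gap between the two scenarios on a given drive, here is a minimal Python sketch (not from the thread) that times the same sequence of writes with and without O_DSYNC. The function name, block size, count and file path are illustrative assumptions. The non-dsync run goes through the page cache, so it is not the same as dd with oflag=direct, and the numbers only mean something if the target file lives on the SSD under test.

```python
import os
import tempfile
import time


def time_writes(path, block_size=4096, count=64, dsync=False):
    """Write `count` blocks of zeros to `path`; return elapsed seconds."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if dsync:
        # With O_DSYNC each write() returns only after the device
        # reports the data as durable.
        flags |= os.O_DSYNC
    block = b"\0" * block_size
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
        return time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical target: a temporary file. Point this at the journal
    # SSD's filesystem to get figures that are relevant to Ceph.
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "journal-test")
        buffered = time_writes(target, dsync=False)
        synced = time_writes(target, dsync=True)
        print(f"buffered: {buffered:.4f}s  O_DSYNC: {synced:.4f}s")
```

A drive that honours the flush should show a clearly slower O_DSYNC run unless it has power loss protection; a drive that returns almost instantly without PLP is probably acknowledging writes that are still in its cache.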

Mark

Cheers

Andrei

----- Original Message -----
From: "Mark Nelson" <mnelson@xxxxxxxxxx>
To: ceph-users@xxxxxxxxxxxxxx
Sent: Friday, 19 June, 2015 3:59:55 PM
Subject: Re:  rbd performance issue - can't find bottleneck

On 06/19/2015 09:54 AM, Andrei Mikhailovsky wrote:
Hi guys,

I also use a combination of intel 520 and 530 for my journals and have
noticed that the latency and the speed of the 520s are better than the 530s.

Could someone please confirm that doing the following at start up will stop
the dsync on the relevant drives?

# echo temporary write through > /sys/class/scsi_disk/1\:0\:0\:0/cache_type

Do I need to patch my kernel for this or is this already available in
vanilla? I am running the 3.19.x branch from the ubuntu testing repo.

Would the above change the performance of 530s to be more like 520s?

I need to comment that it's *really* not a good idea to do this if you
care about data integrity.  There's a reason why the 530 is slower than
the 520.  If you need speed and you care about your data, you should
really consider jumping up to the DC S3700.

There's a possibility that the 730 *may* be ok as it supposedly has
power loss protection, but it's still not using HET MLC so the flash
cells will wear out faster.  It's also a consumer grade drive, so no one
will give you support for this kind of use case if you have problems.

Mark

Cheers

Andrei

----- Original Message -----
From: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>
To: "Jacek Jarosiewicz" <jjarosiewicz@xxxxxxxxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Thursday, 18 June, 2015 11:54:42 AM
Subject: Re:  rbd performance issue - can't find bottleneck

Hi,

for read benchmark

with fio, what is the iodepth ?

my fio 4k randread results:

iodepth=1  : bw=6795.1KB/s, iops=1698
iodepth=2  : bw=14608KB/s, iops=3652
iodepth=4  : bw=32686KB/s, iops=8171
iodepth=8  : bw=76175KB/s, iops=19043
iodepth=16 : bw=173651KB/s, iops=43412
iodepth=32 : bw=336719KB/s, iops=84179

(This should be similar to the rados bench -t (threads) option).

This is normal because of network latencies + ceph latencies.
More parallelism increases iops.

(A benchmark with "dd" is equivalent to iodepth=1.)

These results are with 1 client/rbd volume.

Now with more fio clients (numjobs=X),

I can reach up to 300k iops with 8-10 clients.

This should be the same as launching multiple rados bench in parallel

(BTW, it could be great to have an option in rados bench to do it)
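
The scaling in the fio figures above can be read through Little's law (requests in flight = throughput x latency). The short Python sketch below is an illustration, not something from the thread: it derives the per-request latency implied by each iodepth/iops pair. The function name is made up for the example, and the calculation assumes the queue stays full for the whole run.

```python
# (iodepth, iops) pairs copied from the fio 4k random-read results above.
FIO_RESULTS = [(1, 1698), (2, 3652), (4, 8171),
               (8, 19043), (16, 43412), (32, 84179)]


def mean_latency_ms(iodepth, iops):
    """Little's law rearranged: latency = requests in flight / throughput."""
    return iodepth / iops * 1000.0


if __name__ == "__main__":
    for iodepth, iops in FIO_RESULTS:
        latency = mean_latency_ms(iodepth, iops)
        print(f"iodepth={iodepth:<2} iops={iops:<6} "
              f"implied latency={latency:.3f} ms")
```

The implied latency stays between roughly 0.37 ms and 0.59 ms at every depth, which is consistent with the point that a single outstanding request is bounded by round-trip latency and that iops grows mainly by keeping more requests in flight.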

----- Original Message -----
From: "Jacek Jarosiewicz" <jjarosiewicz@xxxxxxxxxxxxx>
To: "Mark Nelson" <mnelson@xxxxxxxxxx>, "ceph-users"
<ceph-users@xxxxxxxxxxxxxx>
Sent: Thursday, 18 June, 2015 11:49:11
Subject: Re:  rbd performance issue - can't find bottleneck

On 06/17/2015 04:19 PM, Mark Nelson wrote:
SSD's are INTEL SSDSC2BW240A4

Ah, if I'm not mistaken that's the Intel 530 right? You'll want to see
this thread by Stefan Priebe:

https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg05667.html

In fact it was the difference in Intel 520 and Intel 530 performance
that triggered many of the different investigations that have taken
place by various folks into SSD flushing behavior on ATA_CMD_FLUSH. The
gist of it is that the 520 is very fast but probably not safe. The 530
is safe but not fast. The DC S3700 (and similar drives with super
capacitors) are thought to be both fast and safe (though some drives
like the Crucial M500 and later misrepresented their power loss
protection so you have to be very careful!)

Yes, these are Intel 530.
I did the tests described in the thread you pasted and unfortunately
that's my case... I think.

The dd run locally on a mounted ssd partition looks like this:

[root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=10000 oflag=direct,dsync
10000+0 records in
10000+0 records out
3584000000 bytes (3.6 GB) copied, 211.698 s, 16.9 MB/s

and when I skip the flag dsync it goes fast:

[root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=10000 oflag=direct
10000+0 records in
10000+0 records out
3584000000 bytes (3.6 GB) copied, 9.05432 s, 396 MB/s

(I used the same 350k block size as mentioned in the e-mail from the
thread above)
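
For a sense of scale, the two dd runs above can be turned into a per-write latency. The Python sketch below is only arithmetic on the figures dd printed; the constant and function names are invented for the illustration.

```python
BLOCK_BYTES = 350 * 1024   # dd bs=350k
COUNT = 10000              # dd count=10000

# Elapsed seconds reported by dd in the two runs above.
RUNS = {"direct,dsync": 211.698, "direct": 9.05432}


def per_write_latency_ms(elapsed_s, count=COUNT):
    """Average time spent on each write, in milliseconds."""
    return elapsed_s / count * 1000.0


def throughput_mb_s(elapsed_s, block_bytes=BLOCK_BYTES, count=COUNT):
    """Throughput in decimal megabytes per second, as dd reports it."""
    return block_bytes * count / elapsed_s / 1e6


if __name__ == "__main__":
    for flags, elapsed in RUNS.items():
        print(f"oflag={flags:<13} "
              f"{per_write_latency_ms(elapsed):7.3f} ms/write  "
              f"{throughput_mb_s(elapsed):6.1f} MB/s")
```

Each synchronous 350k write takes about 21 ms on this drive, against under 1 ms without dsync, so a journal that issues O_DSYNC writes waits roughly 23 times longer per write.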

I tried disabling the dsync like this:

[root@cf02 ~]# echo temporary write through > /sys/class/scsi_disk/1\:0\:0\:0/cache_type

[root@cf02 ~]# cat /sys/class/scsi_disk/1\:0\:0\:0/cache_type
write through
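
When several journal SSDs are involved it helps to check what every disk currently reports rather than one path at a time. The Python sketch below is an illustration, not part of the original exchange. It only reads the cache_type attribute used in the commands above for each entry under /sys/class/scsi_disk and changes nothing. The function name and the idea of passing an alternative root directory are assumptions made for the example.

```python
import glob
import os


def read_cache_types(sysfs_root="/sys/class/scsi_disk"):
    """Return {scsi_address: cache_type} for every disk under sysfs_root."""
    result = {}
    pattern = os.path.join(sysfs_root, "*", "cache_type")
    for path in sorted(glob.glob(pattern)):
        # The parent directory name is the SCSI address, e.g. 1:0:0:0.
        address = os.path.basename(os.path.dirname(path))
        with open(path) as handle:
            result[address] = handle.read().strip()
    return result


if __name__ == "__main__":
    for address, mode in read_cache_types().items():
        print(f"{address}: {mode}")
```

A disk showing "write through" after the echo above is one where the kernel no longer sends cache flushes, which is exactly the state Mark warns about for drives without power loss protection.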

..and then locally I see the speedup:

[root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=10000 oflag=direct,dsync
10000+0 records in
10000+0 records out
3584000000 bytes (3.6 GB) copied, 10.4624 s, 343 MB/s

..but when I test it from a client I still get slow results:

root@cf03:/ceph/tmp# dd if=/dev/zero of=test bs=100M count=100 oflag=direct
100+0 records in
100+0 records out
10485760000 bytes (10 GB) copied, 122.482 s, 85.6 MB/s

and fio gives the same 2-3k iops.

after the change to SSD cache_type I tried remounting the test image,
recreating it and so on - nothing helped.

I ran rbd bench-write on it, and it's not good either:

root@cf03:~# rbd bench-write t2
bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq
SEC OPS OPS/SEC BYTES/SEC
1 4221 4220.64 32195919.35
2 9628 4813.95 36286083.00
3 15288 4790.90 35714620.49
4 19610 4902.47 36626193.93
5 24844 4968.37 37296562.14
6 30488 5081.31 38112444.88
7 36152 5164.54 38601615.10
8 41479 5184.80 38860207.38
9 46971 5218.70 39181437.52
10 52219 5221.77 39322641.34
11 56666 5151.36 38761566.30
12 62073 5172.71 38855021.35
13 65962 5073.95 38182880.49
14 71541 5110.02 38431536.17
15 77039 5135.85 38615125.42
16 82133 5133.31 38692578.98
17 87657 5156.24 38849948.84
18 92943 5141.03 38635464.85
19 97528 5133.03 38628548.32
20 103100 5154.99 38751359.30
21 108952 5188.09 38944016.94
22 114511 5205.01 38999594.18
23 120319 5231.17 39138227.64
24 125975 5248.92 39195739.46
25 131438 5257.50 39259023.06
26 136883 5264.72 39344673.41
27 142362 5272.66 39381638.20
elapsed: 27 ops: 143789 ops/sec: 5273.01 bytes/sec: 39376124.30

rados bench gives:

root@cf03:~# rados -p rbd bench 30 write --no-cleanup
Maintaining 16 concurrent writes of 4194304 bytes for up to 30 seconds or 0 objects
Object prefix: benchmark_data_cf03_21194
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 28 12 47.9863 48 0.779211 0.48964
2 16 43 27 53.9886 60 1.17958 0.775733
3 16 59 43 57.322 64 0.157145 0.798348
4 16 73 57 56.9897 56 0.424493 0.862553
5 16 89 73 58.39 64 0.246444 0.893064
6 16 104 88 58.6569 60 1.67389 0.901757
7 16 120 104 59.4186 64 1.78324 0.935242
8 16 132 116 57.9905 48 1.50035 0.963947
9 16 147 131 58.2128 60 1.85047 0.978697
10 16 161 145 57.9908 56 0.133187 0.999999
11 16 174 158 57.4455 52 1.59548 1.02264
12 16 189 173 57.6577 60 0.179966 1.01623
13 16 206 190 58.4526 68 1.93064 1.02108
14 16 221 205 58.5624 60 1.54504 1.02566
15 16 236 220 58.6578 60 1.69023 1.0301
16 16 251 235 58.7411 60 1.5683 1.02514
17 16 263 247 58.1089 48 1.99782 1.0293
18 16 278 262 58.2136 60 2.03487 1.03552
19 16 295 279 58.7282 68 0.292065 1.03412
20 16 310 294 58.7913 60 1.61331 1.0436
21 16 323 307 58.4675 52 0.161555 1.04393
22 16 335 319 57.9914 48 1.55905 1.05392
23 16 351 335 58.2523 64 0.317811 1.04937
24 16 369 353 58.8247 72 1.76145 1.05415
25 16 383 367 58.7114 56 1.25224 1.05758
26 16 399 383 58.9145 64 1.46604 1.05593
27 16 414 398 58.9544 60 0.349479 1.04213
28 16 431 415 59.2771 68 0.74857 1.04895
29 16 448 432 59.5776 68 1.16596 1.04986
30 16 464 448 59.7247 64 0.195269 1.04202
31 16 465 449 57.9271 4 1.25089 1.04249
Total time run: 31.407987
Total writes made: 465
Write size: 4194304
Bandwidth (MB/sec): 59.221

Stddev Bandwidth: 15.5579
Max bandwidth (MB/sec): 72
Min bandwidth (MB/sec): 0
Average Latency: 1.07412
Stddev Latency: 0.691676
Max latency: 2.52896
Min latency: 0.113751

and reading:

root@cf03:/ceph/tmp# rados -p rbd bench 30 rand
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 43 27 107.964 108 0.650441 0.415883
2 16 71 55 109.972 112 0.624493 0.485735
3 16 100 84 111.975 116 0.77036 0.518524
4 16 128 112 111.977 112 0.329123 0.522431
5 16 155 139 111.179 108 0.702401 0.538305
6 16 184 168 111.979 116 0.7502 0.543431
7 16 213 197 112.551 116 0.46755 0.547047
8 16 240 224 111.981 108 0.430872 0.548855
9 16 268 252 111.981 112 0.740558 0.550753
10 16 297 281 112.381 116 0.340352 0.551335
11 16 325 309 112.345 112 1.14164 0.544646
12 16 353 337 112.315 112 0.46038 0.555206
13 16 382 366 112.597 116 0.727224 0.556029
14 16 410 394 112.553 112 0.673523 0.557172
15 16 438 422 112.516 112 0.543171 0.558385
16 16 466 450 112.482 112 0.370119 0.557367
17 16 494 478 112.453 112 0.89322 0.556681
18 16 522 506 112.427 112 0.651126 0.559601
19 16 551 535 112.614 116 0.801207 0.55739
20 16 579 563 112.583 112 0.92365 0.558744
21 16 607 591 112.554 112 0.679443 0.55983
22 16 635 619 112.528 112 0.273806 0.557695
23 16 664 648 112.679 116 0.33258 0.559718
24 15 691 676 112.65 112 0.141288 0.559192
25 16 720 704 112.623 112 0.901803 0.559435
26 16 748 732 112.598 112 0.807202 0.559793
27 16 776 760 112.576 112 0.747424 0.561044
28 16 805 789 112.698 116 0.817418 0.560835
29 16 833 817 112.673 112 0.711397 0.562342
30 16 861 845 112.65 112 0.520696 0.562809
Total time run: 30.547818
Total reads made: 861
Read size: 4194304
Bandwidth (MB/sec): 112.741

Average Latency: 0.566574
Max latency: 1.2147
Min latency: 0.06128
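
A rough consistency check on the two rados bench summaries above: with a fixed number of operations in flight, bandwidth should be close to concurrency divided by average latency, times the object size. The Python sketch below is illustrative only. It uses the 16 concurrent operations and 4194304-byte objects shown in the output and treats rados bench's MB as 4194304-byte objects counted as 4 MB each, which is an assumption about how the tool reports bandwidth.

```python
CONCURRENCY = 16     # "Maintaining 16 concurrent writes"
OBJECT_SIZE_MB = 4   # 4194304-byte objects, counted as 4 MB each

# (average latency in seconds, reported bandwidth in MB/s) from above.
REPORTED = {"write": (1.07412, 59.221), "rand read": (0.566574, 112.741)}


def predicted_bandwidth_mb_s(avg_latency_s,
                             concurrency=CONCURRENCY,
                             object_mb=OBJECT_SIZE_MB):
    """Bandwidth implied by keeping `concurrency` ops in flight."""
    ops_per_second = concurrency / avg_latency_s
    return ops_per_second * object_mb


if __name__ == "__main__":
    for name, (latency, reported) in REPORTED.items():
        predicted = predicted_bandwidth_mb_s(latency)
        print(f"{name:<9} predicted={predicted:6.1f} MB/s  "
              f"reported={reported:6.1f} MB/s")
```

Both predictions land within about 1 MB/s of what rados bench reported, so the measured bandwidth appears to be set by per-operation latency (about 1.07 s per 4 MB write) at this concurrency. Lowering that latency, or raising concurrency with -t, is what would move the number.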

so.. in order to increase performance, do I need to change the ssd drives?

J

--
Jacek Jarosiewicz
IT Systems Administrator

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com