Mark, thanks for putting it this way. It does make sense.

Does this mean that the Intel 520s, which bypass dsync, are a threat to the data stored on the journals? I have a few of these installed, alongside 530s, and did not plan to replace them just yet. Would it make more sense to put a small battery-protected RAID card in front of the 520s and 530s to protect against these types of scenarios?

Cheers

----- Original Message -----
> From: "Mark Nelson" <mnelson@xxxxxxxxxx>
> To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Sent: Friday, 19 June, 2015 5:08:31 PM
> Subject: Re: rbd performance issue - can't find bottleneck
>
> On 06/19/2015 10:29 AM, Andrei Mikhailovsky wrote:
> > Mark,
> >
> > Thanks, I do understand that there is a risk of data loss by doing this.
> > Having said this, ceph is designed to be fault tolerant and self
> > repairing should something happen to individual journals, osds and server
> > nodes. Isn't this still a good compromise between data
> > integrity and speed? So, by faking dsync and not actually doing it, you
> > have a window of opportunity for data loss should a failure happen between
> > the last flush and the moment of failure.
> >
> > Thus, if an ssd disk failure happens, regardless of whether dsync is used
> > or not, would ceph still consider the osds behind the journal to be
> > unavailable/lost and migrate the data around anyway and perform the
> > necessary checks to make sure the data integrity is not compromised? If
> > this is true, I would still consider using the dsync bypass in favour of
> > the extra speed benefit. Unless I am missing the bigger picture and
> > have miscalculated something.
> >
> > Could someone please elaborate on this a bit further to understand the
> > real-world threat of bypassing dsync?
>
> Hi Andrei,
>
> Basically the entire point of the Ceph journal is to guarantee that data
> hits a persistent medium before the write gets acknowledged.
> Imagine a scenario where you lose power just as the write happens.
>
> Scenario A: You have proper O_DSYNC writes. In this case, assuming the
> SSD is behaving properly, you can be fairly confident that the write to
> the local journal succeeded (or not).
>
> Scenario B: You bypass O_DSYNC. The journal write "completes" quickly,
> but it's not actually written out to flash, just to the drive cache. If
> the SSD has power loss protection it can theoretically write that data
> out to the flash before it loses power. For this reason, drives with
> PLP can often perform O_DSYNC writes very quickly even without this hack
> (i.e. they can ignore ATA_CMD_FLUSH).
>
> For a drive like the 530 without PLP, there's no guarantee that the data
> in cache will hit the flash. Ceph will *think* it did though, and the
> risk is worse because the write "completes" so fast. Now you have a
> scenario where ceph thinks something exists but it really doesn't (or
> exists in a corrupted state). This leads to all sorts of problems. If
> another OSD goes down and you have two copies of the data that disagree
> with each other, what do you do? What if not all of the replica writes
> succeeded but you have a copy of the data on the primary? Can you trust
> it? Everything starts breaking down.
>
> Mark
>
> >
> > Cheers
> >
> > Andrei
> >
> >
> > ----- Original Message -----
> >> From: "Mark Nelson" <mnelson@xxxxxxxxxx>
> >> To: ceph-users@xxxxxxxxxxxxxx
> >> Sent: Friday, 19 June, 2015 3:59:55 PM
> >> Subject: Re: rbd performance issue - can't find bottleneck
> >>
> >> On 06/19/2015 09:54 AM, Andrei Mikhailovsky wrote:
> >>> Hi guys,
> >>>
> >>> I also use a combination of intel 520 and 530 for my journals and have
> >>> noticed that the latency and the speed of the 520s are better than the 530s.
> >>>
> >>> Could someone please confirm that doing the following at start up will
> >>> stop the dsync on the relevant drives?
> >>>
> >>> # echo temporary write through > /sys/class/scsi_disk/1\:0\:0\:0/cache_type
> >>>
> >>> Do I need to patch my kernel for this or is this already available in
> >>> vanilla? I am running the 3.19.x branch from the ubuntu testing repo.
> >>>
> >>> Would the above change the performance of 530s to be more like 520s?
> >>
> >> I need to comment that it's *really* not a good idea to do this if you
> >> care about data integrity. There's a reason why the 530 is slower than
> >> the 520. If you need speed and you care about your data, you should
> >> really consider jumping up to the DC S3700.
> >>
> >> There's a possibility that the 730 *may* be ok as it supposedly has
> >> power loss protection, but it's still not using HET MLC so the flash
> >> cells will wear out faster. It's also a consumer grade drive, so no one
> >> will give you support for this kind of use case if you have problems.
> >>
> >> Mark
> >>
> >>>
> >>> Cheers
> >>>
> >>> Andrei
> >>>
> >>>
> >>> ----- Original Message -----
> >>>> From: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>
> >>>> To: "Jacek Jarosiewicz" <jjarosiewicz@xxxxxxxxxxxxx>
> >>>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> >>>> Sent: Thursday, 18 June, 2015 11:54:42 AM
> >>>> Subject: Re: rbd performance issue - can't find bottleneck
> >>>>
> >>>> Hi,
> >>>>
> >>>> for the read benchmark with fio, what is the iodepth?
> >>>>
> >>>> my fio 4k randread results with
> >>>>
> >>>> iodepth=1  : bw=6795.1KB/s, iops=1698
> >>>> iodepth=2  : bw=14608KB/s, iops=3652
> >>>> iodepth=4  : bw=32686KB/s, iops=8171
> >>>> iodepth=8  : bw=76175KB/s, iops=19043
> >>>> iodepth=16 : bw=173651KB/s, iops=43412
> >>>> iodepth=32 : bw=336719KB/s, iops=84179
> >>>>
> >>>> (This should be similar with the rados bench -t (threads) option).
> >>>>
> >>>> This is normal because of network latencies + ceph latencies.
> >>>> Doing more parallelism increases iops.
> >>>>
> >>>> (doing a bench with "dd" = iodepth=1)
> >>>>
> >>>> These results are with 1 client/rbd volume.
> >>>>
> >>>>
> >>>> now with more fio clients (numjobs=X)
> >>>>
> >>>> I can reach up to 300kiops with 8-10 clients.
> >>>>
> >>>>
> >>>> This should be the same with launching multiple rados bench runs in parallel
> >>>>
> >>>> (BTW, it would be great to have an option in rados bench to do it)
> >>>>
> >>>>
> >>>> ----- Original Message -----
> >>>> From: "Jacek Jarosiewicz" <jjarosiewicz@xxxxxxxxxxxxx>
> >>>> To: "Mark Nelson" <mnelson@xxxxxxxxxx>, "ceph-users"
> >>>> <ceph-users@xxxxxxxxxxxxxx>
> >>>> Sent: Thursday, 18 June 2015 11:49:11
> >>>> Subject: Re: rbd performance issue - can't find bottleneck
> >>>>
> >>>> On 06/17/2015 04:19 PM, Mark Nelson wrote:
> >>>>>> SSD's are INTEL SSDSC2BW240A4
> >>>>>
> >>>>> Ah, if I'm not mistaken that's the Intel 530 right? You'll want to see
> >>>>> this thread by Stefan Priebe:
> >>>>>
> >>>>> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg05667.html
> >>>>>
> >>>>> In fact it was the difference in Intel 520 and Intel 530 performance
> >>>>> that triggered many of the different investigations that have taken
> >>>>> place by various folks into SSD flushing behavior on ATA_CMD_FLUSH. The
> >>>>> gist of it is that the 520 is very fast but probably not safe. The 530
> >>>>> is safe but not fast. The DC S3700 (and similar drives with super
> >>>>> capacitors) are thought to be both fast and safe (though some drives
> >>>>> like the Crucial M500 and later misrepresented their power loss
> >>>>> protection so you have to be very careful!)
> >>>>>
> >>>>
> >>>> Yes, these are Intel 530.
> >>>> I did the tests described in the thread you pasted and unfortunately
> >>>> that's my case... I think.
> >>>>
> >>>> The dd run locally on a mounted ssd partition looks like this:
> >>>>
> >>>> [root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=10000 oflag=direct,dsync
> >>>> 10000+0 records in
> >>>> 10000+0 records out
> >>>> 3584000000 bytes (3.6 GB) copied, 211.698 s, 16.9 MB/s
> >>>>
> >>>> and when I skip the flag dsync it goes fast:
> >>>>
> >>>> [root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=10000 oflag=direct
> >>>> 10000+0 records in
> >>>> 10000+0 records out
> >>>> 3584000000 bytes (3.6 GB) copied, 9.05432 s, 396 MB/s
> >>>>
> >>>> (I used the same 350k block size as mentioned in the e-mail from the
> >>>> thread above)
> >>>>
> >>>> I tried disabling the dsync like this:
> >>>>
> >>>> [root@cf02 ~]# echo temporary write through > /sys/class/scsi_disk/1\:0\:0\:0/cache_type
> >>>>
> >>>> [root@cf02 ~]# cat /sys/class/scsi_disk/1\:0\:0\:0/cache_type
> >>>> write through
> >>>>
> >>>> ..and then locally I see the speedup:
> >>>>
> >>>> [root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=10000 oflag=direct,dsync
> >>>> 10000+0 records in
> >>>> 10000+0 records out
> >>>> 3584000000 bytes (3.6 GB) copied, 10.4624 s, 343 MB/s
> >>>>
> >>>>
> >>>> ..but when I test it from a client I still get slow results:
> >>>>
> >>>> root@cf03:/ceph/tmp# dd if=/dev/zero of=test bs=100M count=100 oflag=direct
> >>>> 100+0 records in
> >>>> 100+0 records out
> >>>> 10485760000 bytes (10 GB) copied, 122.482 s, 85.6 MB/s
> >>>>
> >>>> and fio gives the same 2-3k iops.
> >>>>
> >>>> after the change to SSD cache_type I tried remounting the test image,
> >>>> recreating it and so on - nothing helped.
> >>>>
> >>>> I ran rbd bench-write on it, and it's not good either:
> >>>>
> >>>> root@cf03:~# rbd bench-write t2
> >>>> bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq
> >>>> SEC     OPS  OPS/SEC    BYTES/SEC
> >>>>   1    4221  4220.64  32195919.35
> >>>>   2    9628  4813.95  36286083.00
> >>>>   3   15288  4790.90  35714620.49
> >>>>   4   19610  4902.47  36626193.93
> >>>>   5   24844  4968.37  37296562.14
> >>>>   6   30488  5081.31  38112444.88
> >>>>   7   36152  5164.54  38601615.10
> >>>>   8   41479  5184.80  38860207.38
> >>>>   9   46971  5218.70  39181437.52
> >>>>  10   52219  5221.77  39322641.34
> >>>>  11   56666  5151.36  38761566.30
> >>>>  12   62073  5172.71  38855021.35
> >>>>  13   65962  5073.95  38182880.49
> >>>>  14   71541  5110.02  38431536.17
> >>>>  15   77039  5135.85  38615125.42
> >>>>  16   82133  5133.31  38692578.98
> >>>>  17   87657  5156.24  38849948.84
> >>>>  18   92943  5141.03  38635464.85
> >>>>  19   97528  5133.03  38628548.32
> >>>>  20  103100  5154.99  38751359.30
> >>>>  21  108952  5188.09  38944016.94
> >>>>  22  114511  5205.01  38999594.18
> >>>>  23  120319  5231.17  39138227.64
> >>>>  24  125975  5248.92  39195739.46
> >>>>  25  131438  5257.50  39259023.06
> >>>>  26  136883  5264.72  39344673.41
> >>>>  27  142362  5272.66  39381638.20
> >>>> elapsed: 27 ops: 143789 ops/sec: 5273.01 bytes/sec: 39376124.30
> >>>>
> >>>> rados bench gives:
> >>>>
> >>>> root@cf03:~# rados -p rbd bench 30 write --no-cleanup
> >>>> Maintaining 16 concurrent writes of 4194304 bytes for up to 30 seconds
> >>>> or 0 objects
> >>>> Object prefix: benchmark_data_cf03_21194
> >>>> sec Cur ops started finished avg MB/s cur MB/s last lat  avg lat
> >>>>   0       0       0        0        0        0        -        0
> >>>>   1      16      28       12  47.9863       48 0.779211  0.48964
> >>>>   2      16      43       27  53.9886       60  1.17958 0.775733
> >>>>   3      16      59       43   57.322       64 0.157145 0.798348
> >>>>   4      16      73       57  56.9897       56 0.424493 0.862553
> >>>>   5      16      89       73    58.39       64 0.246444 0.893064
> >>>>   6      16     104       88  58.6569       60  1.67389 0.901757
> >>>>   7      16     120      104  59.4186       64  1.78324 0.935242
> >>>>   8      16     132      116  57.9905       48  1.50035 0.963947
> >>>>   9      16     147      131  58.2128       60  1.85047 0.978697
> >>>>  10      16     161      145  57.9908       56 0.133187 0.999999
> >>>>  11      16     174      158  57.4455       52  1.59548  1.02264
> >>>>  12      16     189      173  57.6577       60 0.179966  1.01623
> >>>>  13      16     206      190  58.4526       68  1.93064  1.02108
> >>>>  14      16     221      205  58.5624       60  1.54504  1.02566
> >>>>  15      16     236      220  58.6578       60  1.69023   1.0301
> >>>>  16      16     251      235  58.7411       60   1.5683  1.02514
> >>>>  17      16     263      247  58.1089       48  1.99782   1.0293
> >>>>  18      16     278      262  58.2136       60  2.03487  1.03552
> >>>>  19      16     295      279  58.7282       68 0.292065  1.03412
> >>>>  20      16     310      294  58.7913       60  1.61331   1.0436
> >>>>  21      16     323      307  58.4675       52 0.161555  1.04393
> >>>>  22      16     335      319  57.9914       48  1.55905  1.05392
> >>>>  23      16     351      335  58.2523       64 0.317811  1.04937
> >>>>  24      16     369      353  58.8247       72  1.76145  1.05415
> >>>>  25      16     383      367  58.7114       56  1.25224  1.05758
> >>>>  26      16     399      383  58.9145       64  1.46604  1.05593
> >>>>  27      16     414      398  58.9544       60 0.349479  1.04213
> >>>>  28      16     431      415  59.2771       68  0.74857  1.04895
> >>>>  29      16     448      432  59.5776       68  1.16596  1.04986
> >>>>  30      16     464      448  59.7247       64 0.195269  1.04202
> >>>>  31      16     465      449  57.9271        4  1.25089  1.04249
> >>>> Total time run: 31.407987
> >>>> Total writes made: 465
> >>>> Write size: 4194304
> >>>> Bandwidth (MB/sec): 59.221
> >>>>
> >>>> Stddev Bandwidth: 15.5579
> >>>> Max bandwidth (MB/sec): 72
> >>>> Min bandwidth (MB/sec): 0
> >>>> Average Latency: 1.07412
> >>>> Stddev Latency: 0.691676
> >>>> Max latency: 2.52896
> >>>> Min latency: 0.113751
> >>>>
> >>>> and reading:
> >>>>
> >>>> root@cf03:/ceph/tmp# rados -p rbd bench 30 rand
> >>>> sec Cur ops started finished avg MB/s cur MB/s last lat  avg lat
> >>>>   0       0       0        0        0        0        -        0
> >>>>   1      16      43       27  107.964      108 0.650441 0.415883
> >>>>   2      16      71       55  109.972      112 0.624493 0.485735
> >>>>   3      16     100       84  111.975      116  0.77036 0.518524
> >>>>   4      16     128      112  111.977      112 0.329123 0.522431
> >>>>   5      16     155      139  111.179      108 0.702401 0.538305
> >>>>   6      16     184      168  111.979      116   0.7502 0.543431
> >>>>   7      16     213      197  112.551      116  0.46755 0.547047
> >>>>   8      16     240      224  111.981      108 0.430872 0.548855
> >>>>   9      16     268      252  111.981      112 0.740558 0.550753
> >>>>  10      16     297      281  112.381      116 0.340352 0.551335
> >>>>  11      16     325      309  112.345      112  1.14164 0.544646
> >>>>  12      16     353      337  112.315      112  0.46038 0.555206
> >>>>  13      16     382      366  112.597      116 0.727224 0.556029
> >>>>  14      16     410      394  112.553      112 0.673523 0.557172
> >>>>  15      16     438      422  112.516      112 0.543171 0.558385
> >>>>  16      16     466      450  112.482      112 0.370119 0.557367
> >>>>  17      16     494      478  112.453      112  0.89322 0.556681
> >>>>  18      16     522      506  112.427      112 0.651126 0.559601
> >>>>  19      16     551      535  112.614      116 0.801207  0.55739
> >>>>  20      16     579      563  112.583      112  0.92365 0.558744
> >>>>  21      16     607      591  112.554      112 0.679443  0.55983
> >>>>  22      16     635      619  112.528      112 0.273806 0.557695
> >>>>  23      16     664      648  112.679      116  0.33258 0.559718
> >>>>  24      15     691      676   112.65      112 0.141288 0.559192
> >>>>  25      16     720      704  112.623      112 0.901803 0.559435
> >>>>  26      16     748      732  112.598      112 0.807202 0.559793
> >>>>  27      16     776      760  112.576      112 0.747424 0.561044
> >>>>  28      16     805      789  112.698      116 0.817418 0.560835
> >>>>  29      16     833      817  112.673      112 0.711397 0.562342
> >>>>  30      16     861      845   112.65      112 0.520696 0.562809
> >>>> Total time run: 30.547818
> >>>> Total reads made: 861
> >>>> Read size: 4194304
> >>>> Bandwidth (MB/sec): 112.741
> >>>>
> >>>> Average Latency: 0.566574
> >>>> Max latency: 1.2147
> >>>> Min latency: 0.06128
> >>>>
> >>>>
> >>>> so.. in order to increase performance, do I need to change the ssd
> >>>> drives?
> >>>>
> >>>> J
> >>>>
> >>>> --
> >>>> Jacek Jarosiewicz
> >>>> IT Systems Administrator
> >>>>
> >>>> ----------------------------------------------------------------------------------------
> >>>> SUPERMEDIA Sp. z o.o., registered office in Warsaw
> >>>> ul. Senatorska 13/15, 00-075 Warszawa
> >>>> District Court for the Capital City of Warsaw, 12th Commercial Division
> >>>> of the National Court Register,
> >>>> KRS no. 0000029537; share capital 42.756.000 zł
> >>>> NIP: 957-05-49-503
> >>>> Mailing address: ul. Jubilerska 10, 04-190 Warszawa
> >>>>
> >>>> ----------------------------------------------------------------------------------------
> >>>> SUPERMEDIA -> http://www.supermedia.pl
> >>>> internet access - hosting - colocation - leased lines - telephony
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-users@xxxxxxxxxxxxxx
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com